Class SXWPFWordExtractorDecorator
java.lang.Object
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
org.apache.tika.parser.microsoft.ooxml.SXWPFWordExtractorDecorator
All Implemented Interfaces: OOXMLExtractor
This is an experimental, alternative extractor for docx files.
It streams the main document content rather than loading the
full document into memory.
It will be better than the classic docx extractor for some use cases and worse for others.
Since: 1.15
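The streaming approach described above can be illustrated with a minimal sketch that uses only the Java standard library. It is not Tika's or POI's implementation. The class name StreamingDocxTextSketch and the methods extractText and sampleDocx are hypothetical. The sketch assumes only that a docx file is a ZIP package whose main part is stored as word/document.xml, with text runs held in w:t elements. It pushes that part through a SAX parser, so no document tree is ever built in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika or POI.
public class StreamingDocxTextSketch {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams word/document.xml through SAX; returns its text, one line per paragraph. */
    public static String extractText(InputStream docx) {
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main document part lives at this path.
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.newSAXParser().parse(new InputSource(zip), new TextHandler(out));
                break; // the main part has been consumed; stop reading the package
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    /** Collects the character data of w:t elements and ends a line at each w:p. */
    private static final class TextHandler extends DefaultHandler {
        private final StringBuilder out;
        private boolean inText;

        TextHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(local)) {
                inText = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(local)) {
                inText = false;
            } else if ("p".equals(local)) {
                out.append('\n');
            }
        }
    }

    /** Builds a tiny in-memory package with one paragraph, for demonstration only. */
    public static InputStream sampleDocx() {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ByteArrayInputStream(bytes.toByteArray());
    }

    public static void main(String[] args) {
        System.out.print(extractText(sampleDocx()));
    }
}
```

The sketch reads only the main document part. Headers, footers, comments and embedded objects, which this class also has to account for, are ignored here.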
Field Summary
Fields inherited from class org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
config, EMBEDDED_RELATIONSHIPS, extractor
Constructor Summary
Constructors
SXWPFWordExtractorDecorator(Metadata metadata, ParseContext context, XWPFEventBasedWordExtractor extractor)
Method Summary
protected void buildXHTML(XHTMLContentHandler xhtml): Populates the XHTMLContentHandler object received as parameter.
protected List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts(): This returns all items that might contain embedded objects: main document, headers, footers, comments, etc.
Methods inherited from class org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
getDocument, getEmbeddedPartMetadataMap, getJustFileName, getMetadataExtractor, getXHTML, handleEmbeddedFile, loadLinkedRelationships
Constructor Details
SXWPFWordExtractorDecorator
public SXWPFWordExtractorDecorator(Metadata metadata, ParseContext context, XWPFEventBasedWordExtractor extractor)
Method Details
buildXHTML
protected void buildXHTML(XHTMLContentHandler xhtml) throws SAXException, org.apache.xmlbeans.XmlException, IOException
Description copied from class: AbstractOOXMLExtractor
Populates the XHTMLContentHandler object received as parameter.
Specified by: buildXHTML in class AbstractOOXMLExtractor
Throws: SAXException, org.apache.xmlbeans.XmlException, IOException
getMainDocumentParts
This returns all items that might contain embedded objects: main document, headers, footers, comments, etc.
Specified by: getMainDocumentParts in class AbstractOOXMLExtractor