Class SXWPFWordExtractorDecorator
- java.lang.Object
  - org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
    - org.apache.tika.parser.microsoft.ooxml.SXWPFWordExtractorDecorator
- All Implemented Interfaces: OOXMLExtractor
public class SXWPFWordExtractorDecorator extends AbstractOOXMLExtractor
This is an experimental, alternative extractor for docx files. It streams the main document content rather than loading the full document into memory. It will work better than the classic docx extractor for some use cases and worse for others.
- Since:
- 1.15
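The streaming approach described above can be illustrated with a minimal sketch that uses only the JDK, not Tika or POI. It reads the word/document.xml entry of a docx (zip) package with a SAX parser and collects the text of w:t elements as events arrive, so no in-memory document tree is built. The class StreamingDocxSketch, its methods, and the sample package are hypothetical and are not part of SXWPFWordExtractorDecorator; the sample is a bare zip with a single part, not a valid Word file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the Tika implementation.
final class StreamingDocxSketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams word/document.xml and returns the text, one paragraph per line. */
    static String extractText(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    StringBuilder out = new StringBuilder();
                    SAXParserFactory factory = SAXParserFactory.newInstance();
                    factory.setNamespaceAware(true);
                    factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                    // The parser pulls bytes from the zip entry as it goes.
                    factory.newSAXParser().parse(zip, new TextHandler(out));
                    return out.toString();
                }
            }
            return "";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends the content of w:t elements and a newline at the end of each w:p. */
    private static final class TextHandler extends DefaultHandler {
        private final StringBuilder out;
        private boolean inText;

        TextHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                inText = false;
            } else if ("p".equals(localName)) {
                out.append('\n');
            }
        }
    }

    /** Builds a tiny zip holding only a main document part (assumed sample data). */
    static byte[] sampleDocx() {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.print(extractText(new ByteArrayInputStream(sampleDocx())));
    }
}
```

The trade-off mentioned in the class description shows up even here: the handler sees one event at a time, so anything that needs to look ahead or backward in the document has to be tracked by hand.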
Field Summary
-
Fields inherited from class org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
config, EMBEDDED_RELATIONSHIPS, extractor
Constructor Summary
Constructors
- SXWPFWordExtractorDecorator(Metadata metadata, ParseContext context, XWPFEventBasedWordExtractor extractor)
-
Method Summary
- protected void buildXHTML(XHTMLContentHandler xhtml): Populates the XHTMLContentHandler object received as parameter.
- protected List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts(): Returns all items that might contain embedded objects: main document, headers, footers, comments, etc.
Methods inherited from class org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
getDocument, getEmbeddedPartMetadataMap, getJustFileName, getMetadataExtractor, getXHTML, handleEmbeddedFile, loadLinkedRelationships
Constructor Detail
-
SXWPFWordExtractorDecorator
public SXWPFWordExtractorDecorator(Metadata metadata, ParseContext context, XWPFEventBasedWordExtractor extractor)
Method Detail
-
buildXHTML
protected void buildXHTML(XHTMLContentHandler xhtml) throws SAXException, org.apache.xmlbeans.XmlException, IOException
Description copied from class: AbstractOOXMLExtractor
Populates the XHTMLContentHandler object received as parameter.
- Specified by: buildXHTML in class AbstractOOXMLExtractor
- Throws:
SAXException
org.apache.xmlbeans.XmlException
IOException
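As a rough illustration of what "populating" a handler means, the sketch below pushes SAX events into a plain org.xml.sax.ContentHandler, which is the kind of handler that Tika's XHTMLContentHandler wraps. It uses only the JDK; PopulateHandlerSketch, buildParagraphs, and render are hypothetical names, and the recording handler is a stand-in that does no escaping and is not how Tika serializes output.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the Tika implementation.
final class PopulateHandlerSketch {

    /** Emits one p element per paragraph into the handler supplied by the caller. */
    static void buildParagraphs(ContentHandler handler, List<String> paragraphs)
            throws SAXException {
        for (String text : paragraphs) {
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] chars = text.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
        }
    }

    /** Records the events as a string so the result can be inspected (no escaping). */
    static String render(List<String> paragraphs) {
        StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append('>');
            }
        };
        try {
            buildParagraphs(recorder, paragraphs);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList("Hello", "World")));
    }
}
```

The method writes into a handler it is given rather than returning a value, which matches the void return type and the SAXException in the signature above.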
-
getMainDocumentParts
protected List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts()
Returns all items that might contain embedded objects: main document, headers, footers, comments, etc.
- Specified by: getMainDocumentParts in class AbstractOOXMLExtractor
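To make the list of candidate parts concrete, here is a minimal JDK-only sketch that scans a docx (zip) package for entries that conventionally hold the main document, headers, footers, and comments. It is an assumption-laden stand-in: the real method returns POI PackagePart objects, whereas this sketch matches conventional entry names such as word/header1.xml, which the format does not mandate. Including footnotes and endnotes under "etc." is also an assumption, and CandidatePartsSketch and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; not the Tika implementation.
final class CandidatePartsSketch {

    // Conventional part names (assumed); footnotes and endnotes are a guess at "etc.".
    private static final Pattern CANDIDATE = Pattern.compile(
            "word/(document|header\\d*|footer\\d*|comments|footnotes|endnotes)\\.xml");

    /** Returns matching entry names in the order they appear in the package. */
    static List<String> candidateParts(byte[] docx) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (CANDIDATE.matcher(entry.getName()).matches()) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a zip with placeholder parts (assumed sample data, not a valid docx). */
    static byte[] samplePackage() {
        String[] entries = {
            "[Content_Types].xml", "word/document.xml", "word/styles.xml",
            "word/header1.xml", "word/footer1.xml", "word/comments.xml",
            "word/media/image1.png"
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entries) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(candidateParts(samplePackage()));
    }
}
```

Collecting every such part, not just the main document, matters because an embedded object referenced only from a header or a comment would otherwise be missed.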