Class AbstractOOXMLExtractor
- java.lang.Object
  - org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
- All Implemented Interfaces: OOXMLExtractor
- Direct Known Subclasses: POIXMLTextExtractorDecorator, SXSLFPowerPointExtractorDecorator, SXWPFWordExtractorDecorator, XPSExtractorDecorator, XSLFPowerPointExtractorDecorator, XSSFExcelExtractorDecorator, XWPFWordExtractorDecorator
public abstract class AbstractOOXMLExtractor extends Object implements OOXMLExtractor
Base class for all Tika OOXML extractors. Tika extractors decorate POI extractors so that the parsed content of documents is returned as a sequence of XHTML SAX events. Subclasses must implement the buildXHTML(XHTMLContentHandler) method, which populates the XHTMLContentHandler object received as a parameter.
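The division of labour described above (the base class drives the document-level events, the subclass fills in the body) can be sketched with nothing but the JDK's SAX classes. This is a simplified analogue, not Tika's implementation: SketchExtractor, ParagraphSketchExtractor, RecordingHandler and XhtmlSketchDemo are hypothetical names, a plain ContentHandler stands in for XHTMLContentHandler, and a list of strings stands in for the wrapped POI extractor.

```java
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the base class: it owns the document-level
// events and delegates the body to subclasses.
abstract class SketchExtractor {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Analogue of getXHTML: wraps whatever buildXHTML emits in html/body.
    public final void getXHTML(ContentHandler handler) throws SAXException {
        handler.startDocument();
        start(handler, "html");
        start(handler, "body");
        buildXHTML(handler);
        end(handler, "body");
        end(handler, "html");
        handler.endDocument();
    }

    // Analogue of buildXHTML: subclasses emit the document body here.
    protected abstract void buildXHTML(ContentHandler handler) throws SAXException;

    protected static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML_NS, name, name, new AttributesImpl());
    }

    protected static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML_NS, name, name);
    }
}

// Hypothetical subclass: the paragraph list stands in for the text that a
// wrapped POI extractor would provide.
class ParagraphSketchExtractor extends SketchExtractor {
    private final List<String> paragraphs;

    ParagraphSketchExtractor(List<String> paragraphs) {
        this.paragraphs = paragraphs;
    }

    @Override
    protected void buildXHTML(ContentHandler handler) throws SAXException {
        for (String text : paragraphs) {
            start(handler, "p");
            handler.characters(text.toCharArray(), 0, text.length());
            end(handler, "p");
        }
    }
}

// Records the received events as a flat string so they are easy to inspect.
class RecordingHandler extends DefaultHandler {
    final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(localName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(localName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }
}

class XhtmlSketchDemo {
    // Runs the sketch extractor and returns the recorded event stream.
    static String render(List<String> paragraphs) {
        RecordingHandler handler = new RecordingHandler();
        try {
            new ParagraphSketchExtractor(paragraphs).getXHTML(handler);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        // prints <html><body><p>Hello</p><p>World</p></body></html>
        System.out.println(render(List.of("Hello", "World")));
    }
}
```

The point of the shape is that a new document type only needs a new buildXHTML; the surrounding event sequence stays in one place.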
-
-
Field Summary
Fields (Modifier and Type, Field):
- protected OfficeParserConfig config
- protected static String[] EMBEDDED_RELATIONSHIPS
- protected org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor
-
Constructor Summary
Constructors:
- AbstractOOXMLExtractor(ParseContext context, org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor)
-
Method Summary
All Methods (Modifier and Type, Method, Description):
- protected abstract void buildXHTML(XHTMLContentHandler xhtml)
  Populates the XHTMLContentHandler object received as parameter.
- org.apache.poi.ooxml.POIXMLDocument getDocument()
  Returns the opened document.
- protected Map<String,EmbeddedPartMetadata> getEmbeddedPartMetadataMap()
- protected String getJustFileName(String desc)
- protected abstract List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts()
  Returns a list of the main parts of the document, used when searching for embedded resources.
- MetadataExtractor getMetadataExtractor()
  POIXMLTextExtractor.getMetadataTextExtractor() is not yet supported for OOXML by POI.
- void getXHTML(ContentHandler handler, Metadata metadata, ParseContext context)
  Parses the document into a sequence of XHTML SAX events sent to the given content handler.
- protected void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part, XHTMLContentHandler xhtml, String rel, EmbeddedPartMetadata embeddedPartMetadata, TikaCoreProperties.EmbeddedResourceType embeddedResourceType)
  Handles an embedded file in the document.
- protected Map<String,String> loadLinkedRelationships(org.apache.poi.openxml4j.opc.PackagePart bodyPart, boolean includeInternal, Metadata metadata)
  Used by the SAX docx and pptx decorators to load hyperlinks and other linked objects.
-
-
-
Field Detail
-
EMBEDDED_RELATIONSHIPS
protected static final String[] EMBEDDED_RELATIONSHIPS
-
config
protected OfficeParserConfig config
-
extractor
protected org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor
-
-
Constructor Detail
-
AbstractOOXMLExtractor
public AbstractOOXMLExtractor(ParseContext context, org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor)
-
-
Method Detail
-
getDocument
public org.apache.poi.ooxml.POIXMLDocument getDocument()
Description copied from interface: OOXMLExtractor
Returns the opened document.
- Specified by: getDocument in interface OOXMLExtractor
- See Also: OOXMLExtractor.getDocument()
-
getMetadataExtractor
public MetadataExtractor getMetadataExtractor()
Description copied from interface: OOXMLExtractor
POIXMLTextExtractor.getMetadataTextExtractor() is not yet supported for OOXML by POI.
- Specified by: getMetadataExtractor in interface OOXMLExtractor
- See Also: OOXMLExtractor.getMetadataExtractor()
-
getXHTML
public void getXHTML(ContentHandler handler, Metadata metadata, ParseContext context) throws SAXException, org.apache.xmlbeans.XmlException, IOException, TikaException
Description copied from interface: OOXMLExtractor
Parses the document into a sequence of XHTML SAX events sent to the given content handler.
- Specified by: getXHTML in interface OOXMLExtractor
- Throws: SAXException, org.apache.xmlbeans.XmlException, IOException, TikaException
- See Also: OOXMLExtractor.getXHTML(ContentHandler, Metadata, ParseContext)
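Because getXHTML pushes SAX events into whatever ContentHandler it is given, the caller decides what to do with them. The sketch below shows one possible consumer that collects plain text. TextCollectingHandler and HandlerSketchDemo are hypothetical names, and because no extractor is available here the JDK's own SAX parser replays a literal XHTML string as a stand-in event source; with a real extractor the same handler would be the first argument to getXHTML(handler, metadata, context).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Collects character data and ends a line after each paragraph element.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks; appending handles that.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    String getText() {
        return text.toString();
    }
}

class HandlerSketchDemo {
    // Stand-in event source (assumption): the JDK SAX parser replays the
    // given XHTML string into the handler.
    static String collect(String xhtml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        System.out.print(collect("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```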
-
getEmbeddedPartMetadataMap
protected Map<String,EmbeddedPartMetadata> getEmbeddedPartMetadataMap()
-
handleEmbeddedFile
protected void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part, XHTMLContentHandler xhtml, String rel, EmbeddedPartMetadata embeddedPartMetadata, TikaCoreProperties.EmbeddedResourceType embeddedResourceType) throws SAXException, IOException
Handles an embedded file in the document.
- Throws: SAXException, IOException
-
buildXHTML
protected abstract void buildXHTML(XHTMLContentHandler xhtml) throws SAXException, org.apache.xmlbeans.XmlException, IOException
Populates the XHTMLContentHandler object received as parameter.
- Throws: SAXException, org.apache.xmlbeans.XmlException, IOException
-
getMainDocumentParts
protected abstract List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts() throws TikaException
Returns a list of the main parts of the document, used when searching for embedded resources. This should include every part of the document that can end up with things embedded into it.
- Throws: TikaException
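Why the list has to cover every part that can hold embeddings can be illustrated with a toy model of the search. This is only a sketch under stated assumptions: SketchPart, SketchRelationship and EmbeddedSearchSketch are hypothetical stand-ins for POI's package parts and relationships, and the two relationship type URIs are example values, since the actual contents of EMBEDDED_RELATIONSHIPS are not listed on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for one relationship of a package part.
final class SketchRelationship {
    final String type;
    final String target;

    SketchRelationship(String type, String target) {
        this.type = type;
        this.target = target;
    }
}

// Hypothetical stand-in for a package part and its relationships.
final class SketchPart {
    final String name;
    final List<SketchRelationship> relationships;

    SketchPart(String name, List<SketchRelationship> relationships) {
        this.name = name;
        this.relationships = relationships;
    }
}

final class EmbeddedSearchSketch {
    static final String REL_BASE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/";

    // Assumed example values, not the documented contents of
    // EMBEDDED_RELATIONSHIPS.
    static final Set<String> EMBEDDED_TYPES =
            Set.of(REL_BASE + "oleObject", REL_BASE + "package");

    // Walks every main part and returns the targets of relationships whose
    // type marks an embedded resource, in encounter order.
    static List<String> findEmbeddedTargets(List<SketchPart> mainParts) {
        List<String> targets = new ArrayList<>();
        for (SketchPart part : mainParts) {
            for (SketchRelationship rel : part.relationships) {
                if (EMBEDDED_TYPES.contains(rel.type)) {
                    targets.add(rel.target);
                }
            }
        }
        return targets;
    }
}
```

In this model, a part left out of the list (a header, say) is never visited, so anything embedded in it would be silently skipped.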
-
-