Class AbstractOOXMLExtractor
java.lang.Object
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
- All Implemented Interfaces:
OOXMLExtractor
- Direct Known Subclasses:
SXSLFPowerPointExtractorDecorator, SXWPFWordExtractorDecorator, VSDXExtractorDecorator, XPSExtractorDecorator, XSSFExcelExtractorDecorator
Base class for all Tika OOXML extractors.
Tika extractors decorate POI extractors so that the parsed content of
documents is returned as a sequence of XHTML SAX events. Subclasses must
implement buildXHTML(XHTMLContentHandler), which populates the
XHTMLContentHandler object received as a parameter.
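The division of labour described above (a base class that drives the SAX event stream and a subclass hook that fills in the body) can be sketched with nothing but the JDK's org.xml.sax types. This is a minimal illustration of the pattern, not Tika's implementation: SketchExtractor, ParagraphSketchExtractor, RecordingHandler, SketchDemo, emit and buildBody are hypothetical names introduced here, and a plain ContentHandler stands in for XHTMLContentHandler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the base class: it owns the document envelope
// and delegates the body to a subclass hook (template-method pattern).
abstract class SketchExtractor {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Plays the role that getXHTML(...) plays in the text above.
    final void emit(ContentHandler handler) throws SAXException {
        handler.startDocument();
        start(handler, "html");
        start(handler, "body");
        buildBody(handler);
        end(handler, "body");
        end(handler, "html");
        handler.endDocument();
    }

    // Plays the role of buildXHTML(...): subclasses write the content here.
    protected abstract void buildBody(ContentHandler handler) throws SAXException;

    protected static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML_NS, name, name, new AttributesImpl());
    }

    protected static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML_NS, name, name);
    }

    protected static void text(ContentHandler h, String s) throws SAXException {
        char[] chars = s.toCharArray();
        h.characters(chars, 0, chars.length);
    }
}

// Hypothetical subclass: emits one <p> element per paragraph.
class ParagraphSketchExtractor extends SketchExtractor {
    private final List<String> paragraphs;

    ParagraphSketchExtractor(List<String> paragraphs) {
        this.paragraphs = paragraphs;
    }

    @Override
    protected void buildBody(ContentHandler handler) throws SAXException {
        for (String paragraph : paragraphs) {
            start(handler, "p");
            text(handler, paragraph);
            end(handler, "p");
        }
    }
}

// Records the SAX events it receives so the sequence can be inspected.
class RecordingHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("<" + localName + ">");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("</" + localName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add(new String(ch, start, length));
    }
}

class SketchDemo {
    // Runs the sketch extractor and returns the recorded event sequence.
    static List<String> run(List<String> paragraphs) {
        RecordingHandler handler = new RecordingHandler();
        try {
            new ParagraphSketchExtractor(paragraphs).emit(handler);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        // prints [<html>, <body>, <p>, Hello, </p>, <p>, World, </p>, </body>, </html>]
        System.out.println(run(Arrays.asList("Hello", "World")));
    }
}
```

Keeping the envelope in the base class means each format-specific subclass only has to describe its own body content, which is the shape the class description suggests for the concrete decorators.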
-
Field Summary
Fields
Modifier and Type, Field, Description
protected OfficeParserConfig config
protected static final String[] EMBEDDED_RELATIONSHIPS
protected org.apache.poi.openxml4j.opc.OPCPackage opcPackage
-
Constructor Summary
Constructors
Constructor, Description
AbstractOOXMLExtractor(ParseContext context, org.apache.poi.openxml4j.opc.OPCPackage opcPackage)
-
Method Summary
Modifier and Type, Method, Description
protected abstract void buildXHTML(XHTMLContentHandler xhtml)
    Populates the XHTMLContentHandler object received as parameter.
protected Map<String,EmbeddedPartMetadata> getEmbeddedPartMetadataMap
protected String getJustFileName(String desc)
protected abstract List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts()
    Return a list of the main parts of the document, used when searching for embedded resources.
void getXHTML(ContentHandler handler, Metadata metadata, ParseContext context)
    Parses the document into a sequence of XHTML SAX events sent to the given content handler.
protected void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part, XHTMLContentHandler xhtml, String rel, EmbeddedPartMetadata embeddedPartMetadata, TikaCoreProperties.EmbeddedResourceType embeddedResourceType)
    Handles an embedded file in the document.
protected Map<String,String> loadLinkedRelationships(org.apache.poi.openxml4j.opc.PackagePart bodyPart, boolean includeInternal, Metadata metadata)
    This is used by the SAX docx and pptx decorators to load hyperlinks and other linked objects.
static org.apache.poi.openxml4j.opc.PackagePart safeGetRelatedPart(org.apache.poi.openxml4j.opc.PackagePart source, org.apache.poi.openxml4j.opc.PackageRelationship relationship)
    Safely resolves a related part, returning null if the part cannot be found instead of throwing IllegalArgumentException.
-
Field Details
-
EMBEDDED_RELATIONSHIPS
-
config
-
opcPackage
protected org.apache.poi.openxml4j.opc.OPCPackage opcPackage
-
-
Constructor Details
-
AbstractOOXMLExtractor
public AbstractOOXMLExtractor(ParseContext context, org.apache.poi.openxml4j.opc.OPCPackage opcPackage)
-
-
Method Details
-
getMetadataExtractor
- Specified by:
getMetadataExtractor in interface OOXMLExtractor
-
getXHTML
public void getXHTML(ContentHandler handler, Metadata metadata, ParseContext context) throws SAXException, IOException, TikaException
Description copied from interface: OOXMLExtractor
Parses the document into a sequence of XHTML SAX events sent to the given content handler.
- Specified by:
getXHTML in interface OOXMLExtractor
- Throws:
SAXException
IOException
TikaException
-
getEmbeddedPartMetadataMap
-
getJustFileName
-
handleEmbeddedFile
protected void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part, XHTMLContentHandler xhtml, String rel, EmbeddedPartMetadata embeddedPartMetadata, TikaCoreProperties.EmbeddedResourceType embeddedResourceType) throws SAXException, IOException, TikaException
Handles an embedded file in the document.
- Throws:
SAXException
IOException
TikaException
-
buildXHTML
Populates the XHTMLContentHandler object received as parameter.
- Throws:
SAXException
IOException
-
getMainDocumentParts
protected abstract List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts() throws TikaException
Return a list of the main parts of the document, used when searching for embedded resources. This should be all the parts of the document that end up with things embedded into them.
- Throws:
TikaException
-
loadLinkedRelationships
protected Map<String,String> loadLinkedRelationships(org.apache.poi.openxml4j.opc.PackagePart bodyPart, boolean includeInternal, Metadata metadata)
This is used by the SAX docx and pptx decorators to load hyperlinks and other linked objects.
- Parameters:
bodyPart -
- Returns:
-
safeGetRelatedPart
public static org.apache.poi.openxml4j.opc.PackagePart safeGetRelatedPart(org.apache.poi.openxml4j.opc.PackagePart source, org.apache.poi.openxml4j.opc.PackageRelationship relationship) throws org.apache.poi.openxml4j.exceptions.InvalidFormatException
Safely resolves a related part, returning null if the part cannot be found instead of throwing IllegalArgumentException.
- Throws:
org.apache.poi.openxml4j.exceptions.InvalidFormatException
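The contract stated above (null for a missing part rather than an IllegalArgumentException) is a common wrapper pattern. The sketch below illustrates it with a plain Map in place of an OPC package; SafeLookupSketch, strictLookup and safeLookup are hypothetical names, and nothing here is meant to describe how safeGetRelatedPart is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

class SafeLookupSketch {
    // Hypothetical strict lookup: throws when the requested part is absent.
    static String strictLookup(Map<String, String> parts, String id) {
        String target = parts.get(id);
        if (target == null) {
            throw new IllegalArgumentException("No part found for relationship " + id);
        }
        return target;
    }

    // Hypothetical safe variant: converts the "not found" exception into null.
    static String safeLookup(Map<String, String> parts, String id) {
        try {
            return strictLookup(parts, id);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> parts = new HashMap<>();
        parts.put("rId1", "/word/media/image1.png");

        String found = safeLookup(parts, "rId1");
        String missing = safeLookup(parts, "rId9");

        // Callers of a null-returning lookup must check before using the result.
        System.out.println(found != null ? found : "(skipped)");
        System.out.println(missing != null ? missing : "(skipped)");
    }
}
```

With this contract a dangling relationship becomes something the caller can skip with a null check, instead of an unchecked exception that has to be caught.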
-