org.apache.tika.parser.microsoft.ooxml
Class AbstractOOXMLExtractor

java.lang.Object
  extended by org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
All Implemented Interfaces:
OOXMLExtractor
Direct Known Subclasses:
POIXMLTextExtractorDecorator, XSLFPowerPointExtractorDecorator, XSSFExcelExtractorDecorator, XWPFWordExtractorDecorator

public abstract class AbstractOOXMLExtractor
extends Object
implements OOXMLExtractor

Base class for all Tika OOXML extractors. Tika extractors decorate POI extractors so that the parsed content of documents is returned as a sequence of XHTML SAX events. Subclasses must implement buildXHTML(XHTMLContentHandler), which populates the XHTMLContentHandler object received as a parameter.
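
The relationship between getXHTML and buildXHTML described above resembles a template method. The following is a minimal sketch of that idea using only the SAX classes that ship with the JDK. SketchExtractor, ParagraphExtractor, Recorder and render are hypothetical stand-ins invented for illustration; they are not Tika classes, and the real base class passes an XHTMLContentHandler (not a bare ContentHandler) and may frame the document differently.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class TemplateSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in for the base class; NOT the Tika implementation. */
    abstract static class SketchExtractor {
        /** Frames the document, then delegates the body to the subclass. */
        public final void getXHTML(ContentHandler handler) throws SAXException {
            handler.startDocument();
            handler.startElement(XHTML, "body", "body", new AttributesImpl());
            buildXHTML(handler);
            handler.endElement(XHTML, "body", "body");
            handler.endDocument();
        }

        /** Subclasses emit the format-specific content here. */
        protected abstract void buildXHTML(ContentHandler xhtml) throws SAXException;
    }

    /** Hypothetical subclass that emits one p element per paragraph. */
    static class ParagraphExtractor extends SketchExtractor {
        private final String[] paragraphs;

        ParagraphExtractor(String... paragraphs) {
            this.paragraphs = paragraphs;
        }

        @Override
        protected void buildXHTML(ContentHandler xhtml) throws SAXException {
            for (String p : paragraphs) {
                xhtml.startElement(XHTML, "p", "p", new AttributesImpl());
                xhtml.characters(p.toCharArray(), 0, p.length());
                xhtml.endElement(XHTML, "p", "p");
            }
        }
    }

    /** Records the received SAX events as simple markup for inspection. */
    static class Recorder extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(localName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(localName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Runs the hypothetical extractor and returns the recorded markup. */
    static String render(String... paragraphs) {
        Recorder recorder = new Recorder();
        try {
            new ParagraphExtractor(paragraphs).getXHTML(recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("Hello", "World"));
    }
}
```

In this sketch the base class owns the document framing, so each subclass only has to describe its own content.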


Field Summary
protected  org.apache.poi.POIXMLTextExtractor extractor
           
 
Constructor Summary
AbstractOOXMLExtractor(ParseContext context, org.apache.poi.POIXMLTextExtractor extractor)
           
 
Method Summary
protected abstract  void buildXHTML(XHTMLContentHandler xhtml)
          Populates the XHTMLContentHandler object received as a parameter.
 org.apache.poi.POIXMLDocument getDocument()
          Returns the opened document.
protected abstract  List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts()
          Returns a list of the main parts of the document, used when searching for embedded resources.
 MetadataExtractor getMetadataExtractor()
          Returns a MetadataExtractor for the document; POIXMLTextExtractor.getMetadataTextExtractor() is not yet supported for OOXML by POI.
 void getXHTML(ContentHandler handler, Metadata metadata, ParseContext context)
          Parses the document into a sequence of XHTML SAX events sent to the given content handler.
protected  void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part, ContentHandler handler)
          Handles an embedded file in the document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

extractor

protected org.apache.poi.POIXMLTextExtractor extractor

Constructor Detail

AbstractOOXMLExtractor

public AbstractOOXMLExtractor(ParseContext context,
                              org.apache.poi.POIXMLTextExtractor extractor)

Method Detail

getDocument

public org.apache.poi.POIXMLDocument getDocument()
Description copied from interface: OOXMLExtractor
Returns the opened document.

Specified by:
getDocument in interface OOXMLExtractor
See Also:
OOXMLExtractor.getDocument()

getMetadataExtractor

public MetadataExtractor getMetadataExtractor()
Description copied from interface: OOXMLExtractor
Returns a MetadataExtractor for the document; POIXMLTextExtractor.getMetadataTextExtractor() is not yet supported for OOXML by POI.

Specified by:
getMetadataExtractor in interface OOXMLExtractor
See Also:
OOXMLExtractor.getMetadataExtractor()

getXHTML

public void getXHTML(ContentHandler handler,
                     Metadata metadata,
                     ParseContext context)
              throws SAXException,
                     org.apache.xmlbeans.XmlException,
                     IOException,
                     TikaException
Description copied from interface: OOXMLExtractor
Parses the document into a sequence of XHTML SAX events sent to the given content handler.

Specified by:
getXHTML in interface OOXMLExtractor
Throws:
SAXException
org.apache.xmlbeans.XmlException
IOException
TikaException
See Also:
OOXMLExtractor.getXHTML(ContentHandler, Metadata, ParseContext)
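
On the consuming side, the content handler passed to getXHTML receives ordinary SAX callbacks. The sketch below shows one way a caller might collect plain text from such events, using only JDK SAX classes. TextCollector, emitParagraph and demo are hypothetical names; the events are fired by hand here because constructing a real extractor requires Tika and POI on the classpath, and the element names a real extractor emits are an assumption.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class TextCollectorSketch {

    /** Hypothetical handler that gathers character data, one line per p element. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Treat the end of a paragraph as a line break.
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Fires the events for a single paragraph; stands in for a real extractor. */
    static void emitParagraph(ContentHandler handler, String content) throws SAXException {
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(content.toCharArray(), 0, content.length());
        handler.endElement("", "p", "p");
    }

    /** Drives the collector with two hand-made paragraphs. */
    static String demo() {
        TextCollector collector = new TextCollector();
        try {
            collector.startDocument();
            emitParagraph(collector, "first");
            emitParagraph(collector, "second");
            collector.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

With a real extractor, the same handler would presumably be passed as the first argument of getXHTML, and the caller would need to handle the four checked exceptions listed above.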

handleEmbeddedFile

protected void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part,
                                  ContentHandler handler)
                           throws SAXException,
                                  IOException
Handles an embedded file in the document.

Throws:
SAXException
IOException

buildXHTML

protected abstract void buildXHTML(XHTMLContentHandler xhtml)
                            throws SAXException,
                                   org.apache.xmlbeans.XmlException,
                                   IOException
Populates the XHTMLContentHandler object received as a parameter.

Throws:
SAXException
org.apache.xmlbeans.XmlException
IOException

getMainDocumentParts

protected abstract List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts()
                                                                                throws TikaException
Returns a list of the main parts of the document, used when searching for embedded resources. The list should include every part of the document that can have resources embedded into it.

Throws:
TikaException
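
To illustrate how a list of main parts could drive a search for embedded resources, here is a rough sketch with plain Java stand-ins. Part, findEmbedded, demo and the part names are all hypothetical; the sketch does not use POI's PackagePart or its relationship API, and it is not a description of how Tika actually performs the search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmbeddedSearchSketch {

    /** Hypothetical stand-in for a package part; NOT POI's PackagePart. */
    static class Part {
        final String name;
        final boolean embeddedResource;
        final List<Part> related; // parts this part refers to

        Part(String name, boolean embeddedResource, Part... related) {
            this.name = name;
            this.embeddedResource = embeddedResource;
            this.related = Arrays.asList(related);
        }
    }

    /** Collects the names of embedded resources referenced by the main parts. */
    static List<String> findEmbedded(List<Part> mainParts) {
        List<String> found = new ArrayList<>();
        for (Part main : mainParts) {
            for (Part target : main.related) {
                if (target.embeddedResource) {
                    found.add(target.name);
                }
            }
        }
        return found;
    }

    /** Builds a small illustrative package and searches it. */
    static List<String> demo() {
        Part image = new Part("/media/image1.png", true);
        Part object = new Part("/embeddings/object1.bin", true);
        Part styles = new Part("/other/styles.xml", false);
        Part first = new Part("/main/part1.xml", false, image, styles);
        Part second = new Part("/main/part2.xml", false, object);
        return findEmbedded(Arrays.asList(first, second));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is that only parts returned as main parts are inspected, so a subclass that omits a part from the list would leave that part's embedded resources undiscovered.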


Copyright © 2007-2012 The Apache Software Foundation. All Rights Reserved.