Class OOXMLWordAndPowerPointTextHandler
java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler
- All Implemented Interfaces: ContentHandler, DTDHandler, EntityResolver, ErrorHandler
This class is intended to handle anything that might contain IBodyElements:
 main document, headers, footers, notes, slides, etc.
 
This class does not generally check for namespaces, and it can be applied to PPTX and DOCX for text extraction.
This can be used to scrape content from charts. It currently ignores formula (<c:f/>) elements.
This does not work with .xlsx or .vsdx.
TODO: move this into POI?
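
The idea of extracting text without checking namespaces can be illustrated with a small SAX handler. This is a minimal sketch using only the JDK, not the Tika implementation: the class name LocalNameTextSketch and its extract helper are hypothetical, and matching on the local names "t" and "p" is an assumption made for illustration. Because only local names are compared, the same handler reads both w:t (DOCX) and a:t (PPTX, chart rich text), and text inside c:f is never collected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the Tika class: elements are matched by local name only.
class LocalNameTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inText; // true while inside any <*:t> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("t".equals(localName)) {
            inText = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("t".equals(localName)) {
            inText = false;
        } else if ("p".equals(localName)) {
            text.append('\n'); // one line per paragraph, whatever the namespace
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            text.append(ch, start, length);
        }
    }

    // Parses an XML fragment and returns the collected text.
    static String extract(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            LocalNameTextSketch handler = new LocalNameTextSketch();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String docx = "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
                + "<w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>";
        String pptx = "<a:p xmlns:a='http://schemas.openxmlformats.org/drawingml/2006/main'>"
                + "<a:r><a:t>Slide title</a:t></a:r></a:p>";
        System.out.print(extract(docx));
        System.out.print(extract(pptx));
    }
}
```

The sketch trades precision for reach: any element whose local name is "t" is treated as text, which is what lets one handler serve several OOXML vocabularies.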

Nested Class Summary
- static enum
- static interface

Field Summary
- W_NS

Constructor Summary
- OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks)
- OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)

Method Summary
- void characters(char[] ch, int start, int length)
- void endDocument()
- void endElement(String uri, String localName, String qName)
- void endPrefixMapping(String prefix)
- void ignorableWhitespace(char[] ch, int start, int length)
- void startDocument()
- void startElement(String uri, String localName, String qName, Attributes atts)
- void startPrefixMapping(String prefix, String uri)
Methods inherited from class org.xml.sax.helpers.DefaultHandler:
error, fatalError, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, unparsedEntityDecl, warning
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.xml.sax.ContentHandler:
declaration

Field Details
- W_NS

Constructor Details
- OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks)
- OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
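
Both constructors take a Map<String, String> named hyperlinks. The page does not say how the map is keyed, so the following sketch assumes one plausible reading: keys are relationship ids (the r:id attribute on a hyperlink element) and values are target URLs. The class HyperlinkMapSketch and its extract helper are hypothetical and use only the JDK; they are not the Tika implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: assumes the map goes from relationship id to target URL.
class HyperlinkMapSketch extends DefaultHandler {
    // Namespace of the r:id attribute used in OOXML parts.
    private static final String REL_NS =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    private final Map<String, String> hyperlinks;
    private final StringBuilder out = new StringBuilder();
    private boolean inText;

    HyperlinkMapSketch(Map<String, String> hyperlinks) {
        this.hyperlinks = hyperlinks;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("hyperlink".equals(localName)) {
            String id = atts.getValue(REL_NS, "id");
            // Guard against a missing id: immutable maps reject null keys.
            String target = (id == null) ? null : hyperlinks.get(id);
            if (target != null) {
                out.append('[').append(target).append("] ");
            }
        } else if ("t".equals(localName)) {
            inText = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("t".equals(localName)) {
            inText = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            out.append(ch, start, length);
        }
    }

    // Parses a fragment and returns link targets in brackets followed by the link text.
    static String extract(String xml, Map<String, String> hyperlinks) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            HyperlinkMapSketch handler = new HyperlinkMapSketch(hyperlinks);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
                + "xmlns:r='http://schemas.openxmlformats.org/officeDocument/2006/relationships'>"
                + "<w:hyperlink r:id='rId5'><w:r><w:t>Tika</w:t></w:r></w:hyperlink></w:p>";
        System.out.println(extract(xml, Map.of("rId5", "https://tika.apache.org/")));
    }
}
```

Resolving ids through a map supplied up front keeps the handler a pure streaming component: it never has to open the relationships part itself.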
 

Method Details
startDocument
public void startDocument() throws SAXException
- Specified by: startDocument in interface ContentHandler
- Overrides: startDocument in class DefaultHandler
- Throws: SAXException
 
endDocument
public void endDocument() throws SAXException
- Specified by: endDocument in interface ContentHandler
- Overrides: endDocument in class DefaultHandler
- Throws: SAXException
 
startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
- Specified by: startPrefixMapping in interface ContentHandler
- Overrides: startPrefixMapping in class DefaultHandler
- Throws: SAXException
 
endPrefixMapping
public void endPrefixMapping(String prefix) throws SAXException
- Specified by: endPrefixMapping in interface ContentHandler
- Overrides: endPrefixMapping in class DefaultHandler
- Throws: SAXException
 
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
- Specified by: startElement in interface ContentHandler
- Overrides: startElement in class DefaultHandler
- Throws: SAXException
 
endElement
public void endElement(String uri, String localName, String qName) throws SAXException
- Specified by: endElement in interface ContentHandler
- Overrides: endElement in class DefaultHandler
- Throws: SAXException
 
characters
public void characters(char[] ch, int start, int length) throws SAXException
- Specified by: characters in interface ContentHandler
- Overrides: characters in class DefaultHandler
- Throws: SAXException
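
Under the SAX ContentHandler contract, a parser may deliver one run of character data through several characters calls. A handler that overrides characters therefore usually buffers until the matching endElement. The sketch below shows that general pattern only; BufferedRunSketch and its runs accessor are hypothetical names, and this page does not describe how the Tika class itself buffers text.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: buffers character chunks and emits one string per <*:t> element.
class BufferedRunSketch extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> runs = new ArrayList<>();
    private boolean inText;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("t".equals(localName)) {
            inText = true;
            buffer.setLength(0); // start a fresh run
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            buffer.append(ch, start, length); // may be called more than once per run
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("t".equals(localName)) {
            inText = false;
            runs.add(buffer.toString()); // the run is complete only here
        }
    }

    List<String> runs() {
        return runs;
    }

    public static void main(String[] args) {
        // Simulate a parser that splits "Hello" into two chunks.
        BufferedRunSketch handler = new BufferedRunSketch();
        handler.startElement("", "t", "w:t", new AttributesImpl());
        handler.characters("Hel".toCharArray(), 0, 3);
        handler.characters("lo".toCharArray(), 0, 2);
        handler.endElement("", "t", "w:t");
        System.out.println(handler.runs()); // prints [Hello]
    }
}
```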
 
ignorableWhitespace
public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException
- Specified by: ignorableWhitespace in interface ContentHandler
- Overrides: ignorableWhitespace in class DefaultHandler
- Throws: SAXException