Class OOXMLWordAndPowerPointTextHandler
java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler
- All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler
This class is intended to handle anything that might contain IBodyElements:
main document, headers, footers, notes, slides, etc.
This class does not generally check for namespaces, and it can be applied to PPTX and DOCX for text extraction.
It can also be used to scrape content from charts. It currently ignores formula (<c:f/>) elements.
It does not work with .xlsx or .vsdx.
TODO: move this into POI?
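The namespace-agnostic approach described above can be illustrated with a minimal sketch. This is not the Tika implementation: the class name LocalNameTextSketch, its extract method, and the choice of which element names count as text are all assumptions made for illustration. It uses only the JDK SAX parser and matches elements by local name, so the same handler collects text from WordprocessingML (w:t) and DrawingML (a:t) runs, picks up cached chart values (c:v), and never collects the contents of chart formula (c:f) elements.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika's handler: extracts text by local name only.
class LocalNameTextSketch extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean inText; // true while inside a text-bearing element

    // Assumption for this sketch: "t" (w:t, a:t) and "v" (c:v) carry text.
    // "f" (c:f, chart formulas) is deliberately absent, so formulas are skipped.
    private static boolean isTextElement(String localName) {
        return "t".equals(localName) || "v".equals(localName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The namespace URI is ignored on purpose; only the local name matters.
        if (isTextElement(localName)) {
            inText = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isTextElement(localName)) {
            inText = false;
        } else if ("p".equals(localName)) {
            out.append('\n'); // end of a paragraph (w:p or a:p)
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so always append.
        if (inText) {
            out.append(ch, start, length);
        }
    }

    static String extract(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            LocalNameTextSketch handler = new LocalNameTextSketch();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML fragment", e);
        }
    }

    public static void main(String[] args) {
        String docx = "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
                + "<w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>";
        String pptx = "<a:p xmlns:a='http://schemas.openxmlformats.org/drawingml/2006/main'>"
                + "<a:r><a:t>Slide title</a:t></a:r></a:p>";
        System.out.print(extract(docx));
        System.out.print(extract(pptx));
    }
}
```

Because nothing in the sketch depends on the namespace URI, the DOCX and PPTX fragments go through the same code path. The trade-off is that any element whose local name happens to be t or v is treated as text, whatever vocabulary it comes from.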
-
Nested Class Summary
Nested Classes
Modifier and Type  Class  Description
static enum
static interface
-
Field Summary
Fields
W_NS
-
Constructor Summary
Constructors
Constructor  Description
OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks)
OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
-
Method Summary
Modifier and Type  Method  Description
void  characters(char[] ch, int start, int length)
void  endDocument()
void  endElement(String uri, String localName, String qName)
void  endPrefixMapping(String prefix)
void  ignorableWhitespace(char[] ch, int start, int length)
void  startDocument()
void  startElement(String uri, String localName, String qName, Attributes atts)
void  startPrefixMapping(String prefix, String uri)
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, unparsedEntityDecl, warning
-
Field Details
-
W_NS
-
Constructor Details
-
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks)
-
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
-
-
Method Details
-
startDocument
public void startDocument() throws SAXException
- Specified by:
startDocument in interface ContentHandler
- Overrides:
startDocument in class DefaultHandler
- Throws:
SAXException
-
endDocument
public void endDocument() throws SAXException
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class DefaultHandler
- Throws:
SAXException
-
startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
- Specified by:
startPrefixMapping in interface ContentHandler
- Overrides:
startPrefixMapping in class DefaultHandler
- Throws:
SAXException
-
endPrefixMapping
public void endPrefixMapping(String prefix) throws SAXException
- Specified by:
endPrefixMapping in interface ContentHandler
- Overrides:
endPrefixMapping in class DefaultHandler
- Throws:
SAXException
-
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
- Specified by:
startElement in interface ContentHandler
- Overrides:
startElement in class DefaultHandler
- Throws:
SAXException
-
endElement
public void endElement(String uri, String localName, String qName) throws SAXException
- Specified by:
endElement in interface ContentHandler
- Overrides:
endElement in class DefaultHandler
- Throws:
SAXException
-
characters
public void characters(char[] ch, int start, int length) throws SAXException
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class DefaultHandler
- Throws:
SAXException
-
ignorableWhitespace
public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException
- Specified by:
ignorableWhitespace in interface ContentHandler
- Overrides:
ignorableWhitespace in class DefaultHandler
- Throws:
SAXException
-