Class OOXMLWordAndPowerPointTextHandler
java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler
- All Implemented Interfaces:
ContentHandler
,DTDHandler
,EntityResolver
,ErrorHandler
This class is intended to handle anything that might contain IBodyElements:
main document, headers, footers, notes, slides, etc.
This class does not generally check for namespaces, and it can be applied to PPTX and DOCX for text extraction.
This can be used to scrape content from charts. It currently ignores formula (<c:f/>) elements
This does not work with .xlsx or .vsdx.
TODO: move this into POI?
-
Nested Class Summary
Modifier and TypeClassDescriptionstatic enum
static interface
-
Field Summary
-
Constructor Summary
ConstructorDescriptionOOXMLWordAndPowerPointTextHandler
(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks) OOXMLWordAndPowerPointTextHandler
(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns) -
Method Summary
Modifier and TypeMethodDescriptionvoid
characters
(char[] ch, int start, int length) void
void
endElement
(String uri, String localName, String qName) void
endPrefixMapping
(String prefix) void
ignorableWhitespace
(char[] ch, int start, int length) void
void
startElement
(String uri, String localName, String qName, Attributes atts) void
startPrefixMapping
(String prefix, String uri) Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, unparsedEntityDecl, warning
-
Field Details
-
W_NS
- See Also:
-
-
Constructor Details
-
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks) -
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String, String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
-
-
Method Details
-
startDocument
- Specified by:
startDocument
in interfaceContentHandler
- Overrides:
startDocument
in classDefaultHandler
- Throws:
SAXException
-
endDocument
- Specified by:
endDocument
in interfaceContentHandler
- Overrides:
endDocument
in classDefaultHandler
- Throws:
SAXException
-
startPrefixMapping
- Specified by:
startPrefixMapping
in interfaceContentHandler
- Overrides:
startPrefixMapping
in classDefaultHandler
- Throws:
SAXException
-
endPrefixMapping
- Specified by:
endPrefixMapping
in interfaceContentHandler
- Overrides:
endPrefixMapping
in classDefaultHandler
- Throws:
SAXException
-
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException - Specified by:
startElement
in interfaceContentHandler
- Overrides:
startElement
in classDefaultHandler
- Throws:
SAXException
-
endElement
- Specified by:
endElement
in interfaceContentHandler
- Overrides:
endElement
in classDefaultHandler
- Throws:
SAXException
-
characters
- Specified by:
characters
in interfaceContentHandler
- Overrides:
characters
in classDefaultHandler
- Throws:
SAXException
-
ignorableWhitespace
- Specified by:
ignorableWhitespace
in interfaceContentHandler
- Overrides:
ignorableWhitespace
in classDefaultHandler
- Throws:
SAXException
-