Class OOXMLWordAndPowerPointTextHandler
- java.lang.Object
-
- org.xml.sax.helpers.DefaultHandler
-
- org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler
-
- All Implemented Interfaces:
ContentHandler
,DTDHandler
,EntityResolver
,ErrorHandler
public class OOXMLWordAndPowerPointTextHandler extends DefaultHandler
This class is intended to handle anything that might contain IBodyElements: main document, headers, footers, notes, slides, etc. This class does not generally check for namespaces, and it can be applied to PPTX and DOCX for text extraction. This can be used to scrape content from charts. It currently ignores formula (<c:f/>) elements This does not work with .xlsx or .vsdx. TODO: move this into POI?
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
OOXMLWordAndPowerPointTextHandler.EditType
static interface
OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler
-
Constructor Summary
Constructors Constructor Description OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String,String> hyperlinks)
OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String,String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description void
characters(char[] ch, int start, int length)
void
endDocument()
void
endElement(String uri, String localName, String qName)
void
endPrefixMapping(String prefix)
void
ignorableWhitespace(char[] ch, int start, int length)
void
startDocument()
void
startElement(String uri, String localName, String qName, Attributes atts)
void
startPrefixMapping(String prefix, String uri)
-
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, unparsedEntityDecl, warning
-
-
-
-
Field Detail
-
W_NS
public static final String W_NS
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String,String> hyperlinks)
-
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, Map<String,String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
-
-
Method Detail
-
startDocument
public void startDocument() throws SAXException
- Specified by:
startDocument
in interfaceContentHandler
- Overrides:
startDocument
in classDefaultHandler
- Throws:
SAXException
-
endDocument
public void endDocument() throws SAXException
- Specified by:
endDocument
in interfaceContentHandler
- Overrides:
endDocument
in classDefaultHandler
- Throws:
SAXException
-
startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
- Specified by:
startPrefixMapping
in interfaceContentHandler
- Overrides:
startPrefixMapping
in classDefaultHandler
- Throws:
SAXException
-
endPrefixMapping
public void endPrefixMapping(String prefix) throws SAXException
- Specified by:
endPrefixMapping
in interfaceContentHandler
- Overrides:
endPrefixMapping
in classDefaultHandler
- Throws:
SAXException
-
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
- Specified by:
startElement
in interfaceContentHandler
- Overrides:
startElement
in classDefaultHandler
- Throws:
SAXException
-
endElement
public void endElement(String uri, String localName, String qName) throws SAXException
- Specified by:
endElement
in interfaceContentHandler
- Overrides:
endElement
in classDefaultHandler
- Throws:
SAXException
-
characters
public void characters(char[] ch, int start, int length) throws SAXException
- Specified by:
characters
in interfaceContentHandler
- Overrides:
characters
in classDefaultHandler
- Throws:
SAXException
-
ignorableWhitespace
public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException
- Specified by:
ignorableWhitespace
in interfaceContentHandler
- Overrides:
ignorableWhitespace
in classDefaultHandler
- Throws:
SAXException
-
-