Class OOXMLWordAndPowerPointTextHandler

  • All Implemented Interfaces:
    ContentHandler, DTDHandler, EntityResolver, ErrorHandler

    public class OOXMLWordAndPowerPointTextHandler
    extends DefaultHandler
    This class is intended to handle anything that might contain IBodyElements: main document, headers, footers, notes, slides, etc.

    This class does not generally check for namespaces, so it can be applied to both PPTX and DOCX for text extraction.
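
    As an illustration of what matching without namespace checks can look like, here is a minimal sketch that uses only the JDK SAX API. It is not this class's actual implementation: LocalNameTextSketch, TextRunHandler and extract are hypothetical names, and the choice to collect text from "t" elements and end paragraphs at "p" elements is an assumption made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika or POI.
class LocalNameTextSketch {

    /** Collects character data found inside any element whose local name is "t". */
    static class TextRunHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inTextRun = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The namespace URI is deliberately ignored (assumption for this sketch),
            // so both w:t (WordprocessingML) and a:t (DrawingML) match.
            if ("t".equals(localName)) {
                inTextRun = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("t".equals(localName)) {
                inTextRun = false;
            } else if ("p".equals(localName)) {
                // End of a paragraph (w:p or a:p): emit a line break.
                text.append('\n');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTextRun) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses an XML fragment and returns the text collected by TextRunHandler. */
    static String extract(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required so that localName is populated
        TextRunHandler handler = new TextRunHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String docx = "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
                + "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>";
        String pptx = "<p:txBody xmlns:p='http://schemas.openxmlformats.org/presentationml/2006/main'"
                + " xmlns:a='http://schemas.openxmlformats.org/drawingml/2006/main'>"
                + "<a:p><a:r><a:t>Hello</a:t></a:r></a:p></p:txBody>";
        System.out.print(extract(docx));
        System.out.print(extract(pptx));
    }
}
```

    The trade-off in this sketch is that any element named "t" or "p" in any namespace is treated the same way. That is what lets one handler serve both formats, and it is also why such a handler could pick up unintended elements in other XML vocabularies.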

    This can be used to scrape content from charts. It currently ignores formula (<c:f/>) elements.
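
    A rough sketch of skipping formula elements while scraping chart text follows. It uses only the JDK SAX API and does not reproduce this class's real logic: ChartTextSketch, FormulaSkippingHandler and scrape are hypothetical names, and the rule of emitting a line break at the end of each "pt" element is an assumption made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika or POI.
class ChartTextSketch {

    /** Collects all character data except what appears inside "f" (formula) elements. */
    static class FormulaSkippingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int formulaDepth = 0; // greater than 0 while inside a formula element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("f".equals(localName)) {
                formulaDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("f".equals(localName)) {
                formulaDepth--;
            } else if ("pt".equals(localName)) {
                // Assumption: one line per cached data point.
                text.append('\n');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (formulaDepth == 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses a chart XML fragment and returns the text outside formula elements. */
    static String scrape(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required so that localName is populated
        FormulaSkippingHandler handler = new FormulaSkippingHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String chart = "<c:cat xmlns:c='http://schemas.openxmlformats.org/drawingml/2006/chart'>"
                + "<c:strRef><c:f>Sheet1!$A$2:$A$3</c:f><c:strCache>"
                + "<c:pt idx='0'><c:v>Apples</c:v></c:pt>"
                + "<c:pt idx='1'><c:v>Pears</c:v></c:pt>"
                + "</c:strCache></c:strRef></c:cat>";
        System.out.print(scrape(chart));
    }
}
```

    In this sketch the cell reference inside the formula element is dropped, while the cached category labels are kept.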

    This does not work with .xlsx or .vsdx.

    TODO: move this into POI?