Class OfficeParser

java.lang.Object
org.apache.tika.parser.microsoft.AbstractOfficeParser
org.apache.tika.parser.microsoft.OfficeParser
All Implemented Interfaces:
Serializable, Parser

public class OfficeParser extends AbstractOfficeParser
Defines a Microsoft document content extractor.
  • Constructor Details

    • OfficeParser

      public OfficeParser()
  • Method Details

    • extractMacros

      public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws IOException, SAXException
      Helper to extract macros from an NPOIFS/vbaProject.bin.

      As of POI-3.15-final, VBAMacroReader still has some bugs. For now, NullPointerException and other runtime exceptions thrown while reading macros are swallowed.

      Parameters:
      fs - NPOIFS to extract from
      xhtml - SAX writer
      embeddedDocumentExtractor - extractor for embedded documents
      Throws:
      IOException - if an I/O error occurs while extracting the embedded document
      SAXException - if writing to xhtml fails
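The exception handling described for extractMacros can be sketched as follows. This is a minimal illustration using only the JDK, not the actual Tika implementation: the `MacroSource` interface and the `readLeniently` method are hypothetical stand-ins for the macro reader, and returning an empty map after a swallowed exception is an assumption of this sketch.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;

class LenientMacroReadSketch {

    /** Hypothetical stand-in for a macro reader that may hit a runtime bug. */
    interface MacroSource {
        Map<String, String> readMacros() throws IOException;
    }

    /**
     * Reads macros, swallowing runtime exceptions (such as NullPointerException)
     * while still letting checked IOExceptions propagate to the caller.
     */
    static Map<String, String> readLeniently(MacroSource source) throws IOException {
        try {
            return source.readMacros();
        } catch (RuntimeException e) {
            // Assumption of this sketch: a reader bug is treated as "no macros found".
            return Collections.emptyMap();
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> macros = readLeniently(() -> {
            throw new NullPointerException("simulated reader bug");
        });
        System.out.println("macros found: " + macros.size());
    }
}
```

Because IOException is a checked exception and not a RuntimeException, the catch clause above does not intercept it, which matches the Throws section of extractMacros.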
    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Description copied from interface: Parser
      Returns the set of media types supported by this parser when used with the given parse context.
      Parameters:
      context - parse context
      Returns:
      immutable set of media types
    • parse

      public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
      Extracts properties and text from a Microsoft document input stream.
      Parameters:
      stream - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      context - parse context
      Throws:
      IOException - if the document stream could not be read
      SAXException - if the SAX events could not be processed
      TikaException - if the document could not be parsed
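The handler parameter of parse is an org.xml.sax.ContentHandler that receives the XHTML SAX events. The sketch below shows what a minimal text-collecting handler might look like. To stay self-contained it is driven by the JDK's own SAX parser over a small XHTML string, not by OfficeParser; the class name `TextCollectingHandler` and its `collect` helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class TextCollectingHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    /** Appends every run of character data reported by the SAX source. */
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    /** Feeds an XHTML string through the JDK SAX parser and returns the collected text. */
    static String collect(String xhtml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        System.out.println(collect("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

Since DefaultHandler implements ContentHandler, an instance of such a class could be passed as the handler argument of parse(stream, handler, metadata, context) when Tika is on the classpath.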
    • parse

      protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException
      Throws:
      IOException
      SAXException
      TikaException
    • getUCEntry

      public static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
      Looks for an entry directly within root (non-recursive) whose upper-cased name equals ucTarget.
      Parameters:
      root - the directory whose immediate entries are searched
      ucTarget - the upper-cased entry name to match
      Returns:
      the entry whose upper-cased name equals ucTarget
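The matching rule of getUCEntry can be sketched with plain strings in place of POI Entry objects. This is an illustration only: the class and method names are hypothetical, and both the use of Locale.ROOT for upper-casing and the null result when nothing matches are assumptions of the sketch, not documented behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class UpperCaseLookupSketch {

    /**
     * Scans only the given (top-level) names, upper-cases each one, and
     * returns the first original name whose upper-cased form equals ucTarget.
     */
    static String findByUpperCasedName(List<String> entryNames, String ucTarget) {
        for (String name : entryNames) {
            if (name.toUpperCase(Locale.ROOT).equals(ucTarget)) {
                return name;
            }
        }
        return null; // assumption of this sketch: no match yields null
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("WordDocument", "Macros", "1Table");
        System.out.println(findByUpperCasedName(names, "MACROS"));
    }
}
```

Note that only the entry names are upper-cased; the caller is expected to supply ucTarget already in upper case, so a lower-case target such as "macros" would not match in this sketch.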