java.lang.Object

org.apache.tika.parser.microsoft.AbstractOfficeParser

org.apache.tika.parser.microsoft.OfficeParser

All Implemented Interfaces: Serializable, Parser

public class OfficeParser extends AbstractOfficeParser

Defines a Microsoft document content extractor.

See Also:

Serialized Form

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static enum

OfficeParser.POIFSDocumentType
Constructor Summary

Constructors

Constructor

Description

OfficeParser()
Method Summary

Modifier and Type

Method

Description

static void

extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor)

Helper to extract macros from an NPOIFS/vbaProject.bin

Set<MediaType>

getSupportedTypes(ParseContext context)

Returns the set of media types supported by this parser when used with the given parse context.

static org.apache.poi.poifs.filesystem.Entry

getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)

Looks for an entry within root (non-recursive) whose upper-cased name equals ucTarget

void

parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

Extracts properties and text from an MS Document input stream

protected void

parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)

Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getByteArrayMaxOverride, getDateFormatOverride, isConcatenatePhoneticRuns, isExtractAllAlternativesFromMSG, isExtractMacros, isIncludeDeletedContent, isIncludeHeadersAndFooters, isIncludeMoveFromContent, isIncludeShapeBasedContent, isUseSAXDocxExtractor, isUseSAXPptxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeHeadersAndFooters, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- OfficeParser
  
  public OfficeParser()
Method Details
- extractMacros
  
  public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws IOException, SAXException
  
  Helper to extract macros from an NPOIFS/vbaProject.bin.
  As of POI-3.15-final, VBAMacroReader still has some bugs, so for now this method swallows NullPointerException and other runtime exceptions.
  
  Parameters:
  
  fs - NPOIFS to extract from
  
  xhtml - SAX writer
  
  embeddedDocumentExtractor - extractor for embedded documents
  
  Throws:
  
  IOException - if an I/O error occurs while extracting the embedded document
  
  SAXException - if an error occurs while writing to xhtml
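
The exception handling described for extractMacros can be pictured with a small stand-alone sketch. It uses neither POI nor Tika: `MacroSource` and `readMacrosQuietly` are hypothetical names invented for illustration, and the real method may handle failures differently. The sketch only shows the general pattern, in which an unchecked exception from a buggy reader is treated as "no macros" while a checked IOException still reaches the caller.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;

public class QuietMacroRead {

    /** Hypothetical stand-in for a macro reader that may fail with unchecked exceptions. */
    public interface MacroSource {
        Map<String, String> readMacros() throws IOException;
    }

    /**
     * Returns the macros from the source, or an empty map if the reader
     * throws a RuntimeException (for example a NullPointerException).
     * Checked IOExceptions are not swallowed and propagate to the caller.
     */
    public static Map<String, String> readMacrosQuietly(MacroSource source) throws IOException {
        try {
            Map<String, String> macros = source.readMacros();
            if (macros == null) {
                return Collections.<String, String>emptyMap();
            }
            return macros;
        } catch (RuntimeException e) {
            // Assumption for this sketch: a reader bug means "no macros", not a failed parse.
            return Collections.<String, String>emptyMap();
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> ok =
                readMacrosQuietly(() -> Collections.singletonMap("Module1", "Sub Hello()"));
        Map<String, String> broken =
                readMacrosQuietly(() -> { throw new NullPointerException(); });
        System.out.println(ok.keySet());      // names of the macros that were read
        System.out.println(broken.isEmpty()); // the runtime failure was swallowed
    }
}
```

One consequence of this pattern is that a caller cannot tell a document without macros from one whose macros could not be read, so missing macro output is not proof that none exist.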
- getSupportedTypes
  
  public Set<MediaType> getSupportedTypes(ParseContext context)
  
  Description copied from interface: Parser
  
  Returns the set of media types supported by this parser when used with the given parse context.
  
  Parameters:
  
  context - parse context
  
  Returns:
  
  immutable set of media types
- parse
  
  public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
  
  Extracts properties and text from an MS Document input stream
  
  Parameters:
  
  stream - the document stream (input)
  
  handler - handler for the XHTML SAX events (output)
  
  metadata - document metadata (input and output)
  
  context - parse context
  
  Throws:
  
  IOException - if the document stream could not be read
  
  SAXException - if the SAX events could not be processed
  
  TikaException - if the document could not be parsed
- parse
  
  protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException
  
  Throws:
  
  IOException
  
  SAXException
  
  TikaException
- getUCEntry
  
  public static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
  
  Looks for an entry within root (non-recursive) whose upper-cased name equals ucTarget
  
  Parameters:
  
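  root - the directory entry whose direct children are searched
  
  ucTarget - the upper-cased name to match
  
  Returns:
  
  the entry whose upper-cased name equals ucTarget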
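
The matching rule described for getUCEntry can be sketched without POI. In the sketch below, a plain list of names stands in for the direct children of a DirectoryEntry. `UpperCaseLookup` and `findUpperCased` are hypothetical names, and returning null when nothing matches is an assumption of this sketch, not something this page states about getUCEntry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class UpperCaseLookup {

    /**
     * Returns the first name among the direct children whose upper-cased form
     * equals ucTarget, or null if none matches. Nested children are not visited.
     */
    public static String findUpperCased(List<String> childNames, String ucTarget) {
        if (childNames == null || ucTarget == null) {
            return null;
        }
        for (String name : childNames) {
            // Locale.ROOT keeps the comparison independent of the default locale.
            if (name.toUpperCase(Locale.ROOT).equals(ucTarget)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("WordDocument", "Macros", "1Table");
        System.out.println(findUpperCased(children, "MACROS")); // prints "Macros"
        System.out.println(findUpperCased(children, "macros")); // prints "null"
    }
}
```

Only the entry names are upper-cased, not the target, so under this rule the caller has to pass a target that is already in upper case, as the second call shows.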
