Package org.apache.tika.parser.microsoft
Class OfficeParser
java.lang.Object
org.apache.tika.parser.microsoft.AbstractOfficeParser
org.apache.tika.parser.microsoft.OfficeParser
- All Implemented Interfaces:
Serializable, SelfConfiguring, Parser
Defines a Microsoft document content extractor.
Nested Class Summary
Nested Classes -
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
- static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor, ParseContext context)
  Helper to extract macros from an NPOIFS/vbaProject.bin
- getSupportedTypes(ParseContext context)
  Returns the set of media types supported by this parser when used with the given parse context.
- static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
  Looks for entry within root (non-recursive) that has an upper-cased name that equals ucTarget
- protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
- void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context)
  Extracts properties and text from an MS Document input stream
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getByteArrayMaxOverride, getDefaultConfig, setByteArrayMaxOverride, setDefaultOfficeParserConfig
-
Constructor Details
-
OfficeParser
public OfficeParser() -
OfficeParser
-
OfficeParser
-
-
Method Details
-
extractMacros
public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor, ParseContext context) throws IOException, SAXException, TikaException
Helper to extract macros from an NPOIFS/vbaProject.bin.
As of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, NPE and other runtime exceptions are swallowed.
- Parameters:
fs - NPOIFS to extract from
xhtml - SAX writer
embeddedDocumentExtractor - extractor for embedded documents
context - parse context for creating metadata
- Throws:
IOException - if an I/O error occurs during extraction of the embedded document
SAXException - if writing to xhtml fails
TikaException
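The note about swallowing runtime exceptions can be illustrated with a small, library-free sketch. It does not call POI or Tika; the class `MacroExtractionSketch`, the method `readMacrosLeniently`, and the `Supplier` standing in for a macro reader are all hypothetical names introduced here. It only shows the general pattern of treating an unchecked failure in a reader as "no macros found".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in: not part of Tika or POI.
final class MacroExtractionSketch {

    /**
     * Calls a reader that may fail with an unchecked exception and
     * returns an empty list instead of propagating that failure.
     */
    static List<String> readMacrosLeniently(Supplier<List<String>> reader) {
        try {
            List<String> macros = reader.get();
            // Treat a null result the same as "nothing found".
            return macros == null ? new ArrayList<>() : macros;
        } catch (RuntimeException e) {
            // Covers NullPointerException and other unchecked failures.
            return new ArrayList<>();
        }
    }
}
```

In this sketch only unchecked exceptions are caught, so a reader that signals failure through a checked exception would still have to declare and propagate it.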
-
getSupportedTypes
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Parameters:
context - parse context
- Returns:
immutable set of media types
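As a rough illustration of what "immutable set" means for callers, the following sketch builds an unmodifiable set with the standard library. It uses plain strings in place of Tika's media type objects, and the class `SupportedTypesSketch` and the two listed type names are illustrative assumptions, not the parser's actual list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in: plain strings instead of Tika media types.
final class SupportedTypesSketch {

    // Illustrative entries only; the real parser defines its own set.
    private static final Set<String> SUPPORTED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                    "application/msword",
                    "application/vnd.ms-excel")));

    /** Returns a read-only view; attempts to modify it throw. */
    static Set<String> getSupportedTypes() {
        return SUPPORTED;
    }

    /** Convenience check for a single type name. */
    static boolean supports(String type) {
        return SUPPORTED.contains(type);
    }
}
```

A caller that needs a modifiable collection would copy the result, for example into a new `HashSet`, rather than adding to it directly.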
-
parse
public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Extracts properties and text from an MS Document input stream
- Parameters:
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
-
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
-
getUCEntry
public static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
Looks for entry within root (non-recursive) that has an upper-cased name that equals ucTarget
- Parameters:
root -
ucTarget -
- Returns:
-
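The lookup described for getUCEntry can be sketched without POI as a single pass over the names of a directory's direct children. The class `UpperCaseLookupSketch`, the method `findByUpperCaseName`, the use of a plain list of names, and the choice to return null when nothing matches are all assumptions made for illustration. The documentation above does not state what the real method returns in that case.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in: entry names as strings, no POI types.
final class UpperCaseLookupSketch {

    /**
     * Scans only the given (top-level) names and returns the first one
     * whose upper-cased form equals ucTarget, or null if none matches.
     */
    static String findByUpperCaseName(List<String> entryNames, String ucTarget) {
        for (String name : entryNames) {
            // Locale.ROOT keeps the upper-casing locale-independent.
            if (name != null && name.toUpperCase(Locale.ROOT).equals(ucTarget)) {
                return name;
            }
        }
        return null; // assumption: no match is reported as null
    }
}
```

Note that only the entry name is upper-cased. The target is compared as given, so in this sketch a caller is expected to pass it already in upper case.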