Package org.apache.tika.parser.microsoft
Class OfficeParser
java.lang.Object
org.apache.tika.parser.AbstractParser
org.apache.tika.parser.microsoft.AbstractOfficeParser
org.apache.tika.parser.microsoft.OfficeParser
- All Implemented Interfaces:
Serializable, Parser
Defines a Microsoft document content extractor.
Method Summary
Modifier and Type, Method, Description
- static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor)
  Helper to extract macros from an NPOIFS/vbaProject.bin.
- getSupportedTypes(ParseContext context)
  Returns the set of media types supported by this parser when used with the given parse context.
- static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
  Looks for an entry within root (non-recursive) whose upper-cased name equals ucTarget.
- void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
  Extracts properties and text from an MS Document input stream.
- protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getByteArrayMaxOverride, getDateFormatOverride, isConcatenatePhoneticRuns, isExtractAllAlternativesFromMSG, isExtractMacros, isIncludeDeletedContent, isIncludeHeadersAndFooters, isIncludeMoveFromContent, isIncludeShapeBasedContent, isUseSAXDocxExtractor, isUseSAXPptxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeHeadersAndFooters, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
Constructor Details
OfficeParser
public OfficeParser()
Method Details
extractMacros
public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws IOException, SAXException
Helper to extract macros from an NPOIFS/vbaProject.bin.
As of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, NPEs and other runtime exceptions are swallowed.
- Parameters:
fs - NPOIFS to extract from
xhtml - SAX writer
embeddedDocumentExtractor - extractor for embedded documents
- Throws:
IOException - if an I/O error occurs while extracting the embedded document
SAXException - if an error occurs while writing to xhtml
getSupportedTypes
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Parameters:
context - parse context
- Returns:
immutable set of media types
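As a rough illustration of querying the supported types, the sketch below builds a parser with the documented no-argument constructor and lists what it reports for an empty parse context. The class and method names (`SupportedTypesExample`, `supportedTypeNames`) are hypothetical, and the exact set returned may differ between Tika versions.

```java
import java.util.Set;
import java.util.TreeSet;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParser;

// Hypothetical helper class, not part of Tika.
final class SupportedTypesExample {

    /** Returns the supported media types as sorted strings, for stable printing. */
    static Set<String> supportedTypeNames() {
        Set<MediaType> types = new OfficeParser().getSupportedTypes(new ParseContext());
        Set<String> names = new TreeSet<>();
        for (MediaType type : types) {
            names.add(type.toString()); // e.g. "application/msword"
        }
        return names;
    }

    public static void main(String[] args) {
        supportedTypeNames().forEach(System.out::println);
    }
}
```

Because the returned set is documented as immutable, the sketch copies the names into its own collection instead of modifying the result.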
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Extracts properties and text from an MS Document input stream.
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
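A minimal sketch of calling this method directly might look like the following. It assumes Tika's `BodyContentHandler` as the SAX handler (any `ContentHandler` would do) and takes the document path from the command line. `OfficeParseExample` and `extractText` are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika.
final class OfficeParseExample {

    /** Parses the stream and returns the body text; metadata is filled in as a side effect. */
    static String extractText(InputStream stream, Metadata metadata)
            throws IOException, SAXException, TikaException {
        // A write limit of -1 disables BodyContentHandler's default character cap.
        BodyContentHandler handler = new BodyContentHandler(-1);
        new OfficeParser().parse(stream, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args[0]); // path to a legacy Office file, supplied by the caller
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(path)) {
            System.out.println(extractText(in, metadata));
        }
        // The metadata argument is both input and output: print what the parser added.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

The caller opens and closes the stream here, and all three declared exceptions are left to propagate so that read failures, SAX failures, and parse failures stay distinguishable.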
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
getUCEntry
public static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
Looks for an entry within root (non-recursive) whose upper-cased name equals ucTarget.
- Parameters:
root - the directory entry whose immediate children are searched
ucTarget - the upper-cased name to match
- Returns:
the matching entry
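The matching rule described above can be sketched with plain strings standing in for POI entries. This is an illustration of the rule only, not Tika's implementation. The class name, the use of `Locale.ROOT` for upper-casing, and returning `null` when nothing matches are all assumptions of the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in: child names play the role of directory entries.
final class UpperCaseLookupSketch {

    /**
     * Returns the first name whose upper-cased form equals ucTarget, or null
     * if none does. Only the names passed in are inspected; nothing is
     * searched recursively.
     */
    static String findUpperCased(List<String> childNames, String ucTarget) {
        for (String name : childNames) {
            if (name.toUpperCase(Locale.ROOT).equals(ucTarget)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("WordDocument", "1Table", "Macros");
        System.out.println(findUpperCased(children, "WORDDOCUMENT")); // prints WordDocument
        System.out.println(findUpperCased(children, "worddocument")); // prints null
    }
}
```

Only the entry name is upper-cased, so the caller must pass a target that is already upper-case. A lower-case target never matches under this rule.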