Package org.apache.tika.parser.microsoft
Class OfficeParser
- java.lang.Object
-
- org.apache.tika.parser.AbstractParser
-
- org.apache.tika.parser.microsoft.AbstractOfficeParser
-
- org.apache.tika.parser.microsoft.OfficeParser
-
- All Implemented Interfaces:
Serializable
,Parser
public class OfficeParser extends AbstractOfficeParser
Defines a Microsoft document content extractor.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
OfficeParser.POIFSDocumentType
-
Constructor Summary
Constructors Constructor Description OfficeParser()
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description static void
extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor)
Helper to extract macros from an NPOIFS/vbaProject.binSet<MediaType>
getSupportedTypes(ParseContext context)
Returns the set of media types supported by this parser when used with the given parse context.static org.apache.poi.poifs.filesystem.Entry
getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
Looks for entry within root (non-recursive) that has an upper-cased name that equals ucTargetvoid
parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
Extracts properties and text from an MS Document input streamprotected void
parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
-
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getByteArrayMaxOverride, getDateFormatOverride, isConcatenatePhoneticRuns, isExtractAllAlternativesFromMSG, isExtractMacros, isIncludeDeletedContent, isIncludeHeadersAndFooters, isIncludeMoveFromContent, isIncludeShapeBasedContent, isUseSAXDocxExtractor, isUseSAXPptxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeHeadersAndFooters, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
-
-
-
Method Detail
-
extractMacros
public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws IOException, SAXException
Helper to extract macros from an NPOIFS/vbaProject.binAs of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, we are swallowing NPE and other runtime exceptions
- Parameters:
fs
- NPOIFS to extract fromxhtml
- SAX writerembeddedDocumentExtractor
- extractor for embedded documents- Throws:
IOException
- on IOException if it occurs during the extraction of the embedded docSAXException
- on SAXException for writing to xhtml
-
getSupportedTypes
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface:Parser
Returns the set of media types supported by this parser when used with the given parse context.- Parameters:
context
- parse context- Returns:
- immutable set of media types
-
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Extracts properties and text from an MS Document input stream- Parameters:
stream
- the document stream (input)handler
- handler for the XHTML SAX events (output)metadata
- document metadata (input and output)context
- parse context- Throws:
IOException
- if the document stream could not be readSAXException
- if the SAX events could not be processedTikaException
- if the document could not be parsed
-
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
-
getUCEntry
public static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
Looks for entry within root (non-recursive) that has an upper-cased name that equals ucTarget- Parameters:
root
-ucTarget
-- Returns:
-
-