Package org.apache.tika.parser.microsoft
Class ExcelExtractor
- java.lang.Object
-
- org.apache.tika.parser.microsoft.ExcelExtractor
-
public class ExcelExtractor extends Object
Excel parser implementation which uses POI's Event API to handle the contents of a Workbook. The Event API uses a much smaller memory footprint than HSSFWorkbook when processing Excel files, but at the cost of more complexity. With the Event API, a listener is registered for specific record types; those records are created, fired off to the listener, and then discarded as the stream is being processed.
- See Also:
HSSFListener, POI Event API How To
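The register-then-stream pattern described above can be illustrated with a minimal, dependency-free sketch. It is not Tika's or POI's code: `Rec`, `RecListener`, `Dispatcher`, and the numeric type ids are hypothetical stand-ins for POI's records, `HSSFListener`, and its event dispatch. The sketch only shows that a listener receives the record types it registered for, and that each record is dropped once it has been handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EventStreamSketch {

    /** Hypothetical stand-in for a workbook record: a type id plus a text payload. */
    static final class Rec {
        final int sid;
        final String text;

        Rec(int sid, String text) {
            this.sid = sid;
            this.text = text;
        }
    }

    /** Hypothetical callback, playing the role that HSSFListener plays in POI. */
    interface RecListener {
        void process(Rec rec);
    }

    /** Hypothetical dispatcher: maps record type ids to their registered listeners. */
    static final class Dispatcher {
        private final Map<Integer, List<RecListener>> listeners = new HashMap<>();

        void register(int sid, RecListener listener) {
            listeners.computeIfAbsent(sid, k -> new ArrayList<>()).add(listener);
        }

        /** Walks the stream once; no record is retained after its listeners return. */
        void dispatch(Iterable<Rec> stream) {
            for (Rec rec : stream) {
                List<RecListener> targets = listeners.get(rec.sid);
                if (targets == null) {
                    continue; // nobody registered for this type: skip it
                }
                for (RecListener listener : targets) {
                    listener.process(rec);
                }
            }
        }
    }

    // Made-up type ids, for illustration only.
    static final int LABEL = 1;
    static final int NUMBER = 2;
    static final int STYLE = 3;

    /** Collects text from LABEL and NUMBER records only; STYLE is never registered. */
    static String extractText(List<Rec> stream) {
        StringBuilder out = new StringBuilder();
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.register(LABEL, rec -> out.append(rec.text).append('\n'));
        dispatcher.register(NUMBER, rec -> out.append(rec.text).append('\n'));
        dispatcher.dispatch(stream);
        return out.toString();
    }

    public static void main(String[] args) {
        List<Rec> stream = Arrays.asList(
                new Rec(LABEL, "Name"), new Rec(STYLE, "bold"), new Rec(NUMBER, "42"));
        System.out.print(extractText(stream)); // prints "Name" and "42" on separate lines
    }
}
```

Because only the extracted text is accumulated, memory use in this sketch depends on the output size rather than on the number of records streamed, which is the trade-off the class description attributes to the Event API.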
-
-
Field Summary
Fields (Modifier and Type, Field):
- protected ParseContext context
- protected OfficeParserConfig officeParserConfig
- protected Metadata parentMetadata
-
Constructor Summary
Constructors:
- ExcelExtractor(ParseContext context, Metadata metadata)
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected Detector getDetector()
- protected String getPassword()
  Returns the password to be used for this file, or null if no / default password should be used.
- protected TikaConfig getTikaConfig()
- protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, XHTMLContentHandler xhtml, boolean outputHtml)
  Handle an office document that's embedded at the POIFS level.
- protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, XHTMLContentHandler xhtml, boolean outputHtml)
  Handle an office document that's embedded at the POIFS level.
- protected void handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml)
- protected void handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml)
- protected void handleEmbeddedResource(TikaInputStream resource, Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml)
- boolean isListenForAllRecords()
  Returns true if this parser is configured to listen for all records instead of just the specified few.
- protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, XHTMLContentHandler xhtml, Locale locale)
- protected void parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, XHTMLContentHandler xhtml, Locale locale)
  Extracts text from an Excel Workbook writing the extracted content to the specified Appendable.
- void setListenForAllRecords(boolean listenForAllRecords)
  Specifies whether this parser should listen for all records or just for the specified few.
-
-
-
Field Detail
-
parentMetadata
protected final Metadata parentMetadata
-
officeParserConfig
protected final OfficeParserConfig officeParserConfig
-
context
protected final ParseContext context
-
-
Constructor Detail
-
ExcelExtractor
public ExcelExtractor(ParseContext context, Metadata metadata)
-
-
Method Detail
-
isListenForAllRecords
public boolean isListenForAllRecords()
Returns true if this parser is configured to listen for all records instead of just the specified few.
-
setListenForAllRecords
public void setListenForAllRecords(boolean listenForAllRecords)
Specifies whether this parser should listen for all records or just for the specified few. Note: Under normal operation this setting should be false (the default), but you can experiment with this setting for testing and debugging purposes.
- Parameters:
listenForAllRecords - true if the HSSFListener should be registered to listen for all records, or false if the listener should be configured to only receive specified records.
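To make the effect of this flag concrete, here is a small dependency-free sketch of the two listening modes. It is an illustration under stated assumptions, not Tika's implementation: the `SPECIFIED` set and the integer record type ids are hypothetical, and `countDelivered` simply counts how many records of each type would reach a listener in each mode.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ListenModeSketch {

    // Hypothetical set of record type ids the listener normally registers for.
    static final Set<Integer> SPECIFIED = new HashSet<>(Arrays.asList(1, 2));

    /**
     * Counts, per record type id, how many records would be delivered.
     * With listenForAllRecords == false only ids in SPECIFIED get through;
     * with true, every record in the stream is delivered.
     */
    static Map<Integer, Integer> countDelivered(List<Integer> stream, boolean listenForAllRecords) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int sid : stream) {
            if (listenForAllRecords || SPECIFIED.contains(sid)) {
                counts.merge(sid, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> stream = Arrays.asList(1, 3, 2, 3, 1);
        System.out.println(countDelivered(stream, false)); // {1=2, 2=1}
        System.out.println(countDelivered(stream, true));  // {1=2, 2=1, 3=2}
    }
}
```

A per-type count like this is one way the all-records mode can help with debugging: it reveals which record types occur in a file but are never handled in the default mode.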
-
parse
protected void parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, TikaException
Extracts text from an Excel Workbook writing the extracted content to the specified Appendable.
- Parameters:
filesystem - POI file system
- Throws:
IOException - if an error occurs processing the workbook or writing the extracted content
SAXException
TikaException
-
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
-
getTikaConfig
protected TikaConfig getTikaConfig()
-
getDetector
protected Detector getDetector()
-
getPassword
protected String getPassword()
Returns the password to be used for this file, or null if no / default password should be used
-
handleEmbeddedResource
protected void handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
-
handleEmbeddedResource
protected void handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
-
handleEmbeddedResource
protected void handleEmbeddedResource(TikaInputStream resource, Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Handle an office document that's embedded at the POIFS level
- Throws:
IOException
SAXException
TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Handle an office document that's embedded at the POIFS level
- Throws:
IOException
SAXException
TikaException
-
-