Package org.apache.tika.parser.microsoft
Class ExcelExtractor
java.lang.Object
org.apache.tika.parser.microsoft.ExcelExtractor
Excel parser implementation which uses POI's Event API
to handle the contents of a Workbook.
The Event API uses a much smaller memory footprint than
HSSFWorkbook when processing Excel files,
but at the cost of more complexity.
With the Event API a listener is registered for
specific record types and those records are created,
fired off to the listener and then discarded as the stream
is being processed.
See Also:
- HSSFListener
- POI Event API How To
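The listener-based flow described above can be illustrated with a small, self-contained model. This is only a sketch: `SketchRecord`, `SketchListener`, `SketchDispatcher`, and `EventModelDemo` are hypothetical names invented for illustration, not POI or Tika classes, and the record type codes are arbitrary. It shows listeners registered for specific record types, with each record created, handed to the interested listeners, and then dropped instead of being accumulated in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a single record read from a workbook stream.
final class SketchRecord {
    final int type;
    final String payload;

    SketchRecord(int type, String payload) {
        this.type = type;
        this.payload = payload;
    }
}

// Hypothetical listener: one callback per delivered record.
interface SketchListener {
    void processRecord(SketchRecord record);
}

// Routes each record to the listeners registered for its type and keeps no reference afterwards.
final class SketchDispatcher {
    private final Map<Integer, List<SketchListener>> listeners = new HashMap<>();

    void addListener(SketchListener listener, int recordType) {
        listeners.computeIfAbsent(recordType, k -> new ArrayList<>()).add(listener);
    }

    // Simulates streaming: records are built one at a time and never stored.
    // Returns how many records reached at least one listener.
    int process(int[] types, String[] payloads) {
        int delivered = 0;
        for (int i = 0; i < types.length; i++) {
            SketchRecord record = new SketchRecord(types[i], payloads[i]);
            List<SketchListener> interested = listeners.get(record.type);
            if (interested != null) {
                for (SketchListener listener : interested) {
                    listener.processRecord(record);
                }
                delivered++;
            }
            // The record becomes unreachable here; nothing is accumulated.
        }
        return delivered;
    }
}

final class EventModelDemo {
    // Arbitrary type codes chosen for this sketch only.
    static final int LABEL = 1;
    static final int NUMBER = 2;
    static final int STYLE = 3;

    static List<String> run() {
        List<String> seen = new ArrayList<>();
        SketchDispatcher dispatcher = new SketchDispatcher();
        dispatcher.addListener(r -> seen.add(r.payload), LABEL);
        dispatcher.addListener(r -> seen.add(r.payload), NUMBER);
        // No listener is registered for STYLE, so that record is skipped.
        dispatcher.process(new int[] {LABEL, STYLE, NUMBER},
                           new String[] {"Name", "bold", "42"});
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [Name, 42]
    }
}
```

The trade-off mentioned in the class description shows up even in this toy: the caller never holds a complete workbook model, so any state that spans records has to be tracked by the listener itself.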
-
Field Summary
Fields
Modifier and Type | Field | Description
protected final ParseContext | context |
protected final OfficeParserConfig | officeParserConfig |
protected final Metadata | parentMetadata |
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
protected Detector | getDetector |
protected String | getPassword | Returns the password to be used for this file, or null if no / default password should be used
protected TikaConfig | getTikaConfig |
protected void | handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, XHTMLContentHandler xhtml, boolean outputHtml) | Handle an office document that's embedded at the POIFS level
protected void | handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, Metadata metadata, String resourceName, XHTMLContentHandler xhtml, boolean outputHtml) | Handle an office document that's embedded at the POIFS level
protected void | handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, XHTMLContentHandler xhtml, boolean outputHtml) | Handle an office document that's embedded at the POIFS level
protected void | handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) |
protected void | handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) |
protected void | handleEmbeddedResource(TikaInputStream resource, Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) |
boolean | isListenForAllRecords() | Returns true if this parser is configured to listen for all records instead of just the specified few.
protected void | parse(org.apache.poi.poifs.filesystem.DirectoryNode root, XHTMLContentHandler xhtml, Locale locale) |
protected void | parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, XHTMLContentHandler xhtml, Locale locale) | Extracts text from an Excel Workbook, writing the extracted content to the specified Appendable.
void | setListenForAllRecords(boolean listenForAllRecords) | Specifies whether this parser should listen for all records or just for the specified few.
static String | tryToGetMsgTitle(org.apache.poi.poifs.filesystem.DirectoryEntry node, String defaultVal) |
-
Field Details
-
parentMetadata
-
officeParserConfig
-
context
-
-
Constructor Details
-
ExcelExtractor
-
-
Method Details
-
isListenForAllRecords
public boolean isListenForAllRecords()
Returns true if this parser is configured to listen for all records instead of just the specified few.
-
setListenForAllRecords
public void setListenForAllRecords(boolean listenForAllRecords)
Specifies whether this parser should listen for all records or just for the specified few. Note: Under normal operation this setting should be false (the default), but you can experiment with this setting for testing and debugging purposes.
Parameters:
listenForAllRecords - true if the HSSFListener should be registered to listen for all records, or false if the listener should be configured to only receive specified records.
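The difference between the two modes can be sketched with a small filter. This is an illustrative model only, assuming that "the specified few" is a fixed set of record type codes; `SketchRecordFilter` and the codes used with it are hypothetical and do not reflect the actual ExcelExtractor internals.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the listen-for-all switch; not part of Tika or POI.
final class SketchRecordFilter {
    private final Set<Integer> specifiedTypes;
    // Mirrors the documented default: only the specified record types are received.
    private boolean listenForAllRecords = false;

    SketchRecordFilter(Integer... specifiedTypes) {
        this.specifiedTypes = new HashSet<>(Arrays.asList(specifiedTypes));
    }

    boolean isListenForAllRecords() {
        return listenForAllRecords;
    }

    void setListenForAllRecords(boolean listenForAllRecords) {
        this.listenForAllRecords = listenForAllRecords;
    }

    // A record type is accepted if it was specified, or if everything is accepted.
    boolean accepts(int recordType) {
        return listenForAllRecords || specifiedTypes.contains(recordType);
    }
}
```

With `new SketchRecordFilter(1, 2)`, a call to `accepts(99)` returns false until `setListenForAllRecords(true)` is called, which matches the description of the flag as a testing and debugging aid and not a normal operating mode.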
-
parse
protected void parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, TikaException
Extracts text from an Excel Workbook, writing the extracted content to the specified Appendable.
Parameters:
filesystem - POI file system
Throws:
IOException - if an error occurs processing the workbook or writing the extracted content
SAXException
TikaException
-
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, TikaException
Throws:
IOException
SAXException
TikaException
-
getTikaConfig
-
getDetector
-
getPassword
Returns the password to be used for this file, or null if no password (or the default password) should be used.
-
handleEmbeddedResource
protected void handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Throws:
IOException
SAXException
TikaException
-
handleEmbeddedResource
protected void handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Throws:
IOException
SAXException
TikaException
-
handleEmbeddedResource
protected void handleEmbeddedResource(TikaInputStream resource, Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Throws:
IOException
SAXException
TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Handle an office document that's embedded at the POIFS level
Throws:
IOException
SAXException
TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Handle an office document that's embedded at the POIFS level
Throws:
IOException
SAXException
TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, Metadata metadata, String resourceName, XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, TikaException
Handle an office document that's embedded at the POIFS level
Throws:
IOException
SAXException
TikaException
-
tryToGetMsgTitle
-