Package org.apache.tika.extractor
Class ParsingEmbeddedDocumentExtractor
java.lang.Object
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor
- All Implemented Interfaces:
EmbeddedDocumentExtractor
- Direct Known Subclasses:
UnpackExtractor
Helper class for parsers of package archives or other compound document
formats that support embedded or attached component documents.
- Since:
- Apache Tika 0.8
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
protected boolean | checkEmbeddedLimits(ParseRecord parseRecord) | Checks embedded document limits from ParseRecord.
boolean | isWriteFileNameToContent() | Returns whether to write file names to content based on SAXOutputConfig in the ParseContext.
void | parseEmbedded(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext, boolean outputHtml) | Processes the supplied embedded resource, calling the delegating parser with the appropriate details.
protected void | recordException(Exception e, ParseContext context) |
boolean | shouldParseEmbedded(Metadata metadata) | Determines whether the given embedded document should be parsed.
-
Field Details
-
context
-
-
Constructor Details
-
ParsingEmbeddedDocumentExtractor
-
-
Method Details
-
shouldParseEmbedded
Description copied from interface: EmbeddedDocumentExtractor
Determines whether the given embedded document should be parsed.
Note: Implementations may throw EmbeddedLimitReachedException (a RuntimeException) if a limit is exceeded and throwing is configured.
- Specified by:
shouldParseEmbedded in interface EmbeddedDocumentExtractor
- Parameters:
metadata - the metadata for the embedded document
- Returns:
- true if the embedded document should be parsed
-
checkEmbeddedLimits
Checks embedded document limits from ParseRecord.
If throwOnMaxDepth or throwOnMaxCount is configured and the respective limit is hit, an EmbeddedLimitReachedException is thrown. Otherwise, returns false and sets the appropriate limit flag on the ParseRecord.
Note: The count limit is a hard stop (once hit, no more embedded docs are parsed). The depth limit only affects documents at that depth; sibling documents at shallower depths will still be parsed.
Subclasses that override parseEmbedded() should call this method to enforce limits.
- Parameters:
parseRecord - the parse record to check
- Returns:
- true if the embedded document should be parsed, false if limits are exceeded
- Throws:
EmbeddedLimitReachedException - if a limit is exceeded and throwing is configured
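The difference between the two limits can be easier to see in code. The following is a minimal, self-contained sketch of the behavior described above, not Tika's implementation: LimitRecordSketch, LimitReachedSketch and EmbeddedLimitsSketch are hypothetical stand-ins for ParseRecord, EmbeddedLimitReachedException and the extractor, and the exact comparisons (count >= maxCount, depth > maxDepth) are assumptions made for illustration.

```java
/** Hypothetical stand-in for ParseRecord; field names are illustrative only. */
final class LimitRecordSketch {
    final int maxDepth;
    final int maxCount;
    final boolean throwOnMaxDepth;
    final boolean throwOnMaxCount;
    int depth;              // nesting depth of the candidate embedded document
    int count;              // embedded documents parsed so far
    boolean depthLimitHit;  // set when a document is skipped for depth
    boolean countLimitHit;  // set when a document is skipped for count

    LimitRecordSketch(int maxDepth, int maxCount,
                      boolean throwOnMaxDepth, boolean throwOnMaxCount) {
        this.maxDepth = maxDepth;
        this.maxCount = maxCount;
        this.throwOnMaxDepth = throwOnMaxDepth;
        this.throwOnMaxCount = throwOnMaxCount;
    }
}

/** Hypothetical stand-in for EmbeddedLimitReachedException (unchecked). */
final class LimitReachedSketch extends RuntimeException {
    LimitReachedSketch(String message) {
        super(message);
    }
}

final class EmbeddedLimitsSketch {
    /** Returns true if the candidate should be parsed, false if a limit is exceeded. */
    static boolean checkEmbeddedLimits(LimitRecordSketch record) {
        // Count limit: a hard stop, because count never decreases.
        if (record.count >= record.maxCount) {
            if (record.throwOnMaxCount) {
                throw new LimitReachedSketch("max embedded count reached");
            }
            record.countLimitHit = true;
            return false;
        }
        // Depth limit: only rejects candidates that are too deep right now.
        if (record.depth > record.maxDepth) {
            if (record.throwOnMaxDepth) {
                throw new LimitReachedSketch("max embedded depth reached");
            }
            record.depthLimitHit = true;
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        LimitRecordSketch record = new LimitRecordSketch(2, 3, false, false);
        record.depth = 3;
        System.out.println("too deep: " + checkEmbeddedLimits(record));
        record.depth = 1;
        System.out.println("shallower sibling: " + checkEmbeddedLimits(record));
    }
}
```

In this sketch the depth flag is recorded but does not block later, shallower candidates, whereas the count check keeps failing once the maximum is reached.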
-
parseEmbedded
public void parseEmbedded(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext, boolean outputHtml) throws SAXException, IOException
Description copied from interface: EmbeddedDocumentExtractor
Processes the supplied embedded resource, calling the delegating parser with the appropriate details.
- Specified by:
parseEmbedded in interface EmbeddedDocumentExtractor
- Parameters:
tis - The embedded resource
handler - The handler to use
metadata - The metadata for the embedded resource
parseContext - The parse context
outputHtml - Should we output HTML for this resource, or has the parser already done so?
- Throws:
SAXException
IOException
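To illustrate what the outputHtml flag might control, here is a small sketch that uses only the JDK's SAX classes. It is not Tika's code: EmbeddedWrapperSketch and EventLogSketch are hypothetical names, and the wrapper markup (a div with class "package-entry" and an h1 for the file name) is an assumption chosen for illustration, as is the writeFileName parameter.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Records SAX events as a simple string so the output can be inspected. */
final class EventLogSketch extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.append('<').append(localName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</").append(localName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

final class EmbeddedWrapperSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Emits the embedded text, wrapped in markup only when outputHtml is true. */
    static void emitEmbedded(ContentHandler handler, String name, String text,
                             boolean outputHtml, boolean writeFileName) throws SAXException {
        if (outputHtml) {
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "class", "class", "CDATA", "package-entry");
            handler.startElement(XHTML, "div", "div", attrs);
            if (writeFileName && name != null && !name.isEmpty()) {
                handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
                char[] chars = name.toCharArray();
                handler.characters(chars, 0, chars.length);
                handler.endElement(XHTML, "h1", "h1");
            }
        }
        // Stand-in for the delegating parser's own output.
        char[] body = text.toCharArray();
        handler.characters(body, 0, body.length);
        if (outputHtml) {
            handler.endElement(XHTML, "div", "div");
        }
    }

    /** Convenience wrapper that returns the recorded events as a string. */
    static String render(String name, String text, boolean outputHtml, boolean writeFileName) {
        EventLogSketch log = new EventLogSketch();
        try {
            emitEmbedded(log, name, text, outputHtml, writeFileName);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return log.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("a.txt", "hello", true, true));
        System.out.println(render("a.txt", "hello", false, true));
    }
}
```

When outputHtml is false, the sketch assumes the calling parser has already written any surrounding markup and passes the content through unwrapped.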
-
recordException
-
getDelegatingParser
-
isWriteFileNameToContent
public boolean isWriteFileNameToContent()
Returns whether to write file names to content based on SAXOutputConfig in the ParseContext. Defaults to true if no config is present.
- Returns:
- true if file names should be written to content
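The default-to-true rule can be sketched as a class-keyed lookup that falls back when nothing is registered. This is an illustration under stated assumptions rather than Tika's code: ContextSketch is a minimal stand-in for ParseContext, OutputConfigSketch is a hypothetical stand-in for SAXOutputConfig, and its writeFileNameToContent field is an invented name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for SAXOutputConfig; the field name is an assumption. */
final class OutputConfigSketch {
    final boolean writeFileNameToContent;

    OutputConfigSketch(boolean writeFileNameToContent) {
        this.writeFileNameToContent = writeFileNameToContent;
    }
}

/** Minimal stand-in for a class-keyed context such as ParseContext. */
final class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Class.cast returns null for a null argument, so a missing entry yields null.
        return key.cast(entries.get(key));
    }
}

final class FileNameFlagSketch {
    /** Returns the configured flag, or true when no config is present. */
    static boolean isWriteFileNameToContent(ContextSketch context) {
        OutputConfigSketch config = context.get(OutputConfigSketch.class);
        return config == null || config.writeFileNameToContent;
    }

    public static void main(String[] args) {
        ContextSketch context = new ContextSketch();
        System.out.println("no config: " + isWriteFileNameToContent(context));
        context.set(OutputConfigSketch.class, new OutputConfigSketch(false));
        System.out.println("disabled: " + isWriteFileNameToContent(context));
    }
}
```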
-