Class ParsingEmbeddedDocumentExtractor

java.lang.Object
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor
All Implemented Interfaces:
EmbeddedDocumentExtractor
Direct Known Subclasses:
UnpackExtractor

public class ParsingEmbeddedDocumentExtractor extends Object implements EmbeddedDocumentExtractor
Helper class for parsers of package archives or other compound document formats that support embedded or attached component documents.
Since:
Apache Tika 0.8
  • Field Details

  • Constructor Details

    • ParsingEmbeddedDocumentExtractor

      public ParsingEmbeddedDocumentExtractor(ParseContext context)
  • Method Details

    • shouldParseEmbedded

      public boolean shouldParseEmbedded(Metadata metadata)
      Description copied from interface: EmbeddedDocumentExtractor
      Determines whether the given embedded document should be parsed.

      Note: Implementations may throw EmbeddedLimitReachedException (a RuntimeException) if a limit is exceeded and throwing is configured.

      Specified by:
      shouldParseEmbedded in interface EmbeddedDocumentExtractor
      Parameters:
      metadata - the metadata for the embedded document
      Returns:
      true if the embedded document should be parsed
    • checkEmbeddedLimits

      protected boolean checkEmbeddedLimits(ParseRecord parseRecord)
      Checks the embedded document limits tracked by the given ParseRecord.

      If throwOnMaxDepth or throwOnMaxCount is configured and the respective limit is hit, an EmbeddedLimitReachedException is thrown. Otherwise, this method returns false and sets the corresponding limit flag on the ParseRecord.

      Note: The count limit is a hard stop: once it is hit, no more embedded documents are parsed. The depth limit only affects documents at that depth; sibling documents at shallower depths are still parsed.

      Subclasses that override parseEmbedded() should call this method to enforce limits.

      Parameters:
      parseRecord - the parse record to check
      Returns:
      true if the embedded document should be parsed, false if limits are exceeded
      Throws:
      EmbeddedLimitReachedException - if a limit is exceeded and throwing is configured
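
      The difference between the two limits described above can be illustrated with a small, self-contained sketch. It does not use Tika: EmbeddedLimitSketch, Record, and LimitReached are hypothetical stand-ins, the field names are invented, and the exact comparisons (count >= maxCount, depth > maxDepth, negative meaning "no limit") are assumptions rather than the library's actual logic.

```java
// Hypothetical stand-ins only: none of these types are Tika classes.
class EmbeddedLimitSketch {

    /** Stand-in for EmbeddedLimitReachedException (likewise a RuntimeException). */
    static class LimitReached extends RuntimeException {
        LimitReached(String message) {
            super(message);
        }
    }

    /** Stand-in for the limit state a ParseRecord carries; all field names are assumed. */
    static class Record {
        int maxDepth = -1;            // assumption: a negative value means "no limit"
        int maxCount = -1;            // assumption: a negative value means "no limit"
        boolean throwOnMaxDepth;
        boolean throwOnMaxCount;
        int depth;                    // depth of the embedded document being considered
        int count;                    // embedded documents parsed so far
        boolean depthLimitReached;    // flag set when the depth limit blocks a document
        boolean countLimitReached;    // flag set when the count limit blocks a document
    }

    /** Returns true if the document may be parsed; otherwise sets a flag or throws. */
    static boolean check(Record r) {
        if (r.maxCount >= 0 && r.count >= r.maxCount) {
            if (r.throwOnMaxCount) {
                throw new LimitReached("max embedded count reached");
            }
            r.countLimitReached = true;
            return false;
        }
        if (r.maxDepth >= 0 && r.depth > r.maxDepth) {
            if (r.throwOnMaxDepth) {
                throw new LimitReached("max embedded depth reached");
            }
            r.depthLimitReached = true;
            return false;
        }
        return true;
    }

    /** Visits documents at the given depths in order; returns how many were parsed. */
    static int simulate(Record r, int... depths) {
        int parsed = 0;
        for (int d : depths) {
            r.depth = d;
            if (check(r)) {
                r.count++;
                parsed++;
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Depth limit: the depth-2 document is skipped, the later depth-1 sibling is not.
        Record depthLimited = new Record();
        depthLimited.maxDepth = 1;
        System.out.println(simulate(depthLimited, 1, 2, 1)); // prints 2

        // Count limit: once two documents are parsed, every later one is refused.
        Record countLimited = new Record();
        countLimited.maxCount = 2;
        System.out.println(simulate(countLimited, 1, 1, 1, 1)); // prints 2
    }
}
```

      In this sketch the count check never recovers once it trips, while the depth check is evaluated per document, which is why a shallower sibling still passes after a deeper document was refused.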
    • parseEmbedded

      public void parseEmbedded(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext, boolean outputHtml) throws SAXException, IOException
      Description copied from interface: EmbeddedDocumentExtractor
      Processes the supplied embedded resource, calling the delegating parser with the appropriate details.
      Specified by:
      parseEmbedded in interface EmbeddedDocumentExtractor
      Parameters:
      tis - The embedded resource
      handler - The handler to use
      metadata - The metadata for the embedded resource
      parseContext - The parse context
      outputHtml - Should we output HTML for this resource, or has the parser already done so?
      Throws:
      SAXException
      IOException
    • recordException

      protected void recordException(Exception e, ParseContext context)
    • getDelegatingParser

      public Parser getDelegatingParser()
    • isWriteFileNameToContent

      public boolean isWriteFileNameToContent()
      Returns whether file names should be written to content, based on the SAXOutputConfig in the ParseContext. Defaults to true if no config is present.
      Returns:
      true if file names should be written to content
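
      The "defaults to true if no config is present" behavior can be sketched without Tika as a lookup in a class-keyed context. Context and OutputConfig below are hypothetical stand-ins for ParseContext and SAXOutputConfig; the writeFileNameToContent field is an invented name, since this page does not show the real config's accessors.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only: none of these types are Tika classes.
class FileNameDefaultSketch {

    /** Stand-in for SAXOutputConfig; the field name is an assumption. */
    static class OutputConfig {
        final boolean writeFileNameToContent;

        OutputConfig(boolean writeFileNameToContent) {
            this.writeFileNameToContent = writeFileNameToContent;
        }
    }

    /** Minimal class-keyed context standing in for ParseContext. */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Returns the configured value, or true when no config has been set. */
    static boolean isWriteFileNameToContent(Context context) {
        OutputConfig config = context.get(OutputConfig.class);
        return config == null || config.writeFileNameToContent;
    }

    public static void main(String[] args) {
        Context empty = new Context();
        System.out.println(isWriteFileNameToContent(empty)); // prints true

        Context configured = new Context();
        configured.set(OutputConfig.class, new OutputConfig(false));
        System.out.println(isWriteFileNameToContent(configured)); // prints false
    }
}
```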