Class EmbeddedDocumentUtil

java.lang.Object
org.apache.tika.extractor.EmbeddedDocumentUtil
All Implemented Interfaces:
Serializable

public class EmbeddedDocumentUtil extends Object implements Serializable
Utility class to handle common issues with embedded documents.

Use the static methods if all you need is the EmbeddedDocumentExtractor. Otherwise, create an instance.

Note: This class is not thread-safe. Create one instance per thread.

  • Constructor Details

    • EmbeddedDocumentUtil

      public EmbeddedDocumentUtil(ParseContext context)
  • Method Details

    • getEmbeddedDocumentExtractor

      public static EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context)
      This offers a uniform way to get an EmbeddedDocumentExtractor from a ParseContext. As of Tika 1.15, an AutoDetectParser will automatically be added to parse embedded documents if no Parser.class is specified in the ParseContext.

      If you'd prefer not to parse embedded documents, set Parser.class to EmptyParser in the ParseContext.

      Parameters:
      context - the ParseContext to query
      Returns:
      EmbeddedDocumentExtractor
    • getStatelessParser

      public static Parser getStatelessParser(ParseContext context)
      Utility method that returns the Parser set in the ParseContext to handle embedded documents. If that parser is stateful, it is unwrapped to return the stateless parser it delegates to.

      If there is no Parser in the ParseContext, this returns null.

      Parameters:
      context - the ParseContext to query
      Returns:
      the stateless Parser, or null if the ParseContext contains no Parser
    • getPasswordProvider

      public PasswordProvider getPasswordProvider()
    • getDetector

      public Detector getDetector()
    • getMimeTypes

      public MimeTypes getMimeTypes()
    • getExtension

      public String getExtension(TikaInputStream is, Metadata metadata)
    • normalizeMediaType

      public static String normalizeMediaType(String mediaType)
      Normalizes internal OCR routing media types (e.g., image/ocr-png) back to standard media types (e.g., image/png). Returns the input unchanged if it is not an OCR routing type.
      Parameters:
      mediaType - the media type string
      Returns:
      the normalized media type string, or the original string if no normalization is needed
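
      The mapping described above can be sketched in plain Java. This is an illustration only, not Tika's implementation: the class name OcrMediaTypeSketch, the assumption that OCR routing types are marked by an "ocr-" prefix on the subtype (inferred from the image/ocr-png example), and the null handling are all hypothetical.

```java
// Hypothetical sketch; not Tika's actual implementation.
final class OcrMediaTypeSketch {

    // Assumption: OCR routing types mark the subtype with an "ocr-" prefix,
    // as in the documented example image/ocr-png.
    private static final String OCR_MARKER = "ocr-";

    private OcrMediaTypeSketch() {
    }

    static String normalize(String mediaType) {
        // Assumption: null is passed through; the documentation does not say.
        if (mediaType == null) {
            return null;
        }
        int slash = mediaType.indexOf('/');
        // Not a type/subtype pair, or the subtype does not carry the marker:
        // return the input unchanged.
        if (slash < 0 || !mediaType.startsWith(OCR_MARKER, slash + 1)) {
            return mediaType;
        }
        // Keep "type/" and drop the marker: image/ocr-png becomes image/png.
        return mediaType.substring(0, slash + 1)
                + mediaType.substring(slash + 1 + OCR_MARKER.length());
    }
}
```

      Under these assumptions, OcrMediaTypeSketch.normalize("image/ocr-png") yields "image/png", while "image/png" and "text/plain" come back unchanged.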
    • getExtensionForMediaType

      public static String getExtensionForMediaType(String mediaType)
    • generateResourceName

      public static String generateResourceName(EmbeddedDocumentUtil.EmbeddedResourcePrefix type, int count, String mediaType)
      Generates a canonical resource name from a type, counter, and media type. For example: generateResourceName(EmbeddedResourcePrefix.EMBEDDED, 0, "image/png") returns "embedded-0.png".
      Parameters:
      type - the embedded resource type
      count - the counter value
      mediaType - the media type string, or null if unknown
      Returns:
      the generated resource name with extension
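
      The naming scheme in the example above (prefix, hyphen, counter, extension) might be sketched as follows. Everything here is a stand-in for illustration: the class ResourceNameSketch, the Prefix enum (modelled on EmbeddedResourcePrefix, with only the EMBEDDED value the documentation mentions), the small extension table, and the ".bin" fallback for null or unrecognized media types are assumptions, not Tika's behavior.

```java
import java.util.Map;

// Hypothetical sketch; not Tika's actual implementation.
final class ResourceNameSketch {

    // Stand-in for EmbeddedDocumentUtil.EmbeddedResourcePrefix. The "embedded"
    // text is inferred from the documented result "embedded-0.png".
    enum Prefix {
        EMBEDDED("embedded");

        final String text;

        Prefix(String text) {
            this.text = text;
        }
    }

    // Hypothetical lookup table used by this sketch only.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "image/png", ".png",
            "image/jpeg", ".jpg",
            "application/pdf", ".pdf");

    // Assumed fallback for null or unrecognized media types.
    private static final String UNKNOWN_EXTENSION = ".bin";

    private ResourceNameSketch() {
    }

    static String generate(Prefix type, int count, String mediaType) {
        // Map.of rejects null keys, so null is handled before the lookup.
        String extension = mediaType == null
                ? UNKNOWN_EXTENSION
                : EXTENSIONS.getOrDefault(mediaType, UNKNOWN_EXTENSION);
        return type.text + "-" + count + extension;
    }
}
```

      With these assumptions, ResourceNameSketch.generate(ResourceNameSketch.Prefix.EMBEDDED, 0, "image/png") yields "embedded-0.png", matching the documented example.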
    • setGeneratedResourceName

      public static void setGeneratedResourceName(Metadata metadata, EmbeddedDocumentUtil.EmbeddedResourcePrefix type, int count, String mediaType)
      Sets a generated resource name on the metadata and marks the extension as inferred.
      Parameters:
      metadata - the metadata to update
      type - the embedded resource type
      count - the counter value
      mediaType - the media type string, or null if unknown
    • recordException

      public static void recordException(Throwable t, Metadata m)
    • recordEmbeddedStreamException

      public static void recordEmbeddedStreamException(Throwable t, Metadata m)
    • shouldParseEmbedded

      public boolean shouldParseEmbedded(Metadata m)
    • parseEmbedded

      public void parseEmbedded(TikaInputStream tis, ContentHandler handler, Metadata metadata, boolean outputHtml) throws IOException, SAXException
      Throws:
      IOException
      SAXException
    • tryToFindExistingLeafParser

      public static Parser tryToFindExistingLeafParser(Class clazz, ParseContext context)
      Tries to find an existing parser of the given class within the ParseContext. It looks inside CompositeParsers and ParserDecorators. The use case is a parser that needs to parse an internal stream that is part of the document itself, e.g. an RTF body inside an MSG file.

      Returns null if the context contains no parser or no parser of the requested class can be found.

      Parameters:
      clazz - parser class to search for
      context - the ParseContext to search
      Returns:
      the matching parser, or null if none is found
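
      The search through composites and decorators described for tryToFindExistingLeafParser can be pictured as a depth-first walk. The sketch below uses hypothetical stand-in types (Node, Composite, Decorator, RtfLeaf, TextLeaf) rather than Tika's Parser, CompositeParser, and ParserDecorator, and it matches with Class.isInstance; how Tika itself compares classes is not stated here.

```java
import java.util.List;

// Hypothetical sketch; not Tika's actual implementation.
final class LeafParserSearchSketch {

    // Stand-in for a parser.
    interface Node {
    }

    // Stand-in for a composite that holds several child parsers.
    static final class Composite implements Node {
        final List<Node> children;

        Composite(List<Node> children) {
            this.children = children;
        }
    }

    // Stand-in for a decorator that wraps exactly one parser.
    static final class Decorator implements Node {
        final Node wrapped;

        Decorator(Node wrapped) {
            this.wrapped = wrapped;
        }
    }

    // Two leaf types to search for.
    static final class RtfLeaf implements Node {
    }

    static final class TextLeaf implements Node {
    }

    private LeafParserSearchSketch() {
    }

    // Depth-first search; returns null when root is null or nothing matches.
    static <T extends Node> T find(Class<T> clazz, Node root) {
        if (root == null) {
            return null;
        }
        if (clazz.isInstance(root)) {
            return clazz.cast(root);
        }
        if (root instanceof Decorator) {
            // Look through the wrapper at the parser it decorates.
            return find(clazz, ((Decorator) root).wrapped);
        }
        if (root instanceof Composite) {
            // Try each child in order and stop at the first match.
            for (Node child : ((Composite) root).children) {
                T found = find(clazz, child);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }
}
```

      In this sketch, searching a Composite that contains a TextLeaf and a Decorator wrapping an RtfLeaf for RtfLeaf.class returns the wrapped RtfLeaf instance, and searching a tree with no RtfLeaf returns null.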