Package org.apache.tika.extractor
Class EmbeddedDocumentUtil
java.lang.Object
org.apache.tika.extractor.EmbeddedDocumentUtil
- All Implemented Interfaces:
Serializable
Utility class to handle common issues with embedded documents.
Use the static methods if all you need is the EmbeddedDocumentExtractor.
Otherwise, create an instance.
Note: Instances are not thread safe. Make sure to create one per thread.
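One common way to follow the one-instance-per-thread rule is a ThreadLocal. The sketch below is only an illustration: `Helper` is a hypothetical stand-in for any non-thread-safe helper, because the constructor of EmbeddedDocumentUtil is not listed on this page.

```java
// Illustrative sketch only. Helper is a hypothetical stand-in, not a Tika class.
final class PerThreadHelperSketch {

    // A helper with unsynchronized mutable state, so it must not be shared across threads.
    static final class Helper {
        private int calls;

        int next() {
            return ++calls;
        }
    }

    // Each thread lazily gets its own Helper the first time it calls current().
    private static final ThreadLocal<Helper> HELPER = ThreadLocal.withInitial(Helper::new);

    static Helper current() {
        return HELPER.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Helper mine = current();
        Thread other = new Thread(() ->
                System.out.println("same instance on other thread: " + (current() == mine)));
        other.start();
        other.join();
        System.out.println("same instance on this thread: " + (current() == mine));
    }
}
```

With thread pools, the per-thread instance lives as long as the worker thread does, so creating a fresh instance per task may be simpler when construction is cheap.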
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
static enum EmbeddedDocumentUtil.EmbeddedResourcePrefix
    Type of embedded resource, used for generating canonical resource names.
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
static String generateResourceName(EmbeddedDocumentUtil.EmbeddedResourcePrefix type, int count, String mediaType)
    Generates a canonical resource name from a type, counter, and media type.
static EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context)
    This offers a uniform way to get an EmbeddedDocumentExtractor from a ParseContext.
getExtension(TikaInputStream is, Metadata metadata)
static String getExtensionForMediaType(String mediaType)
static Parser getStatelessParser(ParseContext context)
    Utility function to get the Parser that was sent in to the ParseContext to handle embedded documents.
static String normalizeMediaType(String mediaType)
    Normalizes internal OCR routing media types (e.g., image/ocr-png) back to standard media types (e.g., image/png).
void parseEmbedded(TikaInputStream tis, ContentHandler handler, Metadata metadata, boolean outputHtml)
static void recordEmbeddedStreamException
static void recordException(Throwable t, Metadata m)
static void setGeneratedResourceName(Metadata metadata, EmbeddedDocumentUtil.EmbeddedResourcePrefix type, int count, String mediaType)
    Sets a generated resource name on the metadata and marks the extension as inferred.
boolean shouldParseEmbedded
static Parser tryToFindExistingLeafParser(Class clazz, ParseContext context)
    Tries to find an existing parser within the ParseContext.
-
Constructor Details
-
EmbeddedDocumentUtil
-
-
Method Details
-
getEmbeddedDocumentExtractor
This offers a uniform way to get an EmbeddedDocumentExtractor from a ParseContext. As of Tika 1.15, an AutoDetectParser will automatically be added to parse embedded documents if no Parser.class is specified in the ParseContext. If you'd prefer not to parse embedded documents, set Parser.class to EmptyParser in the ParseContext.
- Parameters:
context -
- Returns:
- EmbeddedDocumentExtractor
-
getStatelessParser
Utility function to get the Parser that was sent in to the ParseContext to handle embedded documents. If it is stateful, unwrap it to get its stateless delegating parser. If there is no Parser in the parser context, this will return null.
- Parameters:
context -
- Returns:
-
getPasswordProvider
-
getDetector
-
getMimeTypes
-
getExtension
-
normalizeMediaType
Normalizes internal OCR routing media types (e.g., image/ocr-png) back to standard media types (e.g., image/png). Returns the input unchanged if it is not an OCR routing type.
- Parameters:
mediaType - the media type string
- Returns:
- the normalized media type string, or the original if no normalization is needed
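The behavior described above can be illustrated with a small stand-alone sketch. It is not Tika's implementation: the class name `OcrMediaTypeSketch`, the assumption that every OCR routing subtype starts with `ocr-` (inferred only from the `image/ocr-png` example), and the null handling are all hypothetical.

```java
// Illustrative sketch only, not Tika's implementation.
final class OcrMediaTypeSketch {

    // Assumption: OCR routing subtypes are marked by an "ocr-" prefix, as in "image/ocr-png".
    private static final String OCR_PREFIX = "ocr-";

    static String normalize(String mediaType) {
        if (mediaType == null) {
            return null; // assumption: null passes through unchanged
        }
        int slash = mediaType.indexOf('/');
        if (slash < 0) {
            return mediaType; // not a type/subtype pair, leave unchanged
        }
        String subtype = mediaType.substring(slash + 1);
        if (!subtype.startsWith(OCR_PREFIX)) {
            return mediaType; // not an OCR routing type
        }
        // Rebuild "type/" + subtype without the routing prefix.
        return mediaType.substring(0, slash + 1) + subtype.substring(OCR_PREFIX.length());
    }

    public static void main(String[] args) {
        System.out.println(normalize("image/ocr-png")); // prints "image/png"
        System.out.println(normalize("image/png"));     // prints "image/png"
    }
}
```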
-
getExtensionForMediaType
-
generateResourceName
public static String generateResourceName(EmbeddedDocumentUtil.EmbeddedResourcePrefix type, int count, String mediaType)
Generates a canonical resource name from a type, counter, and media type. For example: generateResourceName(EmbeddedResourcePrefix.EMBEDDED, 0, "image/png") returns "embedded-0.png".
- Parameters:
type - the embedded resource type
count - the counter value
mediaType - the media type string, or null if unknown
- Returns:
- the generated resource name with extension
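To make the naming scheme concrete, here is a minimal stand-alone sketch that reproduces the documented example (`embedded-0.png`). Everything in it is an assumption made for illustration: the `Prefix` enum stands in for EmbeddedResourcePrefix, the extension table is a tiny hand-written map, and the `.bin` fallback for a null or unrecognized media type is a guess, since this page does not say which extension is used in that case.

```java
import java.util.Map;

// Illustrative sketch only, not Tika's implementation.
final class ResourceNameSketch {

    // Hypothetical stand-in for EmbeddedDocumentUtil.EmbeddedResourcePrefix.
    enum Prefix {
        EMBEDDED("embedded");

        final String label;

        Prefix(String label) {
            this.label = label;
        }
    }

    // Tiny hand-written lookup table; a real implementation would consult a media type registry.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "image/png", ".png",
            "image/jpeg", ".jpg",
            "application/pdf", ".pdf");

    // Assumption: the fallback extension for null or unknown media types.
    private static final String FALLBACK_EXTENSION = ".bin";

    static String generate(Prefix type, int count, String mediaType) {
        // Map.of rejects null keys, so null is handled before the lookup.
        String extension = mediaType == null
                ? FALLBACK_EXTENSION
                : EXTENSIONS.getOrDefault(mediaType, FALLBACK_EXTENSION);
        return type.label + "-" + count + extension;
    }

    public static void main(String[] args) {
        System.out.println(generate(Prefix.EMBEDDED, 0, "image/png")); // prints "embedded-0.png"
    }
}
```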
-
setGeneratedResourceName
public static void setGeneratedResourceName(Metadata metadata, EmbeddedDocumentUtil.EmbeddedResourcePrefix type, int count, String mediaType)
Sets a generated resource name on the metadata and marks the extension as inferred.
- Parameters:
metadata - the metadata to update
type - the embedded resource type
count - the counter value
mediaType - the media type string, or null if unknown
-
recordException
-
recordEmbeddedStreamException
-
shouldParseEmbedded
-
parseEmbedded
public void parseEmbedded(TikaInputStream tis, ContentHandler handler, Metadata metadata, boolean outputHtml) throws IOException, SAXException
- Throws:
IOException
SAXException
-
tryToFindExistingLeafParser
Tries to find an existing parser within the ParseContext. It looks inside CompositeParsers and ParserDecorators. The use case is when a parser needs to parse an internal stream that is part of the document, e.g. an RTF body inside an MSG file. Can return null if the context contains no parser or the correct parser can't be found.
- Parameters:
clazz - parser class to search for
context -
- Returns:
-
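The search described for tryToFindExistingLeafParser can be pictured as a depth-first walk through composites and decorators. The sketch below shows that idea with hypothetical stand-in types (`Parser`, `Composite`, `Decorator`, `RtfParser`, `MsgParser`); it does not use Tika's classes and is not Tika's implementation.

```java
import java.util.List;

// Illustrative sketch only; all nested types are hypothetical stand-ins.
final class LeafParserSearchSketch {

    interface Parser {
    }

    // Stand-in for a composite that delegates to several child parsers.
    static final class Composite implements Parser {
        final List<Parser> children;

        Composite(Parser... children) {
            this.children = List.of(children);
        }
    }

    // Stand-in for a decorator that wraps exactly one parser.
    static final class Decorator implements Parser {
        final Parser wrapped;

        Decorator(Parser wrapped) {
            this.wrapped = wrapped;
        }
    }

    static final class RtfParser implements Parser {
    }

    static final class MsgParser implements Parser {
    }

    // Depth-first search; returns null when root is null or no matching parser exists.
    static <T extends Parser> T find(Class<T> clazz, Parser root) {
        if (root == null) {
            return null;
        }
        if (clazz.isInstance(root)) {
            return clazz.cast(root);
        }
        if (root instanceof Decorator) {
            return find(clazz, ((Decorator) root).wrapped);
        }
        if (root instanceof Composite) {
            for (Parser child : ((Composite) root).children) {
                T hit = find(clazz, child);
                if (hit != null) {
                    return hit;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Parser root = new Composite(new MsgParser(), new Decorator(new RtfParser()));
        System.out.println(find(RtfParser.class, root) != null); // prints "true"
    }
}
```

As with the documented method, callers of such a search need a null check, because both an empty context and a missing parser type produce null.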