Package org.apache.tika.extractor
Class EmbeddedDocumentUtil
java.lang.Object
org.apache.tika.extractor.EmbeddedDocumentUtil
- All Implemented Interfaces:
Serializable
Utility class to handle common issues with embedded documents.
Use statically if all that is needed is getting the EmbeddedDocumentExtractor.
Otherwise, create an instance.
Note: This class is not thread-safe. Create one instance per thread.
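One common way to follow the one-instance-per-thread rule is a ThreadLocal. The sketch below is a minimal illustration, not Tika code: NotThreadSafeHelper is a hypothetical stand-in for a non-thread-safe utility such as EmbeddedDocumentUtil, and PerThreadHelperDemo is an illustrative name.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a helper that, like EmbeddedDocumentUtil, is not thread-safe.
class NotThreadSafeHelper {
    private int calls = 0; // unsynchronized mutable state

    int handle() {
        return ++calls;
    }
}

class PerThreadHelperDemo {
    // Each thread lazily creates its own helper, so no instance is shared.
    static final ThreadLocal<NotThreadSafeHelper> HELPER =
            ThreadLocal.withInitial(NotThreadSafeHelper::new);

    // Calls the helper once on a brand-new worker thread and returns that thread's call count.
    static int callOnNewThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = () -> HELPER.get().handle();
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new IllegalStateException("worker thread failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int onMain = HELPER.get().handle();
        int onWorker = callOnNewThread();
        System.out.println("main thread count: " + onMain + ", worker thread count: " + onWorker);
    }
}
```

With a real EmbeddedDocumentUtil the supplier would construct the instance from that thread's ParseContext instead of using a no-argument constructor.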
Constructor Summary
-
Method Summary
- static EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context): This offers a uniform way to get an EmbeddedDocumentExtractor from a ParseContext.
- getExtension(TikaInputStream is, Metadata metadata)
- static Parser getStatelessParser(ParseContext context): Utility function to get the Parser that was sent in to the ParseContext to handle embedded documents.
- void parseEmbedded(InputStream inputStream, ContentHandler handler, Metadata metadata, boolean outputHtml)
- static void recordEmbeddedStreamException
- static void recordException(Throwable t, Metadata m)
- boolean shouldParseEmbedded
- static Parser tryToFindExistingLeafParser(Class clazz, ParseContext context): Tries to find an existing parser within the ParseContext.
-
Constructor Details
-
EmbeddedDocumentUtil
-
-
Method Details
-
getEmbeddedDocumentExtractor
This offers a uniform way to get an EmbeddedDocumentExtractor from a ParseContext. As of Tika 1.15, an AutoDetectParser will automatically be added to parse embedded documents if no Parser.class is specified in the ParseContext. If you'd prefer not to parse embedded documents, set Parser.class to EmptyParser in the ParseContext.
- Parameters:
context
- Returns:
EmbeddedDocumentExtractor
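The lookup-with-default behaviour described for getEmbeddedDocumentExtractor can be modelled without Tika on the classpath. The sketch below is a simplified illustration under stated assumptions: ToyContext, ToyParser, ToyEmptyParser, ToyAutoDetectParser and ToyExtractorLookup are hypothetical stand-ins for ParseContext, Parser, EmptyParser, AutoDetectParser and the utility method, and the real implementation may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-ins for Parser, EmptyParser and AutoDetectParser (names are hypothetical).
interface ToyParser {
    String describe();
}

class ToyEmptyParser implements ToyParser {
    public String describe() {
        return "empty: embedded documents are skipped";
    }
}

class ToyAutoDetectParser implements ToyParser {
    public String describe() {
        return "auto-detect: embedded documents are parsed";
    }
}

// Toy stand-in for ParseContext: a map from a class to an instance of that class.
class ToyContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key)); // cast(null) yields null
    }
}

class ToyExtractorLookup {
    // Returns the parser registered in the context, or an auto-detecting
    // default when the caller registered none.
    static ToyParser parserForEmbedded(ToyContext context) {
        ToyParser registered = context.get(ToyParser.class);
        return registered != null ? registered : new ToyAutoDetectParser();
    }

    public static void main(String[] args) {
        ToyContext context = new ToyContext();
        System.out.println(parserForEmbedded(context).describe());

        // Opting out of embedded parsing: register the empty parser explicitly.
        context.set(ToyParser.class, new ToyEmptyParser());
        System.out.println(parserForEmbedded(context).describe());
    }
}
```

The point of the model is that the caller opts out by registering something explicit under the Parser key; leaving the key unset selects the default.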
-
getStatelessParser
Utility function to get the Parser that was sent in to the ParseContext to handle embedded documents. If it is stateful, unwrap it to get its stateless delegating parser. If there is no Parser in the ParseContext, this will return null.
- Parameters:
context
- Returns:
the stateless Parser, or null if the ParseContext contains no Parser
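The unwrap-or-null rule above can be sketched as follows. This is an illustrative model only: UnwrapSketchParser, StatelessSketchParser, StatefulSketchParser and StatelessParserLookup are hypothetical names, and the way Tika itself detects and unwraps a stateful parser is not shown on this page.

```java
// Toy stand-ins for a parser hierarchy (all names are hypothetical).
interface UnwrapSketchParser {}

class StatelessSketchParser implements UnwrapSketchParser {}

// A stateful wrapper that delegates to a stateless parser.
class StatefulSketchParser implements UnwrapSketchParser {
    private final UnwrapSketchParser delegate;

    StatefulSketchParser(UnwrapSketchParser delegate) {
        this.delegate = delegate;
    }

    UnwrapSketchParser getDelegate() {
        return delegate;
    }
}

class StatelessParserLookup {
    // Returns null when no parser was supplied, the delegate when the parser
    // is a stateful wrapper, and the parser itself otherwise.
    static UnwrapSketchParser stateless(UnwrapSketchParser fromContext) {
        if (fromContext == null) {
            return null;
        }
        if (fromContext instanceof StatefulSketchParser) {
            return ((StatefulSketchParser) fromContext).getDelegate();
        }
        return fromContext;
    }

    public static void main(String[] args) {
        UnwrapSketchParser inner = new StatelessSketchParser();
        System.out.println(stateless(new StatefulSketchParser(inner)) == inner);
    }
}
```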
getPasswordProvider
-
getDetector
-
getMimeTypes
-
getTikaConfig
- Returns:
a TikaConfig. It is looked up first in the ParseContext that was supplied during initialization; if none is found there, a new one is created via TikaConfig.getDefaultConfig(). The default config is cached so that it only has to be created once.
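The context-first lookup with a cached default can be pictured with a small model. The sketch below assumes nothing about Tika's internals: ToyConfig and ToyConfigLookup are hypothetical stand-ins, and DEFAULTS_BUILT exists only to make the caching visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy stand-in for TikaConfig (hypothetical); counts how often a default is built.
class ToyConfig {
    static final AtomicInteger DEFAULTS_BUILT = new AtomicInteger();
    final String origin;

    ToyConfig(String origin) {
        this.origin = origin;
    }

    // Stands in for an expensive default-config factory.
    static ToyConfig buildDefault() {
        DEFAULTS_BUILT.incrementAndGet();
        return new ToyConfig("default");
    }
}

// Not thread-safe by design, mirroring the one-instance-per-thread note.
class ToyConfigLookup {
    private final ToyConfig fromContext; // may be null when the context holds no config
    private ToyConfig cachedDefault;     // built lazily, at most once per lookup instance

    ToyConfigLookup(ToyConfig fromContext) {
        this.fromContext = fromContext;
    }

    ToyConfig getConfig() {
        if (fromContext != null) {
            return fromContext; // the context wins when it supplies a config
        }
        if (cachedDefault == null) {
            cachedDefault = ToyConfig.buildDefault();
        }
        return cachedDefault;
    }

    public static void main(String[] args) {
        ToyConfigLookup lookup = new ToyConfigLookup(null);
        System.out.println(lookup.getConfig() == lookup.getConfig());
    }
}
```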
-
getExtension
-
recordException
-
recordEmbeddedStreamException
-
shouldParseEmbedded
-
parseEmbedded
public void parseEmbedded(InputStream inputStream, ContentHandler handler, Metadata metadata, boolean outputHtml) throws IOException, SAXException
- Throws:
IOException
SAXException
-
tryToFindExistingLeafParser
Tries to find an existing parser within the ParseContext. It looks inside CompositeParsers and ParserDecorators. The use case is when a parser needs to parse an internal stream that is _part_ of the document, e.g. an RTF body inside an MSG file. Can return null if the context contains no parser or the correct parser can't be found.
- Parameters:
clazz - parser class to search for
context
- Returns:
the matching parser, or null if none is found
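A search through composites and decorators is naturally recursive. The sketch below is a simplified model, not the Tika implementation: SketchParser, SketchComposite, SketchDecorator and the three leaf classes are hypothetical stand-ins, and matching with isInstance is an assumption (the real method may compare classes more strictly).

```java
import java.util.Arrays;
import java.util.List;

// Toy stand-ins for Parser, CompositeParser and ParserDecorator (all names hypothetical).
interface SketchParser {}

class SketchComposite implements SketchParser {
    final List<SketchParser> children;

    SketchComposite(SketchParser... children) {
        this.children = Arrays.asList(children);
    }
}

class SketchDecorator implements SketchParser {
    final SketchParser wrapped;

    SketchDecorator(SketchParser wrapped) {
        this.wrapped = wrapped;
    }
}

class SketchRtfParser implements SketchParser {}

class SketchPdfParser implements SketchParser {}

class SketchMsgParser implements SketchParser {}

class LeafParserSearch {
    // Depth-first search for the first parser of the requested class.
    // Returns null when the root is null or nothing matches.
    static <T extends SketchParser> T find(Class<T> clazz, SketchParser root) {
        if (root == null) {
            return null;
        }
        if (clazz.isInstance(root)) { // assumption: subclasses also count as a match
            return clazz.cast(root);
        }
        if (root instanceof SketchDecorator) {
            return find(clazz, ((SketchDecorator) root).wrapped);
        }
        if (root instanceof SketchComposite) {
            for (SketchParser child : ((SketchComposite) root).children) {
                T hit = find(clazz, child);
                if (hit != null) {
                    return hit;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SketchParser root = new SketchComposite(
                new SketchPdfParser(),
                new SketchDecorator(new SketchComposite(new SketchRtfParser())));
        System.out.println(find(SketchRtfParser.class, root) != null);
    }
}
```

Callers should treat a null result as a normal outcome and fall back to constructing the parser they need.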