Package org.apache.tika.parser.external
Class ExternalParser
java.lang.Object
org.apache.tika.parser.external.ExternalParser
- All Implemented Interfaces: Serializable, Parser
- Direct Known Subclasses: TensorflowImageRecParser
Parser that uses an external program (like catdoc or pdf2txt) to extract
text content and metadata from a given document.
Method Summary
Modifier and Type | Method | Description
static boolean | check(String[] checkCmd, int... errorValue) |
static boolean | check(String checkCmd, int... errorValue) | Checks to see if the command can be run.
String[] | getCommand() |
ExternalParser.LineConsumer | getIgnoredLineConsumer() | Gets lines consumer
Map<Pattern, String> | getMetadataExtractionPatterns() |
Set<MediaType> | getSupportedTypes() |
Set<MediaType> | getSupportedTypes(ParseContext context) | Returns the set of media types supported by this parser when used with the given parse context.
void | parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) | Executes the configured external command on the given document stream and passes the extracted text as a simple XHTML document to the given SAX content handler.
void | setCommand(String... command) | Sets the command to be run.
void | setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer) | Sets a consumer for the lines ignored by the parse functions.
void | setMetadataExtractionPatterns(Map<Pattern, String> patterns) | Sets the map of regular expression patterns and Metadata keys.
void | setSupportedTypes(Set<MediaType> supportedTypes) |
-
Field Details
-
INPUT_FILE_TOKEN
The token which, if present in the command string, is replaced with the input filename. Alternatively, the input data can be streamed over STDIN.
-
OUTPUT_FILE_TOKEN
The token which, if present in the command string, is replaced with the output filename. Alternatively, the output data can be collected on STDOUT.
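The token substitution described for these two fields can be illustrated with a small JDK-only sketch. It is not Tika's implementation: the class name `CommandTokenSketch`, its helper methods, and the literal token strings are assumptions made for illustration, and real code should reference `ExternalParser.INPUT_FILE_TOKEN` and `ExternalParser.OUTPUT_FILE_TOKEN` rather than hard-coded values.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and token values here are assumptions. */
final class CommandTokenSketch {
    // Placeholder token values chosen for this sketch. Real code should use
    // ExternalParser.INPUT_FILE_TOKEN and ExternalParser.OUTPUT_FILE_TOKEN.
    static final String INPUT_TOKEN = "${INPUT}";
    static final String OUTPUT_TOKEN = "${OUTPUT}";

    /** Returns a copy of the command with both tokens replaced by file names. */
    static String[] substitute(String[] command, String inputPath, String outputPath) {
        String[] resolved = new String[command.length];
        for (int i = 0; i < command.length; i++) {
            // String.replace performs literal (non-regex) replacement.
            resolved[i] = command[i]
                    .replace(INPUT_TOKEN, inputPath)
                    .replace(OUTPUT_TOKEN, outputPath);
        }
        return resolved;
    }

    /** True if any argument contains the token, meaning a file is needed rather than a pipe. */
    static boolean usesToken(String[] command, String token) {
        return Arrays.stream(command).anyMatch(arg -> arg.contains(token));
    }
}
```

For a hypothetical command `{"myapp", "${INPUT}", "${OUTPUT}"}`, `substitute` yields the same array with the two placeholders replaced by the supplied paths. A command containing neither token would instead read the document from STDIN and write its result to STDOUT.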
-
-
Constructor Details
-
ExternalParser
public ExternalParser()
-
-
Method Details
-
check
Checks whether the command can be run. Typically used with something like "myapp --version" to check whether "myapp" is installed and on the path.
- Parameters:
checkCmd - The check command to run
errorValue - The exit value(s) to treat as an error
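The idea behind this check can be sketched with the JDK's `ProcessBuilder` alone. This is a rough illustration under stated assumptions, not the library's code: the class `CommandCheckSketch` and the method `canRun` are hypothetical names, and the sketch assumes that a command which cannot be started at all counts as unavailable.

```java
import java.io.IOException;
import java.util.Arrays;

/** Illustrative sketch only; not the implementation used by ExternalParser.check. */
final class CommandCheckSketch {
    /**
     * Runs the check command and reports whether it looks usable.
     * Returns false if the process cannot be started or if its exit
     * code is one of the given error values.
     */
    static boolean canRun(String[] checkCmd, int... errorValues) {
        try {
            Process process = new ProcessBuilder(checkCmd)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // output is not needed
                    .start();
            process.getOutputStream().close(); // nothing to send on the command's stdin
            int exit = process.waitFor();
            return Arrays.stream(errorValues).noneMatch(value -> value == exit);
        } catch (IOException e) {
            // Typically means the program was not found on the path.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A call such as `CommandCheckSketch.canRun(new String[] {"myapp", "--version"}, 127)` would return false when "myapp" is not installed, because the process cannot be started. Here 127 is an assumed error value (the exit code many shells use for "command not found").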
-
check
-
getSupportedTypes
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Specified by: getSupportedTypes in interface Parser
- Parameters:
context - parse context
- Returns: immutable set of media types
-
getSupportedTypes
-
setSupportedTypes
-
getCommand
-
setCommand
Sets the command to be run. This can include either of INPUT_FILE_TOKEN or OUTPUT_FILE_TOKEN if the command needs filenames.
-
getIgnoredLineConsumer
Gets the consumer for the lines ignored by the parse functions.
- Returns: consumer instance
-
setIgnoredLineConsumer
Sets a consumer for the lines ignored by the parse functions.
- Parameters:
ignoredLineConsumer - consumer instance
-
getMetadataExtractionPatterns
-
setMetadataExtractionPatterns
Sets the map of regular expression patterns and Metadata keys. Any matching patterns will have the matching metadata entries set. Set this to null to disable Metadata extraction.
-
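To make the pattern-to-key mapping for setMetadataExtractionPatterns concrete, here is a minimal JDK-only sketch of how such a map could be applied to the lines an external command prints. It makes several assumptions that the documentation above does not confirm: the value is taken from the first capture group, a plain `Map<String, String>` stands in for Tika's Metadata object, and a `Consumer<String>` stands in for `ExternalParser.LineConsumer`. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; the matching rules here are assumptions. */
final class MetadataPatternSketch {
    /**
     * Applies each pattern to each line. On a match, the first capture group
     * is stored under the metadata key mapped to that pattern. Lines matching
     * no pattern are handed to the ignored-line consumer.
     */
    static Map<String, String> extract(List<String> lines,
                                       Map<Pattern, String> patterns,
                                       Consumer<String> ignoredLines) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (patterns == null) {
            return metadata; // a null map disables extraction in this sketch
        }
        for (String line : lines) {
            boolean matched = false;
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher matcher = entry.getKey().matcher(line);
                if (matcher.find() && matcher.groupCount() >= 1) {
                    String value = matcher.group(1);
                    if (value != null) {
                        metadata.put(entry.getValue(), value.trim());
                        matched = true;
                    }
                }
            }
            if (!matched) {
                ignoredLines.accept(line);
            }
        }
        return metadata;
    }
}
```

With a pattern such as `^Title:\s*(.*)$` mapped to the key "title", the line "Title: Report" stores "Report" under "title". A line that matches no pattern is passed to the consumer instead.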
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Executes the configured external command on the given document stream and passes the extracted text as a simple XHTML document to the given SAX content handler. Metadata is only extracted if setMetadataExtractionPatterns(Map) has been called to set patterns.
- Specified by: parse in interface Parser
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
-