Package org.apache.tika.parser.external
Class ExternalParser
- java.lang.Object
-
- org.apache.tika.parser.AbstractParser
-
- org.apache.tika.parser.external.ExternalParser
-
- All Implemented Interfaces:
Serializable, Parser
- Direct Known Subclasses:
TensorflowImageRecParser
public class ExternalParser extends AbstractParser
Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.
- See Also:
- Serialized Form
-
-
Nested Class Summary
static interface ExternalParser.LineConsumer
    Consumer contract
-
Field Summary
static String INPUT_FILE_TOKEN
    The token, which if present in the Command string, will be replaced with the input filename.
static String OUTPUT_FILE_TOKEN
    The token, which if present in the Command string, will be replaced with the output filename.
-
Constructor Summary
ExternalParser()
-
Method Summary
static boolean check(String[] checkCmd, int... errorValue)
static boolean check(String checkCmd, int... errorValue)
    Checks to see if the command can be run.
String[] getCommand()
ExternalParser.LineConsumer getIgnoredLineConsumer()
    Gets the consumer for ignored lines.
Map<Pattern,String> getMetadataExtractionPatterns()
Set<MediaType> getSupportedTypes()
Set<MediaType> getSupportedTypes(ParseContext context)
    Returns the set of media types supported by this parser when used with the given parse context.
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
    Executes the configured external command and passes the given document stream as a simple XHTML document to the given SAX content handler.
void setCommand(String... command)
    Sets the command to be run.
void setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer)
    Sets a consumer for the lines ignored by the parse functions.
void setMetadataExtractionPatterns(Map<Pattern,String> patterns)
    Sets the map of regular expression patterns and Metadata keys.
void setSupportedTypes(Set<MediaType> supportedTypes)
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
-
-
-
Field Detail
-
INPUT_FILE_TOKEN
public static final String INPUT_FILE_TOKEN
The token, which if present in the Command string, will be replaced with the input filename. Alternatively, the input data can be streamed over STDIN.
- See Also:
- Constant Field Values
-
OUTPUT_FILE_TOKEN
public static final String OUTPUT_FILE_TOKEN
The token, which if present in the Command string, will be replaced with the output filename. Alternatively, the output data can be collected on STDOUT.
- See Also:
- Constant Field Values
-
-
Method Detail
-
check
public static boolean check(String checkCmd, int... errorValue)
Checks to see if the command can be run. Typically used with something like "myapp --version" to check whether "myapp" is installed and on the path.
- Parameters:
checkCmd - The check command to run
errorValue - The exit value(s) considered to be an error
-
check
public static boolean check(String[] checkCmd, int... errorValue)
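The check described above can be approximated with the standard library alone. The following is a minimal sketch, not the Tika implementation: the class name `CommandCheckSketch` and the helpers `isErrorValue` and `canRun` are hypothetical, and it assumes that a command counts as runnable when the process starts and its exit code is not one of the supplied error values.

```java
import java.io.IOException;
import java.util.Arrays;

// Illustrative sketch only; names and behavior are assumptions, not Tika's code.
final class CommandCheckSketch {

    // True if exitCode is one of the values the caller treats as an error.
    static boolean isErrorValue(int exitCode, int... errorValues) {
        return Arrays.stream(errorValues).anyMatch(v -> v == exitCode);
    }

    // Starts checkCmd, waits for it to exit, and reports whether it
    // started successfully and finished with a non-error exit code.
    static boolean canRun(String[] checkCmd, int... errorValues) {
        try {
            Process process = new ProcessBuilder(checkCmd)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // drop all output
                    .start();
            process.getOutputStream().close(); // nothing to send on stdin
            int exitCode = process.waitFor();
            return !isErrorValue(exitCode, errorValues);
        } catch (IOException e) {
            return false; // typically: the program was not found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A call such as `CommandCheckSketch.canRun(new String[] {"myapp", "--version"}, 127)` would then mirror the "myapp --version" usage mentioned above, with 127 chosen here only as an example error value.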
-
getSupportedTypes
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Parameters:
context - parse context
- Returns:
immutable set of media types
-
getCommand
public String[] getCommand()
-
setCommand
public void setCommand(String... command)
Sets the command to be run. This can include either of INPUT_FILE_TOKEN or OUTPUT_FILE_TOKEN if the command needs filenames.
- See Also:
Runtime.exec(String[])
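To illustrate how filename tokens in a command could be filled in before execution, here is a minimal sketch. It is not taken from the Tika source: the class `CommandTokenSketch` and the method `fillTokens` are hypothetical, and the literal token values `${INPUT}` and `${OUTPUT}` are assumptions (the authoritative values are listed under Constant Field Values).

```java
import java.util.Arrays;

// Illustrative sketch only; not the Tika implementation.
final class CommandTokenSketch {
    // Assumed token literals; check the Constant Field Values page for the real ones.
    static final String INPUT_FILE_TOKEN = "${INPUT}";
    static final String OUTPUT_FILE_TOKEN = "${OUTPUT}";

    // Returns a copy of the command with every token occurrence replaced
    // by the corresponding file path. Arguments without tokens are unchanged.
    static String[] fillTokens(String[] command, String inputPath, String outputPath) {
        return Arrays.stream(command)
                .map(arg -> arg.replace(INPUT_FILE_TOKEN, inputPath)
                               .replace(OUTPUT_FILE_TOKEN, outputPath))
                .toArray(String[]::new);
    }
}
```

For example, with a hypothetical tool named `convert`, the command `{"convert", "${INPUT}", "-o", "${OUTPUT}"}` would be expanded with the two paths supplied by the caller. A command that contains neither token is passed through as-is, which corresponds to the STDIN/STDOUT alternative described for the two fields.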
-
getIgnoredLineConsumer
public ExternalParser.LineConsumer getIgnoredLineConsumer()
Gets the consumer for lines ignored by the parse functions.
- Returns:
consumer instance
-
setIgnoredLineConsumer
public void setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer)
Sets a consumer for the lines ignored by the parse functions.
- Parameters:
ignoredLineConsumer - consumer instance
-
setMetadataExtractionPatterns
public void setMetadataExtractionPatterns(Map<Pattern,String> patterns)
Sets the map of regular expression patterns and Metadata keys. Any matching patterns will have the matching metadata entries set. Set this to null to disable Metadata extraction.
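The pattern-to-key mapping can be pictured with a small standalone sketch. It is an illustration under stated assumptions, not Tika's code: the class `MetadataPatternSketch` and the method `extract` are hypothetical, a plain `Map<String, String>` stands in for `Metadata`, the metadata value is assumed to be the first capture group of the match, and lines that match no pattern are handed to an "ignored" list in the spirit of the ignored line consumer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; a Map stands in for Tika's Metadata class.
final class MetadataPatternSketch {

    // Applies every pattern to every output line. On a match with at least one
    // capture group, group 1 is stored under the key mapped to that pattern.
    // Lines that produce no metadata entry are appended to 'ignored'.
    static Map<String, String> extract(List<String> lines,
                                       Map<Pattern, String> patterns,
                                       List<String> ignored) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (patterns == null) {
            return metadata; // null patterns: extraction disabled, nothing collected
        }
        for (String line : lines) {
            boolean matched = false;
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher m = entry.getKey().matcher(line);
                if (m.find() && m.groupCount() >= 1) {
                    metadata.put(entry.getValue(), m.group(1));
                    matched = true;
                }
            }
            if (!matched) {
                ignored.add(line);
            }
        }
        return metadata;
    }
}
```

With a pattern such as `^Title:\s*(.+)$` mapped to the key `title`, an output line `Title: Report` would yield the entry `title = Report` in this sketch, while unrelated lines end up in the ignored list.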
-
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Executes the configured external command and passes the given document stream as a simple XHTML document to the given SAX content handler. Metadata is only extracted if setMetadataExtractionPatterns(Map) has been called to set patterns.
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
-
-