Package org.apache.tika.parser.external
Class ExternalParser
- java.lang.Object
- org.apache.tika.parser.external.ExternalParser

- All Implemented Interfaces:
- Serializable, Parser
- Direct Known Subclasses:
- TensorflowImageRecParser
 
public class ExternalParser extends Object implements Parser

Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.
- See Also:
- Serialized Form
 
Nested Class Summary
- static interface ExternalParser.LineConsumer - Consumer contract

Field Summary
- static String INPUT_FILE_TOKEN - The token which, if present in the command string, will be replaced with the input filename.
- static String OUTPUT_FILE_TOKEN - The token which, if present in the command string, will be replaced with the output filename.

Constructor Summary
- ExternalParser()

Method Summary
- static boolean check(String[] checkCmd, int... errorValue)
- static boolean check(String checkCmd, int... errorValue) - Checks to see if the command can be run.
- String[] getCommand()
- ExternalParser.LineConsumer getIgnoredLineConsumer() - Gets the consumer for lines ignored by the parse functions.
- Map<Pattern,String> getMetadataExtractionPatterns()
- Set<MediaType> getSupportedTypes()
- Set<MediaType> getSupportedTypes(ParseContext context) - Returns the set of media types supported by this parser when used with the given parse context.
- void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) - Executes the configured external command and passes the given document stream as a simple XHTML document to the given SAX content handler.
- void setCommand(String... command) - Sets the command to be run.
- void setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer) - Sets a consumer for the lines ignored by the parse functions.
- void setMetadataExtractionPatterns(Map<Pattern,String> patterns) - Sets the map of regular expression patterns and Metadata keys.
- void setSupportedTypes(Set<MediaType> supportedTypes)
 
Field Detail

INPUT_FILE_TOKEN
public static final String INPUT_FILE_TOKEN
The token which, if present in the command string, will be replaced with the input filename. Alternatively, the input data can be streamed over STDIN.
- See Also:
- Constant Field Values

OUTPUT_FILE_TOKEN
public static final String OUTPUT_FILE_TOKEN
The token which, if present in the command string, will be replaced with the output filename. Alternatively, the output data can be collected on STDOUT.
- See Also:
- Constant Field Values
 
 
Method Detail

check
public static boolean check(String checkCmd, int... errorValue)
Checks to see if the command can be run. Typically used with something like "myapp --version" to check whether "myapp" is installed and on the path.
- Parameters:
- checkCmd - The check command to run
- errorValue - The exit values that are considered errors

check
public static boolean check(String[] checkCmd, int... errorValue)
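
The availability check described above can be approximated with the JDK alone. The following is a minimal sketch, not Tika's implementation: the class CommandCheckSketch and the method canRun are hypothetical names, and the rule that "runnable" means "the process starts and its exit value is not in the error list" is an assumption drawn from the parameter descriptions.

```java
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for the check(...) idea; it does not call Tika.
class CommandCheckSketch {

    // Returns true when the command starts and its exit value is not
    // one of errorValues (assumed semantics, see the lead-in).
    static boolean canRun(String[] checkCmd, int... errorValues) {
        try {
            Process process = new ProcessBuilder(checkCmd)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore all output
                    .start();
            process.getOutputStream().close(); // the command gets no input on stdin
            int exit = process.waitFor();
            return Arrays.stream(errorValues).noneMatch(v -> v == exit);
        } catch (IOException e) {
            return false; // the program could not be started, e.g. not on the path
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // 127 is the conventional shell exit value for "command not found".
        System.out.println(canRun(new String[] {"java", "-version"}, 127));
    }
}
```

In this sketch an empty errorValues list means any exit value is accepted, so only a failure to start the process makes the check fail.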

getSupportedTypes
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Specified by:
- getSupportedTypes in interface Parser
- Parameters:
- context - parse context
- Returns:
- immutable set of media types
 
getCommand
public String[] getCommand()

setCommand
public void setCommand(String... command)
Sets the command to be run. This can include either of INPUT_FILE_TOKEN or OUTPUT_FILE_TOKEN if the command needs filenames.
- See Also:
- Runtime.exec(String[])
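
To illustrate how filename tokens in a command can be resolved before the command is handed to Runtime.exec(String[]), here is a small sketch that uses only the JDK. It does not call Tika: the placeholder strings "${INPUT}" and "${OUTPUT}" are assumed values chosen for illustration (the actual values are listed under Constant Field Values), and the class TokenSubstitutionSketch and the method fillTokens are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of token substitution; not Tika's code.
class TokenSubstitutionSketch {
    // Assumed placeholder values, for illustration only.
    static final String INPUT_TOKEN = "${INPUT}";
    static final String OUTPUT_TOKEN = "${OUTPUT}";

    // Returns a copy of the command with both tokens replaced literally.
    static String[] fillTokens(String[] command, String inputPath, String outputPath) {
        String[] resolved = new String[command.length];
        for (int i = 0; i < command.length; i++) {
            resolved[i] = command[i]
                    .replace(INPUT_TOKEN, inputPath)    // literal replace, no regex
                    .replace(OUTPUT_TOKEN, outputPath);
        }
        return resolved;
    }

    public static void main(String[] args) {
        String[] template = {"pdftotext", INPUT_TOKEN, OUTPUT_TOKEN};
        String[] resolved = fillTokens(template, "/tmp/in.pdf", "/tmp/out.txt");
        System.out.println(Arrays.toString(resolved));
        // prints [pdftotext, /tmp/in.pdf, /tmp/out.txt]
    }
}
```

A command without either token is returned unchanged, which corresponds to the case where input is streamed over STDIN and output is collected on STDOUT.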
 
getIgnoredLineConsumer
public ExternalParser.LineConsumer getIgnoredLineConsumer()
Gets the consumer for lines ignored by the parse functions.
- Returns:
- consumer instance
 
setIgnoredLineConsumer
public void setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer)
Sets a consumer for the lines ignored by the parse functions.
- Parameters:
- ignoredLineConsumer - consumer instance
 
setMetadataExtractionPatterns
public void setMetadataExtractionPatterns(Map<Pattern,String> patterns)
Sets the map of regular expression patterns and Metadata keys. Any matching patterns will have the matching metadata entries set. Set this to null to disable Metadata extraction.
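
The pattern-to-key mapping can be pictured with a short JDK-only sketch. It is an illustration under stated assumptions, not Tika's implementation: a plain Map<String, String> stands in for the Metadata object, the first capture group is assumed to supply the value, and the class MetadataPatternSketch, the method extract, and the keys "title" and "pageCount" are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of pattern-driven metadata extraction.
class MetadataPatternSketch {

    // Tries every pattern on every output line; on a match, stores the first
    // capture group under the mapped key. A null map disables extraction.
    static Map<String, String> extract(List<String> lines, Map<Pattern, String> patterns) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (patterns == null) {
            return metadata;
        }
        for (String line : lines) {
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher m = entry.getKey().matcher(line);
                if (m.find() && m.groupCount() >= 1) {
                    metadata.put(entry.getValue(), m.group(1));
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<Pattern, String> patterns = new LinkedHashMap<>();
        patterns.put(Pattern.compile("^Title:\\s+(.*)$"), "title");
        patterns.put(Pattern.compile("^Pages:\\s+(\\d+)$"), "pageCount");
        List<String> output = List.of("Title:  Annual Report", "Pages:  12", "Some body text");
        System.out.println(extract(output, patterns));
        // prints {title=Annual Report, pageCount=12}
    }
}
```

Lines that match no pattern, such as the last one in the usage example, contribute nothing to the result.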

parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Executes the configured external command and passes the given document stream as a simple XHTML document to the given SAX content handler. Metadata is only extracted if setMetadataExtractionPatterns(Map) has been called to set patterns.
- Specified by:
- parse in interface Parser
- Parameters:
- stream - the document stream (input)
- handler - handler for the XHTML SAX events (output)
- metadata - document metadata (input and output)
- context - parse context
- Throws:
- IOException - if the document stream could not be read
- SAXException - if the SAX events could not be processed
- TikaException - if the document could not be parsed