Package org.apache.tika.parser.external
Class ExternalParser
java.lang.Object
org.apache.tika.parser.external.ExternalParser
- All Implemented Interfaces:
Serializable, SelfConfiguring, Parser
Parser that uses an external program (like ffmpeg, exiftool or sox)
to extract text content and metadata from a given document.
This parser relies on JSON configuration rather than classpath auto-discovery. Users can specify independent handlers for each process stream:
- stdoutHandler: processes stdout
- stderrHandler: processes stderr
- outputFileHandler: processes the output file
The contentSource field controls which stream provides the XHTML
content output. An optional checkCommandLine lazily verifies that the
external tool is available.
- See Also:
-
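The stream model described above can be illustrated with a small, self-contained sketch. It does not use the Tika API and is not how ExternalParser is implemented: the class StreamHandlerSketch, the ContentSource enum, the ToolOutput holder, and the run and dispatch methods are all hypothetical names chosen for illustration. It assumes that each stream is passed to its own handler and that a single setting picks the stream used as content.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Illustrative only: none of these names are part of the Tika API. */
final class StreamHandlerSketch {

    /** Hypothetical stand-in for the contentSource setting. */
    enum ContentSource { STDOUT, STDERR, OUTPUT_FILE }

    /** Captured text of the three streams produced by one tool run. */
    static final class ToolOutput {
        final String stdout;
        final String stderr;
        final String outputFile;

        ToolOutput(String stdout, String stderr, String outputFile) {
            this.stdout = stdout;
            this.stderr = stderr;
            this.outputFile = outputFile;
        }
    }

    private static String read(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    /**
     * Runs the command with a timeout. stdout and stderr are redirected to
     * temporary files so that a full pipe buffer cannot block the child process.
     * The outputFile argument may be null if the tool writes no file.
     */
    static ToolOutput run(List<String> command, Path outputFile, long timeoutMs)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("sketch-stdout", ".txt");
        Path err = Files.createTempFile("sketch-stderr", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(err.toFile())
                    .start();
            if (!process.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out after " + timeoutMs + " ms: " + command);
            }
            String fileText = (outputFile != null && Files.exists(outputFile))
                    ? read(outputFile) : "";
            return new ToolOutput(read(out), read(err), fileText);
        } finally {
            Files.deleteIfExists(out);
            Files.deleteIfExists(err);
        }
    }

    /** Passes each stream to its own handler, then returns the text selected as content. */
    static String dispatch(ToolOutput output, ContentSource contentSource,
                           Consumer<String> stdoutHandler,
                           Consumer<String> stderrHandler,
                           Consumer<String> outputFileHandler) {
        stdoutHandler.accept(output.stdout);
        stderrHandler.accept(output.stderr);
        outputFileHandler.accept(output.outputFile);
        switch (contentSource) {
            case STDOUT:
                return output.stdout;
            case STDERR:
                return output.stderr;
            default:
                return output.outputFile;
        }
    }
}
```

In this sketch the handlers run independently of the content choice, so a tool that prints metadata on stderr and text on stdout can have both processed while only one of them becomes the content.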
Field Summary
Fields -
Constructor Summary
Constructors
- ExternalParser(): Default constructor - not typically useful since ExternalParser requires configuration.
- ExternalParser(JsonConfig jsonConfig): JSON config constructor - used for deserialization.
- ExternalParser(ExternalParserConfig config): Programmatic constructor with typed config.
Method Summary
Methods
- getConfig(): Returns the configuration for this parser.
- getSupportedTypes(ParseContext context): Returns the set of media types supported by this parser when used with the given parse context.
- void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context): Parses a document stream into a sequence of XHTML SAX events.
-
Field Details
-
DEFAULT_TIMEOUT_MS
public static final long DEFAULT_TIMEOUT_MS
- See Also:
-
INPUT_FILE_TOKEN
- See Also:
-
OUTPUT_FILE_TOKEN
- See Also:
-
-
Constructor Details
-
ExternalParser
public ExternalParser()
Default constructor - not typically useful since ExternalParser requires configuration.
-
ExternalParser
Programmatic constructor with typed config. -
ExternalParser
JSON config constructor - used for deserialization.
-
-
Method Details
-
getSupportedTypes
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Specified by:
getSupportedTypes in interface Parser
- Parameters:
context - parse context
- Returns:
immutable set of media types
-
parse
public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Description copied from interface: Parser
Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.
The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
- Specified by:
parse in interface Parser
- Parameters:
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
-
getConfig
Returns the configuration for this parser.
-