Class ExternalParser

java.lang.Object
org.apache.tika.parser.external.ExternalParser
All Implemented Interfaces:
Serializable, SelfConfiguring, Parser

public class ExternalParser extends Object implements Parser
Parser that uses an external program (like ffmpeg, exiftool or sox) to extract text content and metadata from a given document.

This parser relies on JSON configuration rather than classpath auto-discovery. Users can specify independent handlers for each process stream:

  • stdoutHandler — processes stdout
  • stderrHandler — processes stderr
  • outputFileHandler — processes the output file
The contentSource field selects which of these streams supplies the XHTML content output. If checkCommandLine is set, it is used to lazily verify that the external tool is available.
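The idea of handling each process stream independently and then choosing one as the content source can be sketched with the standard library alone. This is an illustration under stated assumptions, not Tika's implementation: StreamRoutingSketch, Result, ContentSource and run are hypothetical names, and the demo invokes the current JVM with -version only because that command is available wherever this code runs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class StreamRoutingSketch {

    /** Hypothetical stand-in for the contentSource setting. */
    enum ContentSource { STDOUT, STDERR }

    /** Captured output of one external run. */
    static final class Result {
        final int exitCode;
        final String stdout;
        final String stderr;

        Result(int exitCode, String stdout, String stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }

        /** Returns the stream chosen as the content source. */
        String content(ContentSource source) {
            return source == ContentSource.STDOUT ? stdout : stderr;
        }
    }

    static Result run(List<String> commandLine) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandLine).start();
        process.getOutputStream().close(); // the child gets no stdin

        // Drain stderr on a helper thread so a full pipe buffer cannot stall
        // the child while this thread is busy reading stdout.
        String[] stderr = new String[1];
        Thread errReader = new Thread(() -> stderr[0] = readAll(process.getErrorStream()));
        errReader.start();

        String stdout = readAll(process.getInputStream());
        errReader.join();
        int exitCode = process.waitFor();
        return new Result(exitCode, stdout, stderr[0]);
    }

    private static String readAll(InputStream in) {
        try (in) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return ""; // sketch only: treat an unreadable stream as empty
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the running JVM's own executable serves as the external tool.
        String java = ProcessHandle.current().info().command().orElse("java");
        Result result = run(List.of(java, "-version"));
        System.out.println("exit code: " + result.exitCode);
        // "java -version" writes its banner to stderr, so stderr is the content source here.
        System.out.println(result.content(ContentSource.STDERR));
    }
}
```

Reading the two streams concurrently matters for tools that write heavily to both. Reading them one after the other can deadlock once the unread pipe fills up.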
  • Field Details

  • Constructor Details

    • ExternalParser

      public ExternalParser()
      Default constructor - not typically useful since ExternalParser requires configuration.
    • ExternalParser

      public ExternalParser(ExternalParserConfig config)
      Programmatic constructor with typed config.
    • ExternalParser

      public ExternalParser(JsonConfig jsonConfig)
      JSON config constructor - used for deserialization.
  • Method Details

    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Description copied from interface: Parser
      Returns the set of media types supported by this parser when used with the given parse context.
      Specified by:
      getSupportedTypes in interface Parser
      Parameters:
      context - parse context
      Returns:
      immutable set of media types
    • parse

      public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
      Description copied from interface: Parser
      Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.

      The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.

      Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.

      Specified by:
      parse in interface Parser
      Parameters:
      tis - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      context - parse context
      Throws:
      IOException - if the document stream could not be read
      SAXException - if the SAX events could not be processed
      TikaException - if the document could not be parsed
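The ownership rule for parse (the stream is consumed but the caller closes it) can be illustrated without Tika on the classpath. In this minimal sketch, consume is a hypothetical stand-in for parse and CloseTrackingStream is a hypothetical helper that records whether close() was called. Neither is part of the Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamOwnershipSketch {

    /** Wrapper that records whether close() was called. */
    static final class CloseTrackingStream extends FilterInputStream {
        boolean closed;

        CloseTrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for parse: reads to EOF and leaves the stream open. */
    static int consume(InputStream in) throws IOException {
        int total = 0;
        byte[] buffer = new byte[4096];
        for (int n; (n = in.read(buffer)) != -1; ) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "hello".getBytes(StandardCharsets.UTF_8);
        CloseTrackingStream stream = new CloseTrackingStream(new ByteArrayInputStream(document));

        // The caller owns the stream, so the caller's try-with-resources closes it.
        try (stream) {
            int read = consume(stream);
            System.out.println(read + " bytes consumed, closed=" + stream.closed);
        }
        System.out.println("after try-with-resources, closed=" + stream.closed);
    }
}
```

With a real parser the same shape applies: open the TikaInputStream in the caller's try-with-resources and pass it to parse inside that block.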
    • getConfig

      public ExternalParserConfig getConfig()
      Returns the configuration for this parser.