External Parser Configuration

The ExternalParser allows Tika to delegate parsing to external command-line programs such as ffmpeg, exiftool, or sox. Each external parser is configured via JSON and must be explicitly enabled — Tika 4.x does not auto-discover external tools at startup.

Key Concepts

Lazy Check

Each external parser can declare a checkCommandLine that verifies the tool is installed. The check runs lazily on first use rather than at startup. If the check's exit code is listed in checkErrorCodes (default: [127]), the tool is treated as unavailable and the parser silently disables itself.
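
The lazy check can be pictured with a small Python sketch. This is not Tika's implementation: the class name LazyToolCheck, the runs counter, and the 60-second check timeout are hypothetical, and treating a missing executable as "unavailable" is an assumption. It only illustrates the check running once, on first use, with the result cached.

```python
import subprocess
import sys


class LazyToolCheck:
    """Runs a check command at most once, on first use, and caches the result."""

    def __init__(self, check_command_line, check_error_codes=(127,)):
        self.check_command_line = list(check_command_line)
        self.check_error_codes = set(check_error_codes)
        self._available = None  # None means "not checked yet"
        self.runs = 0           # how many times the check process was started

    def is_available(self):
        # Only the first call starts a process; later calls reuse the answer.
        if self._available is None:
            self._available = self._run_check()
        return self._available

    def _run_check(self):
        self.runs += 1
        try:
            result = subprocess.run(
                self.check_command_line,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=60,  # hypothetical limit for the check itself
            )
        except (OSError, subprocess.TimeoutExpired):
            # Assumption: a missing or hanging executable counts as unavailable.
            return False
        return result.returncode not in self.check_error_codes


# Usage: the Python interpreter stands in for a real tool such as ffmpeg.
check = LazyToolCheck([sys.executable, "--version"])
print(check.runs)            # no process has been started yet
print(check.is_available())  # first use triggers the check
print(check.runs)            # the check ran exactly once
```

Constructing the object starts nothing, which mirrors the point above: the cost of forking a process is paid only if the parser is actually used.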

Stream Handlers

An external process produces up to three output streams. Each can have an independent handler (any Tika parser):

  • stdoutHandler — processes stdout

  • stderrHandler — processes stderr

  • outputFileHandler — processes the output file (when ${OUTPUT_FILE} is used)

Handlers extract metadata, content, or both. regex-capture-parser is the most common choice for extracting metadata via regex patterns.
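
To make the idea of regex-based metadata capture concrete, here is a minimal Python sketch of a capture map applied to tool output. It rests on several assumptions that this page does not confirm: patterns are searched line by line, only the first match per key is kept, and the first capture group becomes the value. The function name capture_metadata and the sample output are made up for illustration, and Python's re module stands in for the Java regex engine that Tika uses.

```python
import re


def capture_metadata(text, capture_map):
    """Return {metadata_key: first captured group} for patterns that match a line."""
    compiled = {key: re.compile(pattern) for key, pattern in capture_map.items()}
    metadata = {}
    for line in text.splitlines():
        for key, regex in compiled.items():
            if key in metadata:
                continue  # assumption: keep only the first match per key
            match = regex.search(line)
            if match:
                metadata[key] = match.group(1)
    return metadata


# Hypothetical tool output, invented for this example.
sample_stderr = (
    "  Duration: 00:00:05.12, start: 0.000000, bitrate: 1205 kb/s\n"
    "  Stream #0:1: Audio: aac, 44100 Hz, stereo\n"
)

# In JSON config the backslashes are doubled; raw strings avoid that here.
capture_map = {
    "xmpDM:duration": r"\s*Duration:\s*([0-9:\.]+),.*",
    "xmpDM:audioSampleRate": r"\s*Stream.*:.+Audio:.*,\s+(\d+)\s+Hz,.*",
}

print(capture_metadata(sample_stderr, capture_map))
```

Keys whose pattern never matches are simply absent from the result in this sketch.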

Content Source

The contentSource field controls which stream provides the XHTML text content:

  • "stdout" — default when no ${OUTPUT_FILE} in the command

  • "outputFile" — default when ${OUTPUT_FILE} is in the command

  • "stderr" — use stderr as the content source

  • "none" — metadata-only mode, no text content extracted

When a handler is configured for the content source stream, its ContentHandler output becomes the XHTML content. When no handler is configured, the raw bytes are written as text.
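
The default rule for contentSource described above can be written out as a short Python sketch. The function name resolve_content_source and the validation step are hypothetical; only the four values and the two defaults come from this page.

```python
OUTPUT_FILE_TOKEN = "${OUTPUT_FILE}"
VALID_SOURCES = {"stdout", "stderr", "outputFile", "none"}


def resolve_content_source(command_line, content_source=None):
    """Pick the stream that supplies text content.

    An explicit contentSource wins. Otherwise the default is "outputFile"
    when the command mentions ${OUTPUT_FILE}, and "stdout" when it does not.
    """
    if content_source is not None:
        if content_source not in VALID_SOURCES:
            raise ValueError("unknown contentSource: " + repr(content_source))
        return content_source
    uses_output_file = any(OUTPUT_FILE_TOKEN in arg for arg in command_line)
    return "outputFile" if uses_output_file else "stdout"


print(resolve_content_source(["exiftool", "${INPUT_FILE}"]))
print(resolve_content_source(["tool", "${INPUT_FILE}", "${OUTPUT_FILE}"]))
print(resolve_content_source(["ffmpeg", "-i", "${INPUT_FILE}"], "none"))
```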

Configuration Options

Field              Type           Description
-----------------  -------------  -----------
commandLine        List<String>   The command and arguments to run. Use ${INPUT_FILE} and ${OUTPUT_FILE} tokens for file paths.
checkCommandLine   List<String>   Optional. Command to verify the tool is installed (e.g., ["ffmpeg", "-version"]).
checkErrorCodes    List<Integer>  Exit codes that indicate the tool is not available. Default: [127].
stdoutHandler      Parser config  Optional. Parser to process stdout.
stderrHandler      Parser config  Optional. Parser to process stderr.
outputFileHandler  Parser config  Optional. Parser to process the output file.
contentSource      String         Which stream provides XHTML content: "stdout", "stderr", "outputFile", or "none". Default: "outputFile" if the command contains ${OUTPUT_FILE}, otherwise "stdout".
returnStdout       boolean        Store raw stdout in metadata. Default: false.
returnStderr       boolean        Store raw stderr in metadata. Default: true.
timeoutMs          long           Process timeout in milliseconds. Default: 60000.
maxStdOut          int            Maximum stdout bytes to capture. Default: 10000.
maxStdErr          int            Maximum stderr bytes to capture. Default: 10000.
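
As a rough illustration of how commandLine, timeoutMs, maxStdOut, and maxStdErr interact, the following Python sketch substitutes the file tokens and runs a command with a timeout and output caps. The function names substitute_tokens and run_external are hypothetical, and the sketch simplifies: it reads all output and truncates afterwards, which is not necessarily how Tika enforces the limits.

```python
import subprocess
import sys


def substitute_tokens(command_line, input_file, output_file=None):
    """Replace ${INPUT_FILE} and ${OUTPUT_FILE} in every argument."""
    replaced = []
    for arg in command_line:
        arg = arg.replace("${INPUT_FILE}", input_file)
        if output_file is not None:
            arg = arg.replace("${OUTPUT_FILE}", output_file)
        replaced.append(arg)
    return replaced


def run_external(command_line, timeout_ms=60000, max_std_out=10000, max_std_err=10000):
    """Run the command; return (exit code, capped stdout, capped stderr) as bytes.

    subprocess.run kills the child and raises subprocess.TimeoutExpired
    when the timeout elapses.
    """
    result = subprocess.run(
        command_line,
        capture_output=True,
        timeout=timeout_ms / 1000.0,
    )
    return result.returncode, result.stdout[:max_std_out], result.stderr[:max_std_err]


# Usage: the paths are placeholders; nothing is executed by this call.
command = substitute_tokens(["exiftool", "${INPUT_FILE}"], "/tmp/sample.mp4")
print(command)
```

The defaults in run_external mirror the defaults listed in the table above.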

Examples

Exiftool (metadata from stdout)

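Extracts metadata from media files using exiftool. The stdoutHandler uses regex-capture-parser to pull selected fields out of the "Name : Value" lines that exiftool prints to stdout. Because contentSource is "none", only metadata is produced and no text content is extracted.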

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo",
          "video/mp4"
        ],
        "commandLine": ["exiftool", "${INPUT_FILE}"],
        "checkCommandLine": ["exiftool", "-ver"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "stdoutHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "mime": "^MIME Type\\s+: ([^\\r\\n]+)",
              "pages": "^Page Count\\s+: ([^\\r\\n]+)",
              "pdf:version": "^PDF Version\\s+: ([^\\r\\n]+)"
            }
          }
        }
      }
    }
  ]
}

FFmpeg (metadata from stderr)

Extracts audio/video metadata from ffmpeg -i output. FFmpeg writes metadata to stderr, so this uses stderrHandler.

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo"
        ],
        "commandLine": ["ffmpeg", "-i", "${INPUT_FILE}"],
        "checkCommandLine": ["ffmpeg", "-version"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "returnStderr": true,
        "maxStdErr": 20000,
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:audioSampleRate": "\\s*Stream.*:.+Audio:.*,\\s+(\\d+)\\s+Hz,.*",
              "xmpDM:audioChannelType": "\\s*Stream.*:.+Audio:.*\\d+\\s+Hz,\\s+(\\d{1,2})\\s+channels.*",
              "xmpDM:audioCompressor": "\\s*Stream.*:.+Audio:\\s+([A-Za-z0-9_\\(\\)/\\[\\] ]+),.*",
              "xmpDM:duration": "\\s*Duration:\\s*([0-9:\\.]+),.*",
              "xmpDM:fileDataRate": "\\s*Duration:.*,\\s*bitrate:\\s+([0-9A-Za-z/ ]+).*",
              "xmpDM:videoColorSpace": "\\s*Stream.*:\\s+Video:\\s+[A-Za-z0-9\\(\\)/ ]+,\\s+([A-Za-z0-9\\(\\) ,]+),\\s+[0-9x]+,.*",
              "xmpDM:videoCompressor": "\\s*Stream.*:\\s+Video:\\s+([A-Za-z0-9\\(\\)/ ]+),.*",
              "xmpDM:videoFrameRate": "\\s*Stream.*:\\s+Video:.*,\\s+([0-9]+)\\s+fps,.*",
              "encoder": "\\s*encoder\\s*\\:\\s*(\\w+).*",
              "videoResolution": "\\s*Stream.*:\\s+Video:.*,\\s+([0-9x]+),.*"
            }
          }
        }
      }
    }
  ]
}

Sox (audio metadata from stderr)

Extracts audio metadata using sox --info. Like FFmpeg, Sox writes to stderr.

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "audio/mpeg",
          "audio/mp3",
          "audio/wav",
          "audio/x-wav",
          "audio/ogg",
          "audio/vorbis",
          "audio/mp4"
        ],
        "commandLine": ["sox", "--info", "${INPUT_FILE}"],
        "checkCommandLine": ["sox", "--version"],
        "checkErrorCodes": [126, 127],
        "returnStderr": true,
        "maxStdErr": 10000,
        "contentSource": "none",
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:audioChannelType": "\\s*Channels.*:\\s+(\\d+)\\s*",
              "xmpDM:audioSampleRate": "\\s*Sample Rate.*:\\s+(\\d+)\\s*",
              "xmpDM:audioSampleType": "\\s*Precision.*:\\s+([\\d\\w-]+)\\s*",
              "xmpDM:duration": "\\s*Duration.*:\\s+([\\d:\\.]+)\\s*",
              "File Size": "\\s*File Size.*:\\s+([\\d\\w]+)\\s*",
              "xmpDM:fileDataRate": "\\s*Bit Rate.*:\\s+([\\d\\w]+)\\s*",
              "Sample Encoding": "\\s*Sample Encoding.*:\\s+(.*)\\s*",
              "xmpDM:logComment": "\\s*Comment.*:\\s+(.*)\\s*"
            }
          }
        }
      }
    }
  ]
}

Multiple External Parsers

You can configure multiple external parsers in a single config file. Each handles different MIME types via _mime-include. Here FFmpeg handles video files while exiftool handles PDFs:

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo"
        ],
        "commandLine": ["ffmpeg", "-i", "${INPUT_FILE}"],
        "checkCommandLine": ["ffmpeg", "-version"],
        "checkErrorCodes": [126, 127],
        "returnStderr": true,
        "maxStdErr": 20000,
        "contentSource": "none",
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:duration": "\\s*Duration:\\s*([0-9:\\.]+),.*",
              "xmpDM:audioSampleRate": "\\s*Stream.*:.+Audio:.*,\\s+(\\d+)\\s+Hz,.*"
            }
          }
        }
      }
    },
    {
      "external-parser": {
        "_mime-include": [
          "application/pdf"
        ],
        "commandLine": ["exiftool", "${INPUT_FILE}"],
        "checkCommandLine": ["exiftool", "-ver"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "stdoutHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "exiftool:MIMEType": "^MIME Type\\s+: ([^\\r\\n]+)",
              "exiftool:PageCount": "^Page Count\\s+: ([^\\r\\n]+)",
              "exiftool:PDFVersion": "^PDF Version\\s+: ([^\\r\\n]+)"
            }
          }
        }
      }
    }
  ]
}
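
One way to picture how a MIME type is routed to one of several external parsers is the Python sketch below. It assumes that the first entry whose _mime-include lists the detected type wins; this page does not state how Tika resolves overlaps, so treat that rule, and the function name select_parser, as hypothetical. The config is a trimmed copy of the example above, with handlers omitted.

```python
import json

# Trimmed version of the two-parser config; handlers are left out for brevity.
CONFIG = json.loads("""
{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": ["video/avi", "video/mpeg", "video/x-msvideo"],
        "commandLine": ["ffmpeg", "-i", "${INPUT_FILE}"]
      }
    },
    {
      "external-parser": {
        "_mime-include": ["application/pdf"],
        "commandLine": ["exiftool", "${INPUT_FILE}"]
      }
    }
  ]
}
""")


def select_parser(config, mime_type):
    """Return the settings of the first parser that includes mime_type, else None."""
    for entry in config["parsers"]:
        for settings in entry.values():
            if mime_type in settings.get("_mime-include", []):
                return settings
    return None


for mime in ("video/mpeg", "application/pdf", "text/plain"):
    chosen = select_parser(CONFIG, mime)
    print(mime, "->", chosen["commandLine"][0] if chosen else None)
```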

Changes from 3.x

In Tika 3.x, external parsers were configured via XML (tika-external-parsers.xml) and auto-discovered at startup. The CompositeExternalParser would fork a process for each configured tool on every Tika initialization to check if the tool was available.

In Tika 4.x:

  • External parsers must be explicitly configured in JSON — no auto-discovery.

  • The checkCommandLine runs lazily on first use, not at startup.

  • Three independent stream handlers (stdoutHandler, stderrHandler, outputFileHandler) replace the old outputParser/stderrParser split.

  • The contentSource field explicitly controls which stream provides text content.

  • CompositeExternalParser, ExternalParsersFactory, and the XML config reader have been removed.

See Migrating to 4.x for general migration guidance.