External Parser Configuration

Table of Contents

Key Concepts
Configuration Options
Examples
Changes from 3.x

The ExternalParser allows Tika to delegate parsing to external command-line programs such as ffmpeg, exiftool, or sox. Each external parser is configured via JSON and must be explicitly enabled — Tika 4.x does not auto-discover external tools at startup.

Key Concepts

Lazy Check

Each external parser can declare a checkCommandLine that verifies the tool is installed. The check runs lazily on first use (not at startup), and if the tool is not found, the parser silently disables itself.
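
The lazy check can be pictured with a small model. The Python sketch below is not Tika's implementation (Tika is written in Java); the class name LazyToolCheck and its members are hypothetical. It assumes the check command runs once, on first use, and that the result is cached. It also assumes that a command which cannot be launched at all counts as unavailable.

```python
import subprocess
import sys


class LazyToolCheck:
    """Hypothetical model of a lazy availability check (not Tika code)."""

    def __init__(self, check_command_line, check_error_codes=(127,)):
        self.check_command_line = list(check_command_line)
        self.check_error_codes = set(check_error_codes)
        self.runs = 0           # how many times the check process was started
        self._available = None  # None means "not checked yet"

    def is_available(self):
        # Run the check only on first use; later calls reuse the cached result.
        if self._available is None:
            self.runs += 1
            try:
                result = subprocess.run(
                    self.check_command_line,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    timeout=60,
                )
                self._available = result.returncode not in self.check_error_codes
            except (OSError, subprocess.TimeoutExpired):
                # Assumption: a command that cannot be launched is unavailable.
                self._available = False
        return self._available


# Stand-ins for real tools: the Python interpreter exiting with a chosen code.
present = LazyToolCheck([sys.executable, "-c", "import sys; sys.exit(0)"])
missing = LazyToolCheck([sys.executable, "-c", "import sys; sys.exit(127)"])
```

Nothing is forked when the objects are created. The first call to is_available() starts the check process, and every later call returns the cached answer.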

Stream Handlers

An external process produces up to three output streams. Each can have an independent handler (any Tika parser):

stdoutHandler — processes stdout
stderrHandler — processes stderr
outputFileHandler — processes the output file (when ${OUTPUT_FILE} is used)

Handlers extract metadata, content, or both. regex-capture-parser is the most common choice for extracting metadata via regex patterns.

Content Source

The contentSource field controls which stream provides the XHTML text content:

"stdout" — default when no ${OUTPUT_FILE} in the command
"outputFile" — default when ${OUTPUT_FILE} is in the command
"stderr" — use stderr as the content source
"none" — metadata-only mode, no text content extracted

When a handler is configured for the content source stream, its ContentHandler output becomes the XHTML content. When no handler is configured, the raw bytes are written as text.
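
The default rule for contentSource can be written down in a few lines. This is a Python sketch of the rule as described in this section, not Tika's Java code; the function name resolve_content_source and the tool name some-tool are made up for illustration.

```python
VALID_SOURCES = {"stdout", "stderr", "outputFile", "none"}


def resolve_content_source(command_line, content_source=None):
    """Return the effective content source for a command (illustrative only)."""
    if content_source is not None:
        # An explicit setting wins, as long as it is one of the four values.
        if content_source not in VALID_SOURCES:
            raise ValueError(f"unsupported contentSource: {content_source!r}")
        return content_source
    # No explicit setting: the default depends on whether the command
    # writes to an output file.
    uses_output_file = any("${OUTPUT_FILE}" in arg for arg in command_line)
    return "outputFile" if uses_output_file else "stdout"


# "some-tool" is a placeholder, not a real program.
to_stdout = resolve_content_source(["some-tool", "${INPUT_FILE}"])
to_file = resolve_content_source(["some-tool", "${INPUT_FILE}", "${OUTPUT_FILE}"])
metadata_only = resolve_content_source(["some-tool", "${INPUT_FILE}"], "none")
```

In this model, to_stdout is "stdout", to_file is "outputFile", and metadata_only is "none".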

Configuration Options

Field	Type	Description
`commandLine`	`List<String>`	The command and arguments to run. Use `${INPUT_FILE}` and `${OUTPUT_FILE}` tokens for file paths.
`checkCommandLine`	`List<String>`	Optional. Command to verify the tool is installed (e.g., `["ffmpeg", "-version"]`).
`checkErrorCodes`	`List<Integer>`	Exit codes that indicate the tool is not available. Default: `[127]`.
`stdoutHandler`	Parser config	Optional. Parser to process stdout.
`stderrHandler`	Parser config	Optional. Parser to process stderr.
`outputFileHandler`	Parser config	Optional. Parser to process the output file.
`contentSource`	`String`	Which stream provides XHTML content: `"stdout"`, `"stderr"`, `"outputFile"`, or `"none"`. Default depends on command.
`returnStdout`	`boolean`	Store raw stdout in metadata. Default: `false`.
`returnStderr`	`boolean`	Store raw stderr in metadata. Default: `true`.
`timeoutMs`	`long`	Process timeout in milliseconds. Default: `60000`.
`maxStdOut`	`int`	Maximum stdout bytes to capture. Default: `10000`.
`maxStdErr`	`int`	Maximum stderr bytes to capture. Default: `10000`.
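
A config file can be sanity-checked against the table above before it is handed to Tika. The script below is a rough, unofficial sketch: check_external_parser is a hypothetical helper, the type mapping is a reading of the table, and the script assumes that commandLine is required and that keys starting with an underscore (such as _mime-include) fall outside the table. Tika's own validation may differ.

```python
import json

# Python types corresponding to the documented field types (an interpretation).
FIELD_TYPES = {
    "commandLine": list, "checkCommandLine": list, "checkErrorCodes": list,
    "stdoutHandler": dict, "stderrHandler": dict, "outputFileHandler": dict,
    "contentSource": str, "returnStdout": bool, "returnStderr": bool,
    "timeoutMs": int, "maxStdOut": int, "maxStdErr": int,
}
CONTENT_SOURCES = {"stdout", "stderr", "outputFile", "none"}


def check_external_parser(cfg):
    """Return a list of problems found in one "external-parser" object."""
    problems = []
    if "commandLine" not in cfg:
        problems.append("commandLine is missing")  # assumption: required
    for key, value in cfg.items():
        if key.startswith("_"):
            continue  # e.g. "_mime-include"; not covered by the table
        expected = FIELD_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown field: {key}")
        elif not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)  # bool is an int in Python
        ):
            problems.append(f"{key} should be {expected.__name__}")
    source = cfg.get("contentSource")
    if isinstance(source, str) and source not in CONTENT_SOURCES:
        problems.append(
            f'contentSource "{source}" is not one of {sorted(CONTENT_SOURCES)}'
        )
    return problems


# A deliberately broken fragment: wrong type for timeoutMs, bad contentSource.
sample = json.loads("""
{
  "_mime-include": ["video/mp4"],
  "commandLine": ["exiftool", "${INPUT_FILE}"],
  "checkCommandLine": ["exiftool", "-ver"],
  "contentSource": "stdin",
  "timeoutMs": "60000"
}
""")
problems = check_external_parser(sample)
```

For the broken fragment, problems holds two entries: one for the string-valued timeoutMs and one for the unsupported contentSource.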

Examples

Exiftool (metadata from stdout)

Extracts metadata from media files using exiftool. The stdoutHandler uses regex-capture-parser to extract key-value pairs from exiftool’s stdout.

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo",
          "video/mp4"
        ],
        "commandLine": ["exiftool", "${INPUT_FILE}"],
        "checkCommandLine": ["exiftool", "-ver"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "stdoutHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "mime": "^MIME Type\\s+: ([^\\r\\n]+)",
              "pages": "^Page Count\\s+: ([^\\r\\n]+)",
              "pdf:version": "^PDF Version\\s+: ([^\\r\\n]+)"
            }
          }
        }
      }
    }
  ]
}

FFmpeg (metadata from stderr)

Extracts audio/video metadata from ffmpeg -i output. FFmpeg writes metadata to stderr, so this uses stderrHandler.

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo"
        ],
        "commandLine": ["ffmpeg", "-i", "${INPUT_FILE}"],
        "checkCommandLine": ["ffmpeg", "-version"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "returnStderr": true,
        "maxStdErr": 20000,
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:audioSampleRate": "\\s*Stream.*:.+Audio:.*,\\s+(\\d+)\\s+Hz,.*",
              "xmpDM:audioChannelType": "\\s*Stream.*:.+Audio:.*\\d+\\s+Hz,\\s+(\\d{1,2})\\s+channels.*",
              "xmpDM:audioCompressor": "\\s*Stream.*:.+Audio:\\s+([A-Za-z0-9_\\(\\)/\\[\\] ]+),.*",
              "xmpDM:duration": "\\s*Duration:\\s*([0-9:\\.]+),.*",
              "xmpDM:fileDataRate": "\\s*Duration:.*,\\s*bitrate:\\s+([0-9A-Za-z/ ]+).*",
              "xmpDM:videoColorSpace": "\\s*Stream.*:\\s+Video:\\s+[A-Za-z0-9\\(\\)/ ]+,\\s+([A-Za-z0-9\\(\\) ,]+),\\s+[0-9x]+,.*",
              "xmpDM:videoCompressor": "\\s*Stream.*:\\s+Video:\\s+([A-Za-z0-9\\(\\)/ ]+),.*",
              "xmpDM:videoFrameRate": "\\s*Stream.*:\\s+Video:.*,\\s+([0-9]+)\\s+fps,.*",
              "encoder": "\\s*encoder\\s*\\:\\s*(\\w+).*",
              "videoResolution": "\\s*Stream.*:\\s+Video:.*,\\s+([0-9x]+),.*"
            }
          }
        }
      }
    }
  ]
}
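
Patterns like these are easier to get right when tried against sample output first. The following Python sketch does a dry run of three patterns from the captureMap above. It is only an approximation: it assumes each pattern is applied line by line and that the first capture group becomes the metadata value, which may not match how regex-capture-parser behaves in detail, and Python's re module is not identical to Java's regex engine. The sample text is a simplified, hand-written stand-in for ffmpeg's stderr, and dry_run is a hypothetical helper.

```python
import re

# Three patterns copied from the FFmpeg captureMap, with JSON escaping removed.
capture_map = {
    "xmpDM:duration": r"\s*Duration:\s*([0-9:\.]+),.*",
    "xmpDM:audioSampleRate": r"\s*Stream.*:.+Audio:.*,\s+(\d+)\s+Hz,.*",
    "xmpDM:videoFrameRate": r"\s*Stream.*:\s+Video:.*,\s+([0-9]+)\s+fps,.*",
}

# Simplified, hand-written stand-in for ffmpeg's stderr (not real tool output).
sample_stderr = """\
  Duration: 00:00:10.00, start: 0.000000, bitrate: 1205 kb/s
    Stream #0:0: Video: h264 (High), yuv420p, 1280x720, 25 fps, 25 tbr
    Stream #0:1: Audio: aac (LC), 44100 Hz, stereo, fltp, 128 kb/s
"""


def dry_run(patterns, text):
    """Collect group 1 of the first matching line for each key."""
    found = {}
    for key, pattern in patterns.items():
        regex = re.compile(pattern)
        for line in text.splitlines():
            match = regex.search(line)
            if match:
                found[key] = match.group(1)
                break
    return found


captured = dry_run(capture_map, sample_stderr)
```

A key that is absent from captured points to a pattern that never matched, which is a cue to compare that pattern with the tool's actual output.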

Sox (audio metadata from stderr)

Extracts audio metadata using sox --info. Like FFmpeg, Sox writes to stderr.

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "audio/mpeg",
          "audio/mp3",
          "audio/wav",
          "audio/x-wav",
          "audio/ogg",
          "audio/vorbis",
          "audio/mp4"
        ],
        "commandLine": ["sox", "--info", "${INPUT_FILE}"],
        "checkCommandLine": ["sox", "--version"],
        "checkErrorCodes": [126, 127],
        "returnStderr": true,
        "maxStdErr": 10000,
        "contentSource": "none",
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:audioChannelType": "\\s*Channels.*:\\s+(\\d+)\\s*",
              "xmpDM:audioSampleRate": "\\s*Sample Rate.*:\\s+(\\d+)\\s*",
              "xmpDM:audioSampleType": "\\s*Precision.*:\\s+([\\d\\w-]+)\\s*",
              "xmpDM:duration": "\\s*Duration.*:\\s+([\\d:\\.]+)\\s*",
              "File Size": "\\s*File Size.*:\\s+([\\d\\w]+)\\s*",
              "xmpDM:fileDataRate": "\\s*Bit Rate.*:\\s+([\\d\\w]+)\\s*",
              "Sample Encoding": "\\s*Sample Encoding.*:\\s+(.*)\\s*",
              "xmpDM:logComment": "\\s*Comment.*:\\s+(.*)\\s*"
            }
          }
        }
      }
    }
  ]
}

Multiple External Parsers

You can configure multiple external parsers in a single config file. Each handles different MIME types via _mime-include. Here FFmpeg handles video files while exiftool handles PDFs:

{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo"
        ],
        "commandLine": ["ffmpeg", "-i", "${INPUT_FILE}"],
        "checkCommandLine": ["ffmpeg", "-version"],
        "checkErrorCodes": [126, 127],
        "returnStderr": true,
        "maxStdErr": 20000,
        "contentSource": "none",
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:duration": "\\s*Duration:\\s*([0-9:\\.]+),.*",
              "xmpDM:audioSampleRate": "\\s*Stream.*:.+Audio:.*,\\s+(\\d+)\\s+Hz,.*"
            }
          }
        }
      }
    },
    {
      "external-parser": {
        "_mime-include": [
          "application/pdf"
        ],
        "commandLine": ["exiftool", "${INPUT_FILE}"],
        "checkCommandLine": ["exiftool", "-ver"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "stdoutHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "exiftool:MIMEType": "^MIME Type\\s+: ([^\\r\\n]+)",
              "exiftool:PageCount": "^Page Count\\s+: ([^\\r\\n]+)",
              "exiftool:PDFVersion": "^PDF Version\\s+: ([^\\r\\n]+)"
            }
          }
        }
      }
    }
  ]
}

Changes from 3.x

In Tika 3.x, external parsers were configured via XML (tika-external-parsers.xml) and auto-discovered at startup. On every Tika initialization, the CompositeExternalParser forked a process for each configured tool to check whether the tool was available.

In Tika 4.x:

External parsers must be explicitly configured in JSON — no auto-discovery.
The checkCommandLine runs lazily on first use, not at startup.
Three independent stream handlers (stdoutHandler, stderrHandler, outputFileHandler) replace the old outputParser/stderrParser split.
The contentSource field explicitly controls which stream provides text content.
CompositeExternalParser, ExternalParsersFactory, and the XML config reader have been removed.

See Migrating to 4.x for general migration guidance.