External Parser Configuration
The ExternalParser allows Tika to delegate parsing to external command-line
programs such as ffmpeg, exiftool, or sox. Each external parser is
configured via JSON and must be explicitly enabled — Tika 4.x does not
auto-discover external tools at startup.
Key Concepts
Lazy Check
Each external parser can declare a checkCommandLine that verifies the tool
is installed. The check runs lazily on first use (not at startup), and if the
tool is not found, the parser silently disables itself.
Stream Handlers
An external process produces up to three output streams. Each can have an independent handler (any Tika parser):
- stdoutHandler: processes stdout
- stderrHandler: processes stderr
- outputFileHandler: processes the output file (when ${OUTPUT_FILE} is used)
Handlers extract metadata, content, or both. regex-capture-parser is the
most common choice for extracting metadata via regex patterns.
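As an illustration of two handlers on one process, the sketch below attaches a regex-capture-parser to both stdout and stderr of a single command. It assumes ffprobe (shipped with FFmpeg) is on the PATH. The ffprobe command line, the ffprobe:formatName key, and the format_name pattern are assumptions chosen for this example and are not taken from Tika itself; the stderr settings simply mirror the FFmpeg example in this document.

```json
{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/mp4"
        ],
        "commandLine": ["ffprobe", "-show_format", "${INPUT_FILE}"],
        "checkCommandLine": ["ffprobe", "-version"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "returnStderr": true,
        "maxStdErr": 20000,
        "stdoutHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "ffprobe:formatName": "^format_name=([^\\r\\n]+)"
            }
          }
        },
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:duration": "\\s*Duration:\\s*([0-9:\\.]+),.*"
            }
          }
        }
      }
    }
  ]
}
```

With contentSource set to "none", both handlers should contribute metadata only: one key from the key=value lines that ffprobe prints to stdout and one from the banner it prints to stderr.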
Content Source
The contentSource field controls which stream provides the XHTML text content:
- "stdout": default when the command contains no ${OUTPUT_FILE}
- "outputFile": default when the command contains ${OUTPUT_FILE}
- "stderr": use stderr as the content source
- "none": metadata-only mode; no text content is extracted
When a handler is configured for the content source stream, its ContentHandler output becomes the XHTML content. When no handler is configured, the raw bytes are written as text.
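To make the outputFile case concrete, here is a minimal sketch that uses pdftotext (from Poppler or Xpdf) as a hypothetical text extractor; the tool choice is an assumption for illustration and does not come from the Tika documentation. Because the command contains ${OUTPUT_FILE}, contentSource would already default to "outputFile"; it is spelled out here only for clarity.

```json
{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "application/pdf"
        ],
        "commandLine": ["pdftotext", "${INPUT_FILE}", "${OUTPUT_FILE}"],
        "checkCommandLine": ["pdftotext", "-v"],
        "checkErrorCodes": [126, 127],
        "contentSource": "outputFile"
      }
    }
  ]
}
```

No outputFileHandler is configured in this sketch, so by the rule described above the raw bytes of the output file would be written as the text content.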
Configuration Options
| Field | Type | Description |
|---|---|---|
| commandLine | | The command and arguments to run. Use ${INPUT_FILE} and ${OUTPUT_FILE} as placeholders for the input and output files. |
| checkCommandLine | | Optional. Command to verify the tool is installed (e.g., ["exiftool", "-ver"]). |
| checkErrorCodes | | Exit codes that indicate the tool is not available. Default: |
| stdoutHandler | Parser config | Optional. Parser to process stdout. |
| stderrHandler | Parser config | Optional. Parser to process stderr. |
| outputFileHandler | Parser config | Optional. Parser to process the output file. |
| contentSource | | Which stream provides XHTML content: "stdout", "outputFile", "stderr", or "none". |
| | | Store raw stdout in metadata. Default: |
| returnStderr | | Store raw stderr in metadata. Default: |
| | | Process timeout in milliseconds. Default: |
| | | Maximum stdout bytes to capture. Default: |
| maxStdErr | | Maximum stderr bytes to capture. Default: |
Examples
Exiftool (metadata from stdout)
Extracts metadata from media files using exiftool. The stdoutHandler uses
regex-capture-parser to extract key-value pairs from exiftool’s stdout.
{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo",
          "video/mp4"
        ],
        "commandLine": ["exiftool", "${INPUT_FILE}"],
        "checkCommandLine": ["exiftool", "-ver"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "stdoutHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "mime": "^MIME Type\\s+: ([^\\r\\n]+)",
              "pages": "^Page Count\\s+: ([^\\r\\n]+)",
              "pdf:version": "^PDF Version\\s+: ([^\\r\\n]+)"
            }
          }
        }
      }
    }
  ]
}
FFmpeg (metadata from stderr)
Extracts audio/video metadata from ffmpeg -i output. FFmpeg writes metadata
to stderr, so this uses stderrHandler.
{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo"
        ],
        "commandLine": ["ffmpeg", "-i", "${INPUT_FILE}"],
        "checkCommandLine": ["ffmpeg", "-version"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "returnStderr": true,
        "maxStdErr": 20000,
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:audioSampleRate": "\\s*Stream.*:.+Audio:.*,\\s+(\\d+)\\s+Hz,.*",
              "xmpDM:audioChannelType": "\\s*Stream.*:.+Audio:.*\\d+\\s+Hz,\\s+(\\d{1,2})\\s+channels.*",
              "xmpDM:audioCompressor": "\\s*Stream.*:.+Audio:\\s+([A-Za-z0-9_\\(\\)/\\[\\] ]+),.*",
              "xmpDM:duration": "\\s*Duration:\\s*([0-9:\\.]+),.*",
              "xmpDM:fileDataRate": "\\s*Duration:.*,\\s*bitrate:\\s+([0-9A-Za-z/ ]+).*",
              "xmpDM:videoColorSpace": "\\s*Stream.*:\\s+Video:\\s+[A-Za-z0-9\\(\\)/ ]+,\\s+([A-Za-z0-9\\(\\) ,]+),\\s+[0-9x]+,.*",
              "xmpDM:videoCompressor": "\\s*Stream.*:\\s+Video:\\s+([A-Za-z0-9\\(\\)/ ]+),.*",
              "xmpDM:videoFrameRate": "\\s*Stream.*:\\s+Video:.*,\\s+([0-9]+)\\s+fps,.*",
              "encoder": "\\s*encoder\\s*\\:\\s*(\\w+).*",
              "videoResolution": "\\s*Stream.*:\\s+Video:.*,\\s+([0-9x]+),.*"
            }
          }
        }
      }
    }
  ]
}
Sox (audio metadata from stderr)
Extracts audio metadata using sox --info. Like FFmpeg, Sox writes to stderr.
{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "audio/mpeg",
          "audio/mp3",
          "audio/wav",
          "audio/x-wav",
          "audio/ogg",
          "audio/vorbis",
          "audio/mp4"
        ],
        "commandLine": ["sox", "--info", "${INPUT_FILE}"],
        "checkCommandLine": ["sox", "--version"],
        "checkErrorCodes": [126, 127],
        "returnStderr": true,
        "maxStdErr": 10000,
        "contentSource": "none",
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:audioChannelType": "\\s*Channels.*:\\s+(\\d+)\\s*",
              "xmpDM:audioSampleRate": "\\s*Sample Rate.*:\\s+(\\d+)\\s*",
              "xmpDM:audioSampleType": "\\s*Precision.*:\\s+([\\d\\w-]+)\\s*",
              "xmpDM:duration": "\\s*Duration.*:\\s+([\\d:\\.]+)\\s*",
              "File Size": "\\s*File Size.*:\\s+([\\d\\w]+)\\s*",
              "xmpDM:fileDataRate": "\\s*Bit Rate.*:\\s+([\\d\\w]+)\\s*",
              "Sample Encoding": "\\s*Sample Encoding.*:\\s+(.*)\\s*",
              "xmpDM:logComment": "\\s*Comment.*:\\s+(.*)\\s*"
            }
          }
        }
      }
    }
  ]
}
Multiple External Parsers
You can configure multiple external parsers in a single config file. Each
handles different MIME types via _mime-include. Here FFmpeg handles video
files while exiftool handles PDFs:
{
  "parsers": [
    {
      "external-parser": {
        "_mime-include": [
          "video/avi",
          "video/mpeg",
          "video/x-msvideo"
        ],
        "commandLine": ["ffmpeg", "-i", "${INPUT_FILE}"],
        "checkCommandLine": ["ffmpeg", "-version"],
        "checkErrorCodes": [126, 127],
        "returnStderr": true,
        "maxStdErr": 20000,
        "contentSource": "none",
        "stderrHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "xmpDM:duration": "\\s*Duration:\\s*([0-9:\\.]+),.*",
              "xmpDM:audioSampleRate": "\\s*Stream.*:.+Audio:.*,\\s+(\\d+)\\s+Hz,.*"
            }
          }
        }
      }
    },
    {
      "external-parser": {
        "_mime-include": [
          "application/pdf"
        ],
        "commandLine": ["exiftool", "${INPUT_FILE}"],
        "checkCommandLine": ["exiftool", "-ver"],
        "checkErrorCodes": [126, 127],
        "contentSource": "none",
        "stdoutHandler": {
          "regex-capture-parser": {
            "captureMap": {
              "exiftool:MIMEType": "^MIME Type\\s+: ([^\\r\\n]+)",
              "exiftool:PageCount": "^Page Count\\s+: ([^\\r\\n]+)",
              "exiftool:PDFVersion": "^PDF Version\\s+: ([^\\r\\n]+)"
            }
          }
        }
      }
    }
  ]
}
Changes from 3.x
In Tika 3.x, external parsers were configured via XML (tika-external-parsers.xml)
and auto-discovered at startup. The CompositeExternalParser would fork
a process for each configured tool on every Tika initialization to check
if the tool was available.
In Tika 4.x:
- External parsers must be explicitly configured in JSON; there is no auto-discovery.
- The checkCommandLine runs lazily on first use, not at startup.
- Three independent stream handlers (stdoutHandler, stderrHandler, outputFileHandler) replace the old outputParser/stderrParser split.
- The contentSource field explicitly controls which stream provides text content.
- CompositeExternalParser, ExternalParsersFactory, and the XML config reader have been removed.
See Migrating to 4.x for general migration guidance.