Pipes Configuration
The pipes section of the JSON config controls the pipeline process itself:
how many forked JVMs to run, timeouts, memory management, and parse behavior.
{
  "pipes": {
    "numClients": 4,
    "socketTimeoutMs": 60000,
    "maxFilesProcessedPerProcess": 10000,
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "forkedJvmArgs": ["-Xmx512m"]
  }
}
Process Management
| Field | Default | Description |
|---|---|---|
| numClients | | Number of parallel forked JVM processes. Each processes one document at a time. See Forked-JVM CPU Sizing for guidance on choosing this value relative to host CPU count. |
| forkedJvmArgs | | JVM arguments for forked processes (e.g., … |
| | | Path to the Java executable for forked processes. |
| maxFilesProcessedPerProcess | | Restart forked processes after this many files. Prevents slow-building memory leaks in parsing libraries. |
| | system default | Directory for temporary files. Consider a RAM-backed filesystem (e.g., … |
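Because every forked JVM receives the same forkedJvmArgs, the heap ceiling scales with numClients. The following is a minimal Python sketch of that arithmetic, not part of Tika: the helper names are hypothetical, and it only understands a plain -Xmx flag with an optional k, m, or g suffix.

```python
import re

# Multipliers for the size suffixes accepted by -Xmx (k, m, g; none means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(jvm_args):
    """Return the -Xmx value in bytes, or None if no -Xmx flag is present."""
    result = None
    for arg in jvm_args:
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", arg)
        if match:
            # Keep scanning so that a later -Xmx overrides an earlier one.
            result = int(match.group(1)) * _UNITS[match.group(2).lower()]
    return result


def total_forked_heap_bytes(pipes_config):
    """Upper bound on combined heap across all forked JVMs (hypothetical helper)."""
    per_jvm = max_heap_bytes(pipes_config.get("forkedJvmArgs", []))
    if per_jvm is None:
        return None
    return per_jvm * pipes_config["numClients"]


pipes = {"numClients": 4, "forkedJvmArgs": ["-Xmx512m"]}
print(total_forked_heap_bytes(pipes) // 1024 ** 2, "MiB")  # prints: 2048 MiB
```

This counts heap only. Resident memory per JVM will be higher once metaspace, thread stacks, and native allocations are included, so treat the result as a lower bound when sizing a host.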
Timeouts
See also Timeouts for the full timeout model.
| Field | Default | Description |
|---|---|---|
| socketTimeoutMs | | Maximum time (ms) to wait for data from a forked process. If no heartbeat or result is received within this window, the parse is considered hung. |
| | | Interval (ms) between heartbeats sent from the forked process. Must be significantly less than … |
| | | Shut down an idle forked process after this many milliseconds of inactivity. |
| | | Maximum time (ms) to wait for an available forked process when all are busy. |
| | | How long (seconds) a fetcher-emitter pairing can sit idle in the cache before it is eligible for eviction. Increase if your pipeline has long quiet periods between tuples that reuse the same fetcher/emitter. |
| | | How often (seconds) the stale-fetcher reaper runs. |
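The hung-parse rule described for the socket timeout can be illustrated with a small Python sketch. This is a simplified model, not Tika's implementation: the function name is hypothetical, and whether the real comparison is strict or inclusive at the boundary is an assumption.

```python
def is_hung(last_signal_ms, now_ms, socket_timeout_ms):
    """True if neither a heartbeat nor a result arrived within the window.

    ASSUMPTION: a strict comparison is used here; the real boundary
    behavior is not documented in this section.
    """
    return now_ms - last_signal_ms > socket_timeout_ms


SOCKET_TIMEOUT_MS = 60000  # socketTimeoutMs from the sample config

# A long parse stays alive as long as heartbeats keep arriving.
heartbeats = [0, 20000, 40000, 60000]  # hypothetical arrival times (ms)
last = heartbeats[-1]

print(is_hung(last, 90000, SOCKET_TIMEOUT_MS))   # False: 30 s since the last heartbeat
print(is_hung(last, 120001, SOCKET_TIMEOUT_MS))  # True: more than 60 s of silence
```

The model shows why the heartbeat interval has to be well below the timeout window: a slow but healthy parse is only distinguishable from a hung one if several heartbeats fit inside each window.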
Parse Behavior
| Field | Default | Description |
|---|---|---|
| parseMode | | How embedded documents are handled: … |
| onParseException | | What to do when a parse fails: … |
| | | When … |
Async / Emit Batching
These settings control how parsed results are batched before sending to emitters.
| Field | Default | Description |
|---|---|---|
| | | Number of emitter threads. |
| | | Size of the fetch/emit tuple queue. |
| | | Flush the emit batch if nothing has been emitted within this many milliseconds, even if the batch is not full. |
| | | Flush the emit batch when the estimated size reaches this many bytes. |
| | | When … |
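The two flush triggers in this table (estimated size and elapsed time) combine as in the toy Python model below. It is a sketch for intuition only: the class, its parameter names, and the in-memory `flushed` list standing in for an emitter are all hypothetical and do not mirror Tika's classes.

```python
class EmitBatch:
    """Toy batch that flushes on size or on idle time (names are hypothetical)."""

    def __init__(self, max_bytes, emit_within_ms):
        self.max_bytes = max_bytes
        self.emit_within_ms = emit_within_ms
        self.items = []
        self.estimated_bytes = 0
        self.last_emit_ms = 0
        self.flushed = []  # stands in for the real emitter

    def add(self, item, size_bytes, now_ms):
        """Queue an item and flush immediately if the size limit is reached."""
        self.items.append(item)
        self.estimated_bytes += size_bytes
        if self.estimated_bytes >= self.max_bytes:
            self.flush(now_ms)

    def tick(self, now_ms):
        """Called periodically; flushes a partial batch that has waited too long."""
        if self.items and now_ms - self.last_emit_ms >= self.emit_within_ms:
            self.flush(now_ms)

    def flush(self, now_ms):
        """Hand the current batch to the emitter and reset the counters."""
        self.flushed.append(list(self.items))
        self.items.clear()
        self.estimated_bytes = 0
        self.last_emit_ms = now_ms


batch = EmitBatch(max_bytes=1000, emit_within_ms=500)
batch.add("a", 600, now_ms=0)
batch.add("b", 600, now_ms=10)   # 1200 >= 1000, so the batch flushes on size
batch.add("c", 100, now_ms=20)
batch.tick(now_ms=300)           # only 290 ms since the last emit: no flush
batch.tick(now_ms=600)           # 590 ms since the last emit: flush on time
print(batch.flushed)             # [['a', 'b'], ['c']]
```

A larger byte limit favors throughput by sending fewer, bigger batches; a shorter time limit favors latency by bounding how long a small batch can wait.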
Emit Strategy
emitStrategy controls whether parsed extracts are emitted directly from the forked PipesServer or passed back to the parent process first. The default is balanced for typical workloads; tune it only if you have a memory or throughput problem.
{
  "pipes": {
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 100000
    }
  }
}
| Field | Default | Description |
|---|---|---|
| type | | One of … |
| thresholdBytes | | Only used when … |
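One plausible reading of how type and thresholdBytes interact is sketched in Python below. This is an assumption, not confirmed by this page: it supposes that EMIT_ALL always emits from the forked process, and that DYNAMIC emits from the forked process only when the extract is larger than thresholdBytes, passing smaller extracts back to the parent. The function name and return labels are hypothetical, and only the two strategy values that appear in this page's examples are handled.

```python
def emit_location(strategy_type, extract_bytes, threshold_bytes=None):
    """Return "forked" or "parent" for where an extract would be emitted.

    ASSUMPTION: the semantics below are inferred from the option names,
    not taken from the Tika source.
    """
    if strategy_type == "EMIT_ALL":
        # Assumed: every extract is emitted directly by the forked process.
        return "forked"
    if strategy_type == "DYNAMIC":
        if threshold_bytes is None:
            raise ValueError("DYNAMIC needs a thresholdBytes value")
        # Assumed: large extracts are emitted in place so they never cross
        # the process boundary; small ones go back to the parent.
        return "forked" if extract_bytes > threshold_bytes else "parent"
    raise ValueError(f"strategy not covered by this sketch: {strategy_type}")


print(emit_location("DYNAMIC", 5_000, threshold_bytes=100_000))    # parent
print(emit_location("DYNAMIC", 250_000, threshold_bytes=100_000))  # forked
print(emit_location("EMIT_ALL", 5_000))                            # forked
```

Under this reading, lowering thresholdBytes reduces memory pressure in the parent, while raising it lets more small results flow through the parent process.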
Distributed Config Store
For multi-host pipelines (e.g., shared-server clusters) you can store fetcher/emitter configuration in a distributed backend instead of memory. Most users should leave the defaults.
| Field | Default | Description |
|---|---|---|
| | | Backend for storing fetcher/emitter configurations. |
| | | JSON object (as a string) with backend-specific parameters. Structure depends on … |
Shared Server Mode (Experimental)
| Field | Default | Description |
|---|---|---|
| useSharedServer | | When … |
See Shared Server Mode for details.
Complete examples
Worked-out end-to-end configs from the test tree. Each is loaded by an automated test, so the syntax stays current.
Filesystem-to-filesystem pipeline
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
The tokens FETCHER_BASE_PATH, EMITTER_BASE_PATH, PLUGINS_PATHS, and EMIT_INTERMEDIATE_RESULTS are placeholders that the test harness substitutes; replace them with real values in production configs. The first three are paths. EMIT_INTERMEDIATE_RESULTS stands for the boolean emitIntermediateResults flag, so replace the quoted token with true or false.
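If you want to reuse one of these templates outside the test harness, the placeholders can be filled programmatically. The Python sketch below is one way to do it and is not how the Tika test harness itself works: `fill_tokens` is a hypothetical helper, the paths are made-up examples, and the template is trimmed to three keys for brevity. Working on the parsed JSON rather than the raw text means paths are escaped correctly and the boolean token becomes a real JSON boolean.

```python
import json


def fill_tokens(node, values):
    """Recursively replace string values that exactly match a token name."""
    if isinstance(node, dict):
        return {key: fill_tokens(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [fill_tokens(child, values) for child in node]
    if isinstance(node, str) and node in values:
        return values[node]
    return node


# Trimmed-down copy of the template; the tokens are ordinary JSON strings.
template = json.loads("""
{
  "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "FETCHER_BASE_PATH"}}},
  "pipes": {"emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS"},
  "plugin-roots": "PLUGINS_PATHS"
}
""")

values = {
    "FETCHER_BASE_PATH": "/data/in",        # hypothetical path
    "EMITTER_BASE_PATH": "/data/out",       # hypothetical path
    "PLUGINS_PATHS": "/opt/tika/plugins",   # hypothetical path
    "EMIT_INTERMEDIATE_RESULTS": False,     # a real boolean, not a quoted string
}

config = fill_tokens(template, values)
print(json.dumps(config, indent=2))
```

Only values that match a token exactly are replaced, so ordinary strings such as "json" or "fsf" pass through untouched.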
Emit-all variant
{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes": {
    "numClients": 1,
    "forkedJvmArgs": [
      "-Xmx256m"
    ],
    "emitStrategy": {
      "type": "EMIT_ALL"
    }
  },
  "parse-context": {
    "timeout-limits": {
      "progressTimeoutMillis": 60000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
Shared-server (YOLO) mode
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "OVERWRITE"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "useSharedServer": true,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
See Shared Server Mode for the trade-offs.
Tika Pipes config template
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "parsers": [
    {
      "default-parser": {}
    },
    {
      "pdf-parser": {
        "extractActions": true,
        "extractInlineImages": true,
        "extractIncrementalUpdateInfo": true,
        "parseIncrementalUpdates": true
      }
    },
    {
      "ooxml-parser": {
        "includeDeletedContent": true,
        "includeMoveFromContent": true,
        "extractMacros": true
      }
    },
    {
      "office-parser": {
        "extractMacros": true
      }
    }
  ],
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA"
  },
  "plugin-roots": "PLUGIN_ROOTS"
}
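In the configs above, the fetcherId and emitterId values under pipes-iterator ("fsf" and "fse") match the keys defined under fetchers and emitters. Assuming that correspondence is required, which is an inference from the examples and not something this page states, a quick pre-flight check in Python might look like the sketch below. The function name is hypothetical and the inline config is a trimmed-down stand-in for a real file.

```python
def check_iterator_ids(config):
    """Return a list of problems; an empty list means every id resolves.

    ASSUMPTION: fetcherId/emitterId must name a key under "fetchers" /
    "emitters", as they do in the sample configs.
    """
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    for name, iterator in config.get("pipes-iterator", {}).items():
        fetcher_id = iterator.get("fetcherId")
        emitter_id = iterator.get("emitterId")
        if fetcher_id not in fetchers:
            problems.append(f"{name}: unknown fetcherId {fetcher_id!r}")
        if emitter_id not in emitters:
            problems.append(f"{name}: unknown emitterId {emitter_id!r}")
    return problems


config = {
    "fetchers": {"fsf": {"file-system-fetcher": {}}},
    "emitters": {"fse": {"file-system-emitter": {}}},
    "pipes-iterator": {
        "file-system-pipes-iterator": {"fetcherId": "fsf", "emitterId": "fs3"}
    },
}

for problem in check_iterator_ids(config):
    print(problem)  # file-system-pipes-iterator: unknown emitterId 'fs3'
```

Running a check like this before starting a pipeline catches simple id typos without launching any forked JVMs.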
For per-plugin pipeline examples (S3, OpenSearch, JDBC, Kafka, etc.), see the relevant page under Plugins.