Pipes Configuration

Table of Contents

Process Management
Timeouts
Parse Behavior
Async / Emit Batching
Emit Strategy
Distributed Config Store
Shared Server Mode (Experimental)
Complete examples

The pipes section of the JSON config controls the pipeline process itself: how many forked JVMs to run, timeouts, memory management, and parse behavior.

{
  "pipes": {
    "numClients": 4,
    "socketTimeoutMs": 60000,
    "maxFilesProcessedPerProcess": 10000,
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "forkedJvmArgs": ["-Xmx512m"]
  }
}

Process Management

Field	Default	Description
`numClients`	`4`	Number of parallel forked JVM processes. Each process handles one document at a time. See Forked-JVM CPU Sizing for guidance on choosing this value relative to host CPU count.
`forkedJvmArgs`	`[]`	JVM arguments for forked processes (e.g., `["-Xmx512m", "-Xms256m"]`). When `numClients > 1`, Tika auto-injects `-XX:ActiveProcessorCount` to right-size each fork’s GC and JIT thread pools unless you provide your own; see Forked-JVM CPU Sizing.
`javaPath`	`java`	Path to the Java executable for forked processes.
`maxFilesProcessedPerProcess`	`10000`	Restart forked processes after this many files. Prevents slow-building memory leaks in parsing libraries.
`tempDirectory`	system default	Directory for temporary files. Consider a RAM-backed filesystem (e.g., `/dev/shm`) for better performance.
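
As a rough illustration of how the fields above combine, the fragment below sizes the pool explicitly and supplies its own `-XX:ActiveProcessorCount`, so Tika would not inject one. The specific values (8 clients, 2 processors per fork, the `javaPath` location, a 5000-file restart interval) are assumptions chosen for illustration, not recommendations.

```json
{
  "pipes": {
    "numClients": 8,
    "forkedJvmArgs": ["-Xmx512m", "-XX:ActiveProcessorCount=2"],
    "javaPath": "/usr/lib/jvm/java-17/bin/java",
    "maxFilesProcessedPerProcess": 5000,
    "tempDirectory": "/dev/shm"
  }
}
```

With `-Xmx512m` and eight forks, the forked JVMs can claim up to about 4 GB of heap in total (8 × 512 MB), which is worth checking against host memory before raising `numClients`.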

Timeouts

See also Timeouts for the full timeout model.

Field	Default	Description
`socketTimeoutMs`	`60000`	Maximum time (ms) to wait for data from a forked process. If no heartbeat or result is received within this window, the parse is considered hung.
`heartbeatIntervalMs`	`1000`	Interval (ms) between heartbeats sent from the forked process. Must be significantly less than `socketTimeoutMs`.
`shutdownClientAfterMillis`	`300000`	Shut down an idle forked process after this many milliseconds of inactivity.
`maxWaitForClientMillis`	`60000`	Maximum time (ms) to wait for an available forked process when all are busy.
`staleFetcherTimeoutSeconds`	`600`	How long (seconds) a fetcher-emitter pairing can sit idle in the cache before it is eligible for eviction. Increase if your pipeline has long quiet periods between tuples that reuse the same fetcher/emitter.
`staleFetcherDelaySeconds`	`60`	How often (seconds) the stale-fetcher reaper runs.
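
A minimal sketch of a timeout block for a pipeline with slow documents and long quiet periods is shown here. All values are illustrative assumptions; the only constraint taken from the table is that `heartbeatIntervalMs` stays well below `socketTimeoutMs`.

```json
{
  "pipes": {
    "socketTimeoutMs": 120000,
    "heartbeatIntervalMs": 2000,
    "shutdownClientAfterMillis": 600000,
    "maxWaitForClientMillis": 120000,
    "staleFetcherTimeoutSeconds": 1800,
    "staleFetcherDelaySeconds": 120
  }
}
```

This sketch doubles both the socket timeout and the heartbeat interval, so it keeps the same 1:60 ratio between them as the defaults.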

Parse Behavior

Field	Default	Description
`parseMode`	`RMETA`	How embedded documents are handled: `RMETA` (recursive metadata list), `CONCATENATE`, `CONTENT_ONLY`, `NO_PARSE`, `UNPACK`. See Parse Modes.
`onParseException`	`EMIT`	What to do when a parse fails: `EMIT` (emit error metadata) or `SKIP` (silently skip).
`stopOnlyOnFatal`	`false`	When `false`, stop the pipeline on configuration errors (missing fetcher/emitter). When `true`, only stop on fatal initialization failures. Use `true` for server mode, `false` for batch mode.
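
For a long-running server deployment, the three fields might be combined as in this sketch. The choice of `CONCATENATE` and `SKIP` is an assumption for illustration; only `stopOnlyOnFatal: true` follows directly from the server-mode guidance in the table.

```json
{
  "pipes": {
    "parseMode": "CONCATENATE",
    "onParseException": "SKIP",
    "stopOnlyOnFatal": true
  }
}
```

Because `SKIP` drops failed parses silently, a setup like this leaves no error metadata in the output; keep the default `EMIT` if you need to see which documents failed.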

Async / Emit Batching

These settings control how parsed results are batched before being sent to emitters.

Field	Default	Description
`numEmitters`	`1`	Number of emitter threads.
`queueSize`	`10000`	Size of the fetch/emit tuple queue.
`emitWithinMillis`	`10000`	Flush the emit batch if nothing has been emitted within this many milliseconds, even if the batch is not full.
`emitMaxEstimatedBytes`	`100000`	Flush the emit batch when the estimated size reaches this many bytes.
`emitIntermediateResults`	`false`	When `false`, only successfully-parsed tuples reach the emitter — files that crash, time out, or otherwise fail are dropped from the output. When `true`, every tuple is emitted, including failures (the metadata carries the exception). Turn this on if you need a complete record of what was attempted (audit, retry logic, chaos-monkey tests).
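
The following sketch shows one way the batching fields could be set for an audit-style run that needs a record of every attempted file. The numbers are illustrative assumptions rather than tuned values.

```json
{
  "pipes": {
    "numEmitters": 2,
    "queueSize": 10000,
    "emitWithinMillis": 5000,
    "emitMaxEstimatedBytes": 500000,
    "emitIntermediateResults": true
  }
}
```

Going by the field descriptions, a batch here would be flushed once its estimated size reaches 500000 bytes, or after 5000 ms without an emit if the batch has not filled by then.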

Emit Strategy

emitStrategy controls whether parsed extracts are emitted directly from the forked PipesServer or passed back to the parent process first. The default is balanced for typical workloads; tune it only if you have a memory or throughput problem.

{
  "pipes": {
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 100000
    }
  }
}

Field	Default	Description
`type`	`DYNAMIC`	One of `DYNAMIC`, `EMIT_ALL`, `PASSBACK_ALL`. `DYNAMIC` switches per-extract based on size (see `thresholdBytes`). `EMIT_ALL` always emits from the forked process. `PASSBACK_ALL` always passes extracts back to the parent for emission.
`thresholdBytes`	`100000`	Only used when `type` is `DYNAMIC`. Extracts larger than this are emitted directly from the forked PipesServer; smaller ones are passed back to the parent. Setting `thresholdBytes` with type `EMIT_ALL` or `PASSBACK_ALL` is a config error.
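
To show the non-`DYNAMIC` form, here is a small sketch that always passes extracts back to the parent. `thresholdBytes` is deliberately left out, since the table above says that setting it alongside `PASSBACK_ALL` or `EMIT_ALL` is a config error.

```json
{
  "pipes": {
    "emitStrategy": {
      "type": "PASSBACK_ALL"
    }
  }
}
```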

Distributed Config Store

For multi-host pipelines (e.g., shared-server clusters) you can store fetcher/emitter configuration in a distributed backend instead of memory. Most users should leave the defaults.

Field	Default	Description
`configStoreType`	`"memory"`	Backend for storing fetcher/emitter configurations. `"memory"` (default) is in-process; `"ignite"` uses Apache Ignite for shared state across nodes.
`configStoreParams`	`"{}"`	JSON object (as a string) with backend-specific parameters. Structure depends on `configStoreType`.
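
A minimal sketch of switching the backend to Ignite follows. It assumes these two fields sit in the `pipes` section like the other fields on this page. The backend-specific parameters are not documented here, so `configStoreParams` is left at its empty default; note that it is a JSON object encoded as a string, not a nested object.

```json
{
  "pipes": {
    "configStoreType": "ignite",
    "configStoreParams": "{}"
  }
}
```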

Shared Server Mode (Experimental)

Field	Default	Description
`useSharedServer`	`false`	When `true`, multiple clients share a single forked JVM instead of each having its own. Reduces memory overhead but sacrifices isolation — one crash affects all in-flight requests. Not recommended for production.

See Shared Server Mode for details.

Complete examples

The following are complete end-to-end configs taken from the test tree. Each is loaded by an automated test, so the syntax stays current.

Filesystem-to-filesystem pipeline

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}

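The tokens FETCHER_BASE_PATH, EMITTER_BASE_PATH, PLUGINS_PATHS, and EMIT_INTERMEDIATE_RESULTS are substituted by the test harness; replace them with real values in production configs. The first three are paths. EMIT_INTERMEDIATE_RESULTS stands in for the boolean emitIntermediateResults flag, so replace it with true or false. The config template at the end of this page uses the token PLUGIN_ROOTS, rather than PLUGINS_PATHS, for plugin-roots.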
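
As a sketch of what the substitution might produce, the excerpt below replaces each token with a concrete value. The paths `/data/input`, `/data/output`, and `/opt/tika/plugins` are hypothetical placeholders, and `false` is simply the documented default for `emitIntermediateResults`; this is an excerpt rather than a complete config.

```json
{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "/data/output",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes": {
    "numClients": 4,
    "emitIntermediateResults": false,
    "forkedJvmArgs": ["-Xmx512m"]
  },
  "plugin-roots": "/opt/tika/plugins"
}
```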

Emit-all variant

{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes": {
    "numClients": 1,
    "forkedJvmArgs": [
      "-Xmx256m"
    ],
    "emitStrategy": {
      "type": "EMIT_ALL"
    }
  },
  "parse-context": {
    "timeout-limits": {
      "progressTimeoutMillis": 60000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}

Shared-server (YOLO) mode

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "OVERWRITE"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "useSharedServer": true,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}

See Shared Server Mode for the trade-offs.

Tika Pipes config template

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "parsers": [
    {
      "default-parser": {}
    },
    {
      "pdf-parser": {
        "extractActions": true,
        "extractInlineImages": true,
        "extractIncrementalUpdateInfo": true,
        "parseIncrementalUpdates": true
      }
    },
    {
      "ooxml-parser": {
        "includeDeletedContent": true,
        "includeMoveFromContent": true,
        "extractMacros": true
      }
    },
    {
      "office-parser": {
        "extractMacros": true
      }
    }
  ],
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA"
  },
  "plugin-roots": "PLUGIN_ROOTS"
}

For per-plugin pipeline examples (S3, OpenSearch, JDBC, Kafka, etc.), see the relevant page under Plugins.