Getting Started with Tika Pipes
This guide walks through a complete working example: reading files from a directory, parsing them, and writing JSON metadata to an output directory.
Quick Start with tika-app
The simplest way to use Tika Pipes is through tika-app:
java -jar tika-app.jar -i /data/input -o /data/output
This recursively processes all files in /data/input and writes one .json
file per document to /data/output. Each JSON file contains the extracted
metadata and text content.
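To check a finished run, it can help to count what landed in the output directory. The following is a minimal Python sketch, not part of Tika: `summarize_output` is a hypothetical helper, and the sample file name and its contents (a JSON list holding one metadata object with an `X-TIKA:content` key) are assumptions made for the demo, not a guaranteed output layout.

```python
import json
import tempfile
from pathlib import Path


def summarize_output(output_dir):
    """Map each .json file (path relative to output_dir) to its entry count."""
    summary = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: a file holds either one metadata object or a list of them.
        entries = data if isinstance(data, list) else [data]
        summary[str(path.relative_to(output_dir))] = len(entries)
    return summary


# Demo against a throwaway directory containing one made-up output file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.pdf.json"
    sample.write_text(json.dumps([{"X-TIKA:content": "hello"}]), encoding="utf-8")
    summary = summarize_output(tmp)
    print(summary)
```

Point `summarize_output` at your real output directory (for example /data/output) to compare the number of JSON files against the number of input documents.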
Common Options
# Use 4 parallel forked processes
java -jar tika-app.jar -i /data/input -o /data/output -n 4
# Set memory limit per forked process
java -jar tika-app.jar -i /data/input -o /data/output -n 4 -X 512m
# Set parse timeout (milliseconds)
java -jar tika-app.jar -i /data/input -o /data/output -T 120000
# Extract plain text only (no HTML)
java -jar tika-app.jar -i /data/input -o /data/output --handler t
# Recursively unpack all embedded documents
java -jar tika-app.jar -i /data/input -o /data/output -Z
Handler types: t (text), h (html), x (xml), m (markdown), b (body), i (ignore/metadata only).
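When these options are driven from a script, assembling the argument list in one place keeps the flags consistent. The sketch below is a hypothetical Python helper (`build_command` is not part of Tika) that uses only the flags shown above; it assumes the flags can be combined freely, which the examples above show only for -n and -X.

```python
def build_command(input_dir, output_dir, jar="tika-app.jar", num_clients=None,
                  heap=None, timeout_ms=None, handler=None, unpack_embedded=False):
    """Build a tika-app argument list from the options shown above."""
    cmd = ["java", "-jar", jar, "-i", input_dir, "-o", output_dir]
    if num_clients is not None:
        cmd += ["-n", str(num_clients)]   # parallel forked processes
    if heap is not None:
        cmd += ["-X", heap]               # memory limit per forked process
    if timeout_ms is not None:
        cmd += ["-T", str(timeout_ms)]    # parse timeout in milliseconds
    if handler is not None:
        cmd += ["--handler", handler]     # t, h, x, m, b, or i
    if unpack_embedded:
        cmd.append("-Z")                  # recursively unpack embedded documents
    return cmd


command = build_command("/data/input", "/data/output", num_clients=4, heap="512m")
print(" ".join(command))
```

The returned list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems with paths that contain spaces.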
JSON Configuration
For more control, create a JSON config file. Here is a complete filesystem-to-filesystem pipeline:
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
FETCHER_BASE_PATH, EMITTER_BASE_PATH, PLUGINS_PATHS, and EMIT_INTERMEDIATE_RESULTS are placeholders that the integration tests substitute at runtime. In your own config, replace them with real paths or, for EMIT_INTERMEDIATE_RESULTS, with the boolean true or false.
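One way to fill in the placeholders is to substitute them programmatically before writing the config file. The following Python sketch assumes a trimmed-down template that keeps only the placeholder-bearing fields; `fill_placeholders` is a hypothetical helper, and /opt/tika/plugins is an invented example path, not a Tika default.

```python
import json

# Trimmed-down template: only the fields that carry placeholders.
TEMPLATE = """
{
  "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "FETCHER_BASE_PATH"}}},
  "emitters": {"fse": {"file-system-emitter": {"basePath": "EMITTER_BASE_PATH"}}},
  "pipes": {"emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS"},
  "plugin-roots": "PLUGINS_PATHS"
}
"""


def fill_placeholders(node, values):
    """Recursively replace string values that exactly match a placeholder."""
    if isinstance(node, dict):
        return {key: fill_placeholders(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(item, values) for item in node]
    if isinstance(node, str) and node in values:
        return values[node]
    return node


values = {
    "FETCHER_BASE_PATH": "/data/input",
    "EMITTER_BASE_PATH": "/data/output",
    "PLUGINS_PATHS": "/opt/tika/plugins",   # hypothetical path
    "EMIT_INTERMEDIATE_RESULTS": False,      # becomes a JSON boolean
}
config = fill_placeholders(json.loads(TEMPLATE), values)
print(json.dumps(config, indent=2))
```

Replacing values after parsing, instead of doing a text search-and-replace, lets EMIT_INTERMEDIATE_RESULTS become a real boolean and not the quoted string "false".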
Run it with:
java -jar tika-app.jar --config=tika-config.json -i /data/input -o /data/output
The -i and -o flags override the basePath values in the config when used
with tika-app. The config file is useful for setting other options like extractFileSystemMetadata,
onExists, and prettyPrint.
How It Works
A Tika Pipes pipeline has four components:
- Pipes Iterator: enumerates the documents to process (e.g., walk a directory, list an S3 bucket, query a database)
- Fetcher: retrieves each document's bytes (e.g., read from filesystem, download from S3)
- Parsers: extract text and metadata (they run in a forked JVM for robustness)
- Emitter: writes the results (e.g., JSON to filesystem, index to Elasticsearch)
Iterator --> Fetcher --> [forked JVM: Parse] --> Emitter
Each parse runs in an isolated forked process with configurable timeouts and memory limits. If a parse hangs or crashes, only that forked process is affected — the pipeline continues with the remaining documents.
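The isolation idea can be illustrated without any Tika code. The Python sketch below is a conceptual stand-in, not how Tika is implemented: each "parse" is a tiny child Python process, one of which hangs and one of which crashes, and the names `JOBS` and `parse_isolated` are hypothetical. The parent enforces a timeout per child and moves on to the next document.

```python
import subprocess
import sys

# Stand-in "parsers": small programs run in a child process. Not Tika code.
JOBS = {
    "ok.txt": "print('parsed ok.txt')",
    "hang.bin": "import time; time.sleep(60)",
    "crash.bin": "import sys; sys.exit(3)",
}


def parse_isolated(program, timeout_s):
    """Run one parse in its own process and return (status, output)."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return ("timeout", "")
    if done.returncode != 0:
        return ("crash", "")
    return ("ok", done.stdout.strip())


# A hung or crashed child affects only its own document.
results = {doc: parse_isolated(prog, timeout_s=3.0) for doc, prog in JOBS.items()}
for doc, (status, output) in results.items():
    print(doc, status, output)
```

The parent process never runs parser code itself, so a hang or crash costs one document and one child process, which mirrors the behavior described above.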
Pipeline Configuration Options
The pipes section controls the pipeline behavior:
| Field | Default | Description |
|---|---|---|
| | | Number of parallel forked parse processes |
| | | Output mode: |
| | | Maximum time (ms) for a single parse operation |
| | | Restart forked processes after this many files (prevents memory leaks) |
| | | What to do on parse failure: |
See Parse Modes and Timeouts for details.