Getting Started with Tika Pipes
This guide walks through a complete working example: reading files from a directory, parsing them, and writing JSON metadata to an output directory.
Quick Start with tika-app
The simplest way to use Tika Pipes is through tika-app:
java -jar tika-app.jar -i /data/input -o /data/output
This recursively processes all files in /data/input and writes one .json
file per document to /data/output. Each JSON file contains the extracted
metadata and text content.
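To check what a run produced, the output directory can be summarized with a short script. This is a minimal sketch, not part of Tika: the helper name `summarize_output` is hypothetical, and it assumes that each output file holds either one JSON metadata object or a list of them, with the extracted text stored under the key `X-TIKA:content`. Inspect one of your own output files and adjust the key if it differs.

```python
import json
from pathlib import Path

# Assumption: extracted text lives under this key; adjust to match your output.
CONTENT_KEY = "X-TIKA:content"


def summarize_output(output_dir):
    """Return (relative path, metadata object count, text length) per .json file."""
    root = Path(output_dir)
    rows = []
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Accept either a single metadata object or a list of them.
        entries = data if isinstance(data, list) else [data]
        # A missing or null content value counts as zero characters.
        text_len = sum(len(entry.get(CONTENT_KEY) or "") for entry in entries)
        rows.append((path.relative_to(root).as_posix(), len(entries), text_len))
    return rows
```

Calling `summarize_output("/data/output")` after a run returns one tuple per output file, which is usually enough to spot documents that produced no text.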
Common Options
# Use 4 parallel forked processes
java -jar tika-app.jar -i /data/input -o /data/output -n 4
# Set memory limit per forked process
java -jar tika-app.jar -i /data/input -o /data/output -n 4 -X 512m
# Set parse timeout (milliseconds)
java -jar tika-app.jar -i /data/input -o /data/output -T 120000
# Extract plain text only (no HTML)
java -jar tika-app.jar -i /data/input -o /data/output --handler t
# Recursively unpack all embedded documents
java -jar tika-app.jar -i /data/input -o /data/output -Z
Handler types: t (text), h (html), x (xml), m (markdown), b (body), i (ignore/metadata only).
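When tika-app is launched from a script, the options above can be assembled into an argument list. The sketch below uses only the flags and handler codes listed in this section; the function name `build_tika_command` and its parameter names are hypothetical, and it assumes the flags can be freely combined on one command line.

```python
# Handler codes as listed above.
HANDLER_TYPES = {
    "t": "text",
    "h": "html",
    "x": "xml",
    "m": "markdown",
    "b": "body",
    "i": "ignore/metadata only",
}


def build_tika_command(input_dir, output_dir, processes=None, heap=None,
                       timeout_ms=None, handler=None, unpack_embedded=False,
                       jar="tika-app.jar"):
    """Build an argv list for tika-app from the options shown in this guide."""
    cmd = ["java", "-jar", jar, "-i", str(input_dir), "-o", str(output_dir)]
    if processes is not None:
        cmd += ["-n", str(processes)]      # parallel forked processes
    if heap is not None:
        cmd += ["-X", heap]                # memory limit per forked process
    if timeout_ms is not None:
        cmd += ["-T", str(timeout_ms)]     # parse timeout in milliseconds
    if handler is not None:
        if handler not in HANDLER_TYPES:
            raise ValueError(f"unknown handler type: {handler!r}")
        cmd += ["--handler", handler]
    if unpack_embedded:
        cmd.append("-Z")                   # recursively unpack embedded documents
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list instead of a shell string avoids quoting problems with paths that contain spaces.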
JSON Configuration
For more control, create a JSON config file. Here is a complete filesystem-to-filesystem pipeline:
{
  "fetchers": [
    {
      "file-system-fetcher": {
        "id": "input-fetcher",
        "basePath": "/data/input",
        "extractFileSystemMetadata": true
      }
    }
  ],
  "emitters": [
    {
      "file-system-emitter": {
        "id": "output-emitter",
        "basePath": "/data/output",
        "fileExtension": "json",
        "onExists": "SKIP",
        "prettyPrint": false
      }
    }
  ],
  "parsers": [
    {
      "default-parser": {}
    }
  ]
}
Save the config as tika-config.json, then run it with:
java -jar tika-app.jar --config tika-config.json -i /data/input -o /data/output
When used with tika-app, the -i and -o flags override the basePath values in the config. The config file is still useful for setting other options such as extractFileSystemMetadata, onExists, and prettyPrint.
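If the config is generated by a script (for example, one file per input directory), it can be built as a plain dictionary and serialized. The sketch below mirrors the filesystem-to-filesystem example above field for field; the helper names `build_config` and `write_config` are hypothetical, and no keys beyond those shown in the example are assumed.

```python
import json


def build_config(input_dir, output_dir, pretty_print=False):
    """Return a config dict with the same structure as the example above."""
    return {
        "fetchers": [
            {
                "file-system-fetcher": {
                    "id": "input-fetcher",
                    "basePath": input_dir,
                    "extractFileSystemMetadata": True,
                }
            }
        ],
        "emitters": [
            {
                "file-system-emitter": {
                    "id": "output-emitter",
                    "basePath": output_dir,
                    "fileExtension": "json",
                    "onExists": "SKIP",
                    "prettyPrint": pretty_print,
                }
            }
        ],
        "parsers": [{"default-parser": {}}],
    }


def write_config(path, config):
    """Serialize the config as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
        fh.write("\n")
```

For example, `write_config("tika-config.json", build_config("/data/input", "/data/output"))` produces a file equivalent to the one shown above.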
How It Works
A Tika Pipes pipeline has four components:
- Pipes Iterator: enumerates the documents to process (e.g., walk a directory, list an S3 bucket, query a database)
- Fetcher: retrieves each document’s bytes (e.g., read from filesystem, download from S3)
- Parsers: extract text and metadata (run in a forked JVM for robustness)
- Emitter: writes the results (e.g., JSON to filesystem, index to Elasticsearch)
Iterator --> Fetcher --> [forked JVM: Parse] --> Emitter
Each parse runs in an isolated forked process with configurable timeouts and memory limits. If a parse hangs or crashes, only that forked process is affected — the pipeline continues with the remaining documents.
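The isolation idea can be illustrated outside of Tika with a toy pipeline. The sketch below is not Tika's implementation: the "parser" is a stand-in Python child process, the names `parse_isolated` and `run_pipeline` are hypothetical, and it starts one child per document for simplicity, whereas the forked processes described here are configurable and long-lived. It only shows why a hang or crash in one parse leaves the remaining documents unaffected.

```python
import subprocess
import sys

# Stand-in "parser" run in a child process. Hypothetical, not Tika code:
# it hangs on "hang", exits non-zero on "crash", and uppercases anything else.
CHILD = r"""
import sys, time
doc = sys.argv[1]
if doc == "hang":
    time.sleep(60)
if doc == "crash":
    sys.exit(3)
print(doc.upper())
"""


def parse_isolated(doc, timeout_s):
    """Parse one document in its own process; return (status, result)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD, doc],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return ("timeout", None)
    if proc.returncode != 0:
        return ("crash", None)
    return ("ok", proc.stdout.strip())


def run_pipeline(docs, timeout_s=5.0):
    """Iterate over documents, parse each in isolation, and collect results."""
    results = []
    for doc in docs:
        status, result = parse_isolated(doc, timeout_s)
        results.append((doc, status, result))
    return results
```

Running `run_pipeline(["a", "crash", "hang", "b"])` records a crash and a timeout for the two bad inputs and still processes `"b"`, because the failure is confined to the child process that handled it.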
Pipeline Configuration Options
The pipes section controls the pipeline behavior:
| Field | Default | Description |
|---|---|---|
| | | Number of parallel forked parse processes |
| | | Output mode: |
| | | Maximum time (ms) for a single parse operation |
| | | Restart forked processes after this many files (limits the impact of memory leaks) |
| | | What to do on parse failure: |
See Parse Modes and Timeouts for details.