Tika Pipes
This section covers Tika Pipes for scalable, fault-tolerant document processing.
Overview
Tika Pipes provides a framework for fault-tolerant, scalable document processing. Each document is parsed in a forked JVM with configurable timeouts and memory limits, so a single malformed file cannot crash or hang your application.
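The isolation idea described above can be illustrated with a small sketch. This is not Tika's implementation: it uses a child Python process instead of a forked JVM, the helper name `run_isolated` is hypothetical, and it covers only the timeout and crash cases, not memory limits.

```python
import subprocess
import sys


def run_isolated(code: str, timeout_s: float) -> str:
    """Run `code` in a child Python process; return 'ok', 'crash', or 'timeout'."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # task cannot block the parent beyond the timeout.
        return "timeout"
    # A non-zero exit code means the child died or failed; the parent survives.
    return "ok" if proc.returncode == 0 else "crash"


if __name__ == "__main__":
    print(run_isolated("print('parsed')", 10.0))
    print(run_isolated("import os; os._exit(3)", 10.0))
    print(run_isolated("import time; time.sleep(30)", 1.0))
```

Because the work runs in a separate process, the parent only ever sees an outcome (success, abnormal exit, or timeout) and can move on to the next document.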
While Tika Pipes has a programmatic Java API, it is best used through:
- tika-app: batch processing from the command line
- tika-server: REST API with pipes-based robustness built in
- tika-grpc: gRPC API with pipes-based robustness built in
See Robustness for details on how Tika Pipes protects against problematic files.
Topics
- Getting Started: complete working example with tika-app
- Fetchers: all available document sources (filesystem, S3, HTTP, GCS, Azure, etc.)
- Emitters: all available output destinations (filesystem, ES, OpenSearch, Solr, S3, Kafka, etc.)
- Iterators: document enumeration (directory walk, S3 listing, CSV, JDBC, Kafka, etc.)
- Reporters: track per-document processing status
- Pipeline Configuration: numClients, timeouts, JVM args, parse modes, emit batching
- Parse Modes: control how documents are parsed and emitted (RMETA, CONCATENATE, CONTENT_ONLY, NO_PARSE, UNPACK)
- Extracting Embedded Bytes: extract raw bytes from embedded documents
- Timeouts: two-tier timeout system for handling long-running and hung parsers
Emitters
ES Emitter (es-emitter)
The ES emitter sends parsed documents to any ES-compatible REST API (ES 7+/8+) via the _bulk endpoint. It uses plain HTTP (Apache HttpClient), so there is no dependency on the ES Java client, which carries a non-ASL license.

    "emitters": {
      "my-es": {
        "es-emitter": {
          "esUrl": "https://localhost:9200/my-index",
          "idField": "_id",
          "attachmentStrategy": "SEPARATE_DOCUMENTS",
          "updateStrategy": "UPSERT",
          "embeddedFileFieldName": "embedded",
          "apiKey": "<base64-encoded id:api_key>"
        }
      }
    }

| Field | Default | Description |
|---|---|---|
| esUrl | required | Full URL including the index name, e.g. |
| idField | | Metadata field used as the document |
| attachmentStrategy | | How embedded documents are stored. |
| updateStrategy | | |
| embeddedFileFieldName | | Name of the join-field relation used in |
| apiKey | none | Base64-encoded |
| | none | Optional block for |

By default (
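To make the esUrl and apiKey settings more concrete, here is a minimal sketch of what a bulk upsert request against an ES-compatible API can look like. It is an illustration of the public _bulk NDJSON format and API-key header, not the emitter's actual code; the function names are hypothetical, and the assumption that an upsert maps to an `update` action with `doc_as_upsert` is mine, not confirmed by the text above.

```python
import base64
import json


def encode_api_key(key_id: str, api_key: str) -> str:
    """Return base64("id:api_key"), the shape the apiKey field expects."""
    return base64.b64encode(f"{key_id}:{api_key}".encode("utf-8")).decode("ascii")


def bulk_upsert_body(docs: dict) -> str:
    """Build an NDJSON _bulk body that upserts each document by id."""
    lines = []
    for doc_id, fields in docs.items():
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_id": doc_id}}))
        # Payload line: create the document if it does not exist yet.
        lines.append(json.dumps({"doc": fields, "doc_as_upsert": True}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def bulk_request(es_url: str, encoded_key: str, docs: dict) -> dict:
    """Describe the HTTP request without sending it."""
    return {
        "method": "POST",
        "url": es_url.rstrip("/") + "/_bulk",
        "headers": {
            "Content-Type": "application/x-ndjson",
            "Authorization": f"ApiKey {encoded_key}",
        },
        "body": bulk_upsert_body(docs),
    }
```

The request is returned as a plain dictionary so it can be inspected or handed to any HTTP client; nothing here depends on an ES client library, which mirrors the plain-HTTP approach described for the emitter.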
ES Pipes Reporter (es-pipes-reporter)
The ES reporter writes per-document parse status back into the same index, so you can query the processing outcome alongside the extracted content.

    "pipes-reporters": {
      "es-pipes-reporter": {
        "esUrl": "https://localhost:9200/my-index",
        "keyPrefix": "tika_",
        "includeRouting": false
      }
    }

The reporter adds <keyPrefix>parse_status, <keyPrefix>parse_time_ms,
and (when the forked JVM exits abnormally) <keyPrefix>exit_value fields
to each document via an upsert.
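As a rough illustration of the field naming, the sketch below builds the status fields for one document from a key prefix. The function name `status_fields` and the status strings used in the test values are placeholders, not values confirmed by the reporter; only the three field suffixes come from the description above.

```python
from typing import Optional


def status_fields(
    key_prefix: str,
    parse_status: str,
    parse_time_ms: int,
    exit_value: Optional[int] = None,
) -> dict:
    """Return the prefixed status fields for a single document."""
    fields = {
        f"{key_prefix}parse_status": parse_status,
        f"{key_prefix}parse_time_ms": parse_time_ms,
    }
    # The exit value is only present when the forked JVM exited abnormally.
    if exit_value is not None:
        fields[f"{key_prefix}exit_value"] = exit_value
    return fields
```

With the `tika_` prefix from the configuration above, a document would carry fields such as `tika_parse_status` and `tika_parse_time_ms`, which keeps them from colliding with extracted metadata in the same index.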
Advanced Topics
- Shared Server Mode: experimental mode for reduced memory usage