Tika Pipes
This section covers Tika Pipes for scalable, fault-tolerant document processing.
Overview
Tika Pipes provides a framework for processing large volumes of documents with:
- Fetchers - Retrieve documents from various sources (filesystem, S3, HTTP, etc.)
- Emitters - Send parsed results to various destinations (filesystem, OpenSearch, ES-compatible, Solr, etc.)
- Pipelines - Configure processing workflows
Topics
- Parse Modes - Control how documents are parsed and emitted (RMETA, CONCATENATE, CONTENT_ONLY, UNPACK)
- Extracting Embedded Bytes - Extract raw bytes from embedded documents using ParseMode.UNPACK
- Timeouts - Two-tier timeout system for handling long-running and hung parsers
Emitters
ES Emitter (es-emitter)
The ES emitter sends parsed documents to any ES-compatible REST API (ES 7+/8+) via
the _bulk endpoint. It uses plain HTTP (Apache HttpClient), so there is no dependency
on the ES Java client, which carries a non-ASL license.
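
To make the _bulk wire format concrete, here is a minimal Python sketch of the newline-delimited JSON body that such an endpoint accepts. It illustrates the REST format only and is not the emitter's actual implementation; the function name `build_bulk_body` and the sample documents are hypothetical.

```python
import json


def build_bulk_body(docs, id_field="_id"):
    """Serialize documents into the NDJSON body expected by a _bulk endpoint.

    Each document becomes two lines: an action line and a source line.
    The index is left out of the action line on the assumption that it is
    part of the request URL (for example .../my-index/_bulk).
    """
    lines = []
    for doc in docs:
        source = dict(doc)             # copy so the caller's dict is not modified
        doc_id = source.pop(id_field)  # the id belongs in the action line, not the source
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical parsed documents, keyed by the configured id field.
body = build_bulk_body([
    {"_id": "a.pdf", "content": "hello"},
    {"_id": "b.pdf", "content": "world"},
])
```

A body like this would be sent with `Content-Type: application/x-ndjson` in a single POST, which is why a plain HTTP client is enough.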
"emitters": {
  "my-es": {
    "es-emitter": {
      "esUrl": "https://localhost:9200/my-index",
      "idField": "_id",
      "attachmentStrategy": "SEPARATE_DOCUMENTS",
      "updateStrategy": "UPSERT",
      "embeddedFileFieldName": "embedded",
      "apiKey": "<base64-encoded id:api_key>"
    }
  }
}
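
The `apiKey` placeholder above expects the key id and secret joined by a colon and Base64-encoded. The sketch below shows one way to produce that value in Python. The helper names and the sample credentials are hypothetical, and the `Authorization: ApiKey ...` header is the form ES-compatible REST APIs generally accept; how the emitter attaches it internally is an assumption here.

```python
import base64


def encode_api_key(key_id, api_key):
    """Return the Base64 encoding of 'id:api_key' as an ASCII string."""
    raw = f"{key_id}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def auth_header(encoded_key):
    """Build the HTTP header that carries an already-encoded API key."""
    return {"Authorization": f"ApiKey {encoded_key}"}


# Hypothetical credentials, for illustration only.
encoded = encode_api_key("my-key-id", "my-key-secret")
header = auth_header(encoded)
```

The resulting `encoded` string is what would go into the `apiKey` field of the configuration.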
| Field | Default | Description |
|---|---|---|
| esUrl | required | Full URL including the index name, e.g. |
| idField | | Metadata field used as the document |
| attachmentStrategy | | How embedded documents are stored. |
| updateStrategy | | |
| embeddedFileFieldName | | Name of the join-field relation used in |
| apiKey | none | Base64-encoded |
| | none | Optional block for |
By default ( |
ES Pipes Reporter (es-pipes-reporter)
The ES reporter writes per-document parse status back into the same index, so you can query the processing outcome alongside the extracted content.
"pipes-reporters": {
  "es-pipes-reporter": {
    "esUrl": "https://localhost:9200/my-index",
    "keyPrefix": "tika_",
    "includeRouting": false
  }
}
The reporter adds <keyPrefix>parse_status, <keyPrefix>parse_time_ms,
and (when the forked JVM exits abnormally) <keyPrefix>exit_value fields
to each document via an upsert.
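
As a rough illustration of the fields described above, the following Python sketch builds an upsert-style update body with the prefixed status fields. It is an assumption about the general shape of such a request, not the reporter's actual code; the function name `build_status_update` and the status value `PARSE_SUCCESS` are hypothetical.

```python
def build_status_update(status, parse_time_ms, exit_value=None, key_prefix="tika_"):
    """Build an update body that adds parse-status fields to a document.

    'doc_as_upsert' asks the server to create the document from 'doc'
    if it does not exist yet, and to merge the fields into it otherwise.
    """
    doc = {
        f"{key_prefix}parse_status": status,
        f"{key_prefix}parse_time_ms": parse_time_ms,
    }
    # exit_value is only present when the forked JVM exited abnormally.
    if exit_value is not None:
        doc[f"{key_prefix}exit_value"] = exit_value
    return {"doc": doc, "doc_as_upsert": True}


# Hypothetical values: a normal parse and an abnormal JVM exit.
ok_update = build_status_update("PARSE_SUCCESS", 120)
crash_update = build_status_update("UNSPECIFIED_CRASH", 30000, exit_value=137)
```

Because the status fields share the `keyPrefix`, they stay easy to distinguish from the extracted metadata when querying the index.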
Advanced Topics
- Shared Server Mode - Experimental mode for reduced memory usage