Elasticsearch Plugin
The Elasticsearch plugin (`tika-pipes-es`) provides an emitter, which writes parsed documents to an Elasticsearch index, and a reporter, which writes per-document processing status to Elasticsearch.
It mirrors the OpenSearch plugin in structure. The field names differ (`esUrl` instead of `openSearchUrl`), and ES adds an `apiKey` field for ApiKey-based auth in addition to basic auth.
| Interface | Component name | Class |
|---|---|---|
| Emitter | `es-emitter` | |
| Reporter | `es-pipes-reporter` | |
Authentication
Two auth modes are supported, in this priority order:
- ApiKey: set the top-level `apiKey` field to the Base64-encoded `id:api_key` string Elasticsearch generates. Sent as `Authorization: ApiKey <value>`.
- Basic: leave `apiKey` null/empty and set `userName` + `password` inside `httpClientConfig`. Sent as `Authorization: Basic …`.
The emitter overrides toString() to redact the apiKey value, so it does not leak into logs.
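As a rough illustration of the two modes above, the following Python sketch builds the `Authorization` header value. The function names `encode_api_key` and `build_auth_header` are hypothetical and not part of the plugin. The sketch only mirrors the priority order described here (ApiKey first, then Basic) and assumes the credentials are encoded as UTF-8 before Base64 encoding.

```python
import base64


def encode_api_key(key_id, api_key):
    """Base64-encode the `id:api_key` pair (hypothetical helper)."""
    raw = f"{key_id}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def build_auth_header(api_key=None, user_name=None, password=None):
    """Return the Authorization header value, or None if no credentials are set.

    Mirrors the documented priority: a non-empty apiKey wins over basic auth.
    """
    if api_key:
        return f"ApiKey {api_key}"
    if user_name is not None and password is not None:
        raw = f"{user_name}:{password}".encode("utf-8")
        return "Basic " + base64.b64encode(raw).decode("ascii")
    return None
```

When both an `apiKey` and a user name/password pair are passed, the sketch returns the `ApiKey` form, matching the priority order listed above.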
Shared HTTP Client Settings
Both the emitter and the reporter accept a nested httpClientConfig block with these fields:
| Field | Default | Description |
|---|---|---|
| `userName`, `password` | optional | Basic-auth credentials. Used only when `apiKey` is null or empty. |
| | optional | Set to … |
| `connectionTimeoutMillis` | no default | HTTP connect timeout, in milliseconds. |
| `socketTimeoutMillis` | no default | HTTP socket read timeout, in milliseconds. |
| | optional | Optional outbound HTTP proxy. |
Elasticsearch Emitter (es-emitter)
Writes parsed documents to an Elasticsearch index.
```json
{
  "emitters": {
    "ese": {
      "es-emitter": {
        "esUrl": "https://es.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "apiKey": "REDACTED_BASE64_ID_AND_KEY",
        "httpClientConfig": {
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  }
}
```
Configuration
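| Field | Default | Description |
|---|---|---|
| `esUrl` | required | Full URL of the target Elasticsearch index, e.g., `https://es.example.com:9200/tika-docs`. |
| `idField` | required | Field in the emitted JSON document that holds the Elasticsearch document ID. |
| `attachmentStrategy` | no default | How attached/embedded documents are indexed. One of: … (the example above uses `PARENT_CHILD`). |
| `updateStrategy` | no default | How existing documents are handled. One of: … (the example above uses `OVERWRITE`). |
| `commitWithin` | no default | Kept for API parity with the OpenSearch emitter. ES does not consume this value. |
| `embeddedFileFieldName` | no default | Name of the field used to hold embedded-file content (used by …). |
| `apiKey` | optional | Base64-encoded `id:api_key` string. See Authentication. |
| `httpClientConfig` | optional | See Shared HTTP Client Settings. |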
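Because `esUrl` carries both the cluster address and the index name, a quick pre-flight check can catch a URL that is missing its index segment. The sketch below is one way to do that in Python. The helper names are hypothetical, the required fields are taken from the example config (`esUrl` and `idField`), and how the plugin itself parses the URL is not covered here.

```python
from urllib.parse import urlsplit

# Assumption: these are the two required emitter fields, as in the example config.
REQUIRED_EMITTER_FIELDS = ("esUrl", "idField")


def split_es_url(es_url):
    """Split a full index URL into (cluster base URL, index name)."""
    parts = urlsplit(es_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"esUrl must be an absolute URL: {es_url!r}")
    index = parts.path.strip("/")
    if not index or "/" in index:
        raise ValueError(f"esUrl must end in a single index name: {es_url!r}")
    return f"{parts.scheme}://{parts.netloc}", index


def missing_emitter_fields(emitter_settings):
    """Return the required field names that are absent or empty."""
    return [name for name in REQUIRED_EMITTER_FIELDS if not emitter_settings.get(name)]
```

For the example above, `split_es_url` separates `https://es.example.com:9200` from the index name `tika-docs`.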
Elasticsearch Reporter (es-pipes-reporter)
Writes per-document processing status records to an Elasticsearch index. Useful for building dashboards over pipeline activity.
```json
{
  "pipes-reporters": {
    "es-pipes-reporter": {
      "esUrl": "https://es.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "apiKey": "REDACTED_BASE64_ID_AND_KEY",
      "httpClientConfig": {
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  }
}
```
pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.
Configuration
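| Field | Default | Description |
|---|---|---|
| `esUrl` | required | Full URL of the status index, e.g., `https://es.example.com:9200/tika-status`. |
| `includes` | optional | Set of … |
| | optional | Set of … |
| `keyPrefix` | optional | Prefix prepended to status field names in the emitted documents. |
| `includeRouting` | | If … |
| `apiKey` | optional | Base64-encoded `id:api_key` string. See Authentication. |
| `httpClientConfig` | optional | See Shared HTTP Client Settings. |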
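To make the effect of `includes` and `keyPrefix` from the example config more concrete, here is a minimal Python sketch of how a status record might be filtered and prefixed. It rests on two loudly stated assumptions: that `includes` acts as an allow-list, with an empty set meaning "report everything", and that the status is stored under a field called `parse_status`. Neither is confirmed by this page, and all function names are hypothetical.

```python
def should_report(status, includes):
    """Assumed semantics: an empty or missing `includes` set reports every status."""
    return not includes or status in includes


def prefix_keys(fields, key_prefix):
    """Prepend key_prefix to every field name in the status record."""
    return {f"{key_prefix}{name}": value for name, value in fields.items()}


def build_status_doc(status, includes, key_prefix=""):
    """Return the status document to index, or None if the status is filtered out."""
    if not should_report(status, includes):
        return None
    # "parse_status" is a hypothetical field name used only for illustration.
    return prefix_keys({"parse_status": status}, key_prefix)
```

With `keyPrefix` set to `tika_`, the hypothetical `parse_status` field would be written as `tika_parse_status`.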
Complete Pipeline Example
The example below combines a filesystem iterator/fetcher with the Elasticsearch emitter and reporter — a common pattern for ingesting a directory of documents into ES.
```json
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "ese": {
      "es-emitter": {
        "esUrl": "https://es.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "apiKey": "REDACTED_BASE64_ID_AND_KEY",
        "httpClientConfig": {
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "ese"
    }
  },
  "pipes-reporters": {
    "es-pipes-reporter": {
      "esUrl": "https://es.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "apiKey": "REDACTED_BASE64_ID_AND_KEY",
      "httpClientConfig": {
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
```
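In the pipeline above, the iterator's `fetcherId` (`fsf`) and `emitterId` (`ese`) must match the keys under `fetchers` and `emitters`. A typo in either ID is easy to miss, so a small pre-flight check can help. The Python sketch below is a hypothetical helper, not something shipped with Tika, and whether Tika performs an equivalent check at startup is not stated here.

```python
def check_pipeline_ids(config):
    """Return a list of problems with fetcherId/emitterId cross-references."""
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    # "pipes-iterator" maps an iterator type name to its settings block.
    for iterator_type, settings in config.get("pipes-iterator", {}).items():
        fetcher_id = settings.get("fetcherId")
        emitter_id = settings.get("emitterId")
        if fetcher_id not in fetchers:
            problems.append(f"{iterator_type}: unknown fetcherId {fetcher_id!r}")
        if emitter_id not in emitters:
            problems.append(f"{iterator_type}: unknown emitterId {emitter_id!r}")
    return problems
```

Parsing the config file with `json.load` and passing the result to `check_pipeline_ids` returns an empty list when both IDs resolve.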
Notes
- The ES plugin’s HTTP client is REST-based; it does not depend on the Elasticsearch transport client.
- For OpenSearch deployments, use the parallel OpenSearch plugin instead; the field names differ (`openSearchUrl` vs. `esUrl`).
- Don’t check real credentials into source control; the `apiKey` and `password` values in the examples above are placeholders.
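One way to keep real credentials out of source control is to commit the config with the placeholder and substitute the real key at deploy time. The Python sketch below illustrates the idea under stated assumptions: the environment variable name `ES_API_KEY` and both helper functions are hypothetical, and the sketch replaces every `apiKey` value it finds, so it fits only when the emitter and reporter share one key.

```python
import json
import os


def inject_api_key(node, api_key):
    """Return a copy of the config with every "apiKey" value replaced."""
    if isinstance(node, dict):
        return {
            key: api_key if key == "apiKey" else inject_api_key(value, api_key)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [inject_api_key(item, api_key) for item in node]
    return node


def load_config_with_secret(path, env_var="ES_API_KEY"):
    """Load a JSON config and fill in the API key from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    with open(path, encoding="utf-8") as handle:
        return inject_api_key(json.load(handle), api_key)
```

The result can be written to a temporary file with `json.dump` and passed to Tika, so the committed file never contains the real key.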