OpenSearch Plugin
The OpenSearch plugin (tika-pipes-opensearch) provides an emitter (writes parsed docs to an OpenSearch index) and a reporter (writes per-document processing status to OpenSearch).
| Interface | Component name | Class |
|---|---|---|
| Emitter | `opensearch-emitter` | |
| Reporter | `opensearch-pipes-reporter` | |
Shared HTTP Client Settings
Both the emitter and the reporter accept a nested httpClientConfig block with these fields:
| Field | Default | Description |
|---|---|---|
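| `userName`, `password` | optional | Basic-auth credentials. Omit both for an anonymous client. |
| `authScheme` | optional | Authentication scheme. The examples on this page set it to `basic`. |
| `connectionTimeoutMillis` | no default | HTTP connect timeout, in milliseconds. |
| `socketTimeoutMillis` | no default | HTTP socket read timeout, in milliseconds. |
| | optional | Optional outbound HTTP proxy. |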
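As a minimal sketch of the credential rule above (omit both values for an anonymous client), the helper below builds an httpClientConfig block in Python using only the field names that appear in the examples on this page. The function name `http_client_config` and the "both or neither" check are illustrative assumptions, not part of the plugin; how the plugin itself treats a half-specified pair is not stated here.

```python
import json
from typing import Optional


def http_client_config(
    connection_timeout_millis: int,
    socket_timeout_millis: int,
    user_name: Optional[str] = None,
    password: Optional[str] = None,
) -> dict:
    """Build an httpClientConfig dict (hypothetical helper, not a plugin API)."""
    # Assumption: supplying only one of the two credentials is a mistake,
    # so the helper rejects it instead of guessing what was intended.
    if (user_name is None) != (password is None):
        raise ValueError("set both userName and password, or neither")

    config = {
        "connectionTimeoutMillis": connection_timeout_millis,
        "socketTimeoutMillis": socket_timeout_millis,
    }
    if user_name is not None:
        # "basic" mirrors the authScheme value used in the examples on this page.
        config.update(
            {"userName": user_name, "password": password, "authScheme": "basic"}
        )
    return config


# Anonymous client: no credentials at all.
anonymous = http_client_config(10000, 60000)
# Authenticated client: both credentials supplied together.
authenticated = http_client_config(10000, 60000, "admin", "REDACTED")

print(json.dumps(anonymous, indent=2))
```

The resulting dict can be dropped into the `httpClientConfig` key of either the emitter or the reporter block.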
OpenSearch Emitter (opensearch-emitter)
Writes parsed documents to an OpenSearch index.
{
  "emitters": {
    "ose": {
      "opensearch-emitter": {
        "openSearchUrl": "https://opensearch.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "httpClientConfig": {
          "userName": "admin",
          "password": "REDACTED",
          "authScheme": "basic",
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
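| `openSearchUrl` | required | Full URL of the target OpenSearch index, e.g., `https://opensearch.example.com:9200/tika-docs`. |
| `idField` | required | Field in the emitted JSON document that holds the OpenSearch `_id`. |
| `attachmentStrategy` | no default | How attached/embedded documents are indexed. Takes one of a fixed set of values; the example above uses `PARENT_CHILD`. |
| `updateStrategy` | no default | How existing documents are handled. Takes one of a fixed set of values; the example above uses `OVERWRITE`. |
| `commitWithin` | no default | Maximum delay before the index refresh becomes visible, in milliseconds (passed to OpenSearch). |
| `embeddedFileFieldName` | no default | Name of the field used to hold embedded-file content. |
| `httpClientConfig` | optional | Nested HTTP client settings; see Shared HTTP Client Settings. |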
OpenSearch Reporter (opensearch-pipes-reporter)
Writes per-document processing status records to an OpenSearch index. Useful for building dashboards over pipeline activity.
{
  "pipes-reporters": {
    "opensearch-pipes-reporter": {
      "openSearchUrl": "https://opensearch.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "httpClientConfig": {
        "userName": "admin",
        "password": "REDACTED",
        "authScheme": "basic",
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  }
}
pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.
Configuration
| Field | Default | Description |
|---|---|---|
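| `openSearchUrl` | required | Full URL of the status index, e.g., `https://opensearch.example.com:9200/tika-status`. |
| `includes` | optional | Set of status values to report, e.g., `PARSE_SUCCESS`, `TIMEOUT`. |
| `excludes` | optional | Set of status values to leave out of the report. |
| `keyPrefix` | optional | Prefix prepended to status field names in the emitted documents. |
| `includeRouting` | | If `true`, routing is included with the emitted status documents. |
| `httpClientConfig` | optional | Nested HTTP client settings; see Shared HTTP Client Settings. |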
Complete Pipeline Example
The example below combines a filesystem iterator/fetcher with the OpenSearch emitter and reporter — a common pattern for ingesting a directory of documents into an OpenSearch index.
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "ose": {
      "opensearch-emitter": {
        "openSearchUrl": "https://opensearch.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "httpClientConfig": {
          "userName": "admin",
          "password": "REDACTED",
          "authScheme": "basic",
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "ose"
    }
  },
  "pipes-reporters": {
    "opensearch-pipes-reporter": {
      "openSearchUrl": "https://opensearch.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "httpClientConfig": {
        "userName": "admin",
        "password": "REDACTED",
        "authScheme": "basic",
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
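In the pipeline above, the iterator's fetcherId ("fsf") and emitterId ("ose") match the keys defined under fetchers and emitters. The Python sketch below checks that wiring before a run. It assumes those ids are meant to reference those keys, which is how the example reads but is not spelled out on this page; `check_wiring` is a hypothetical helper, not part of Tika, and the embedded config is a trimmed copy of the example.

```python
import json


def check_wiring(config: dict) -> list:
    """Return a list of problems; an empty list means the ids line up.

    Assumption: fetcherId / emitterId refer to keys under "fetchers" / "emitters".
    """
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    for iterator_type, settings in config.get("pipes-iterator", {}).items():
        fetcher_id = settings.get("fetcherId")
        emitter_id = settings.get("emitterId")
        if fetcher_id not in fetchers:
            problems.append(
                f"{iterator_type}: fetcherId {fetcher_id!r} is not defined under 'fetchers'"
            )
        if emitter_id not in emitters:
            problems.append(
                f"{iterator_type}: emitterId {emitter_id!r} is not defined under 'emitters'"
            )
    return problems


# Trimmed copy of the complete example; only the keys needed for the check.
example = json.loads(
    """
    {
      "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "/data/input"}}},
      "emitters": {"ose": {"opensearch-emitter": {
          "openSearchUrl": "https://opensearch.example.com:9200/tika-docs",
          "idField": "doc_id"}}},
      "pipes-iterator": {"file-system-pipes-iterator": {
          "basePath": "/data/input", "fetcherId": "fsf", "emitterId": "ose"}}
    }
    """
)

for problem in check_wiring(example):
    print(problem)
```

A check like this catches a renamed emitter or fetcher key early, before the pipeline starts processing files.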
Notes
- The OpenSearch plugin’s HTTP client is REST-based; it does not depend on the OpenSearch transport client.
- For Elasticsearch deployments, use the parallel Elasticsearch plugin instead: the field names differ (`esUrl` vs. `openSearchUrl`) and ES adds API-key auth.
- Don’t check real credentials into source control; the `password` values in the examples above are placeholders.
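One way to keep real credentials out of source control is to commit the config with placeholder passwords and fill them in at deploy time. The Python sketch below does that for every httpClientConfig block it finds. The helper name `inject_password` and the environment variable `OPENSEARCH_PASSWORD` are assumptions made for illustration; Tika does not define either.

```python
import os


def inject_password(node, password):
    """Return a copy of the config with each httpClientConfig password replaced.

    Hypothetical helper: walks dicts and lists and leaves the input untouched.
    """
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if (
                key == "httpClientConfig"
                and isinstance(value, dict)
                and "password" in value
            ):
                result[key] = {**value, "password": password}
            else:
                result[key] = inject_password(value, password)
        return result
    if isinstance(node, list):
        return [inject_password(item, password) for item in node]
    return node


# Template as it would be committed: the password is a placeholder.
template = {
    "emitters": {
        "ose": {
            "opensearch-emitter": {
                "openSearchUrl": "https://opensearch.example.com:9200/tika-docs",
                "idField": "doc_id",
                "httpClientConfig": {
                    "userName": "admin",
                    "password": "REDACTED",
                    "authScheme": "basic",
                },
            }
        }
    }
}

# OPENSEARCH_PASSWORD is an assumed variable name; fall back to the placeholder.
rendered = inject_password(template, os.environ.get("OPENSEARCH_PASSWORD", "REDACTED"))
```

The rendered dict can then be serialized with `json.dumps` to a file that stays outside the repository.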