OpenSearch Plugin

The OpenSearch plugin (tika-pipes-opensearch) provides an emitter, which writes parsed documents to an OpenSearch index, and a reporter, which writes per-document processing status to OpenSearch.

| Interface | Component name | Class |
| --- | --- | --- |
| Emitter | opensearch-emitter | OpenSearchEmitter |
| Reporter | opensearch-pipes-reporter | OpenSearchPipesReporter |

Shared HTTP Client Settings

Both the emitter and the reporter accept a nested httpClientConfig block with these fields:

| Field | Default | Description |
| --- | --- | --- |
| userName / password | optional | Basic-auth credentials. Omit both for an anonymous client. |
| authScheme | optional | Set to basic to send credentials preemptively. |
| connectionTimeoutMillis | no default | HTTP connect timeout, in milliseconds. |
| socketTimeoutMillis | no default | HTTP socket read timeout, in milliseconds. |
| proxyHost / proxyPort | optional | Optional outbound HTTP proxy. |
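
As an illustration of the fields above, the following sketch configures an anonymous client (no userName or password) that connects through an outbound proxy. The proxy host and port are placeholder values, and writing proxyPort as a JSON number is an assumption. The block would sit inside an opensearch-emitter or opensearch-pipes-reporter entry.

```json
{
  "httpClientConfig": {
    "connectionTimeoutMillis": 10000,
    "socketTimeoutMillis": 60000,
    "proxyHost": "proxy.example.com",
    "proxyPort": 3128
  }
}
```

Because userName, password, and authScheme are all omitted, no credentials are sent; add them back if the cluster requires basic auth.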

OpenSearch Emitter (opensearch-emitter)

Writes parsed documents to an OpenSearch index.

{
  "emitters": {
    "ose": {
      "opensearch-emitter": {
        "openSearchUrl": "https://opensearch.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "httpClientConfig": {
          "userName": "admin",
          "password": "REDACTED",
          "authScheme": "basic",
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  }
}

Configuration

| Field | Default | Description |
| --- | --- | --- |
| openSearchUrl | required | Full URL of the target OpenSearch index, e.g., https://opensearch.example.com:9200/tika-docs. |
| idField | required | Field in the emitted JSON document that holds the OpenSearch _id. |
| attachmentStrategy | no default | How attached/embedded documents are indexed. One of: SEPARATE_DOCUMENTS (each attachment becomes its own top-level document) or PARENT_CHILD (attachments are nested under the parent in a parent/child relation). |
| updateStrategy | no default | How existing documents are handled. One of: OVERWRITE (replaces the existing document with the same _id) or UPSERT (merges into the existing document). |
| commitWithin | no default | Maximum delay, in milliseconds, before emitted documents become visible to search (passed to OpenSearch’s refresh semantics). |
| embeddedFileFieldName | no default | Name of the field used to hold embedded-file content (used by PARENT_CHILD). |
| httpClientConfig | optional | See Shared HTTP Client Settings. |
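
For contrast with the PARENT_CHILD/OVERWRITE example above, here is a sketch of the other combination the table allows: each attachment indexed as its own document, with updates merged into existing documents. The emitter id "ose" and the URL are reused from the earlier example. Omitting httpClientConfig relies on the table marking it optional. embeddedFileFieldName is kept because the table lists no default for it, even though it is described as used by PARENT_CHILD.

```json
{
  "emitters": {
    "ose": {
      "opensearch-emitter": {
        "openSearchUrl": "https://opensearch.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "SEPARATE_DOCUMENTS",
        "updateStrategy": "UPSERT",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded"
      }
    }
  }
}
```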

OpenSearch Reporter (opensearch-pipes-reporter)

Writes per-document processing status records to an OpenSearch index. Useful for building dashboards over pipeline activity.

{
  "pipes-reporters": {
    "opensearch-pipes-reporter": {
      "openSearchUrl": "https://opensearch.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "httpClientConfig": {
        "userName": "admin",
        "password": "REDACTED",
        "authScheme": "basic",
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  }
}

pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.

Configuration

| Field | Default | Description |
| --- | --- | --- |
| openSearchUrl | required | Full URL of the status index, e.g., https://opensearch.example.com:9200/tika-status. |
| includes | optional | Set of RESULT_STATUS names to include (e.g., PARSE_SUCCESS, PARSE_EXCEPTION). If unset, all are reported. |
| excludes | optional | Set of RESULT_STATUS names to skip. Applied after includes. |
| keyPrefix | optional | Prefix prepended to status field names in the emitted documents. |
| includeRouting | false | If true, include OpenSearch routing info in each status record. |
| httpClientConfig | optional | See Shared HTTP Client Settings. |
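
The includes and excludes fields can also be used on their own. As a sketch, assuming the behavior described in the table (unset includes means every status is a candidate, and excludes is applied afterwards), the following reporter records every status except successful parses, which may suit a failures-only dashboard. includeRouting and httpClientConfig are left out, so the documented default of false and an anonymous client apply.

```json
{
  "pipes-reporters": {
    "opensearch-pipes-reporter": {
      "openSearchUrl": "https://opensearch.example.com:9200/tika-status",
      "excludes": ["PARSE_SUCCESS"],
      "keyPrefix": "tika_"
    }
  }
}
```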

Complete Pipeline Example

The example below combines a filesystem iterator/fetcher with the OpenSearch emitter and reporter — a common pattern for ingesting a directory of documents into an OpenSearch index.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "ose": {
      "opensearch-emitter": {
        "openSearchUrl": "https://opensearch.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "httpClientConfig": {
          "userName": "admin",
          "password": "REDACTED",
          "authScheme": "basic",
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "ose"
    }
  },
  "pipes-reporters": {
    "opensearch-pipes-reporter": {
      "openSearchUrl": "https://opensearch.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "httpClientConfig": {
        "userName": "admin",
        "password": "REDACTED",
        "authScheme": "basic",
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}

Notes

  • The OpenSearch plugin’s HTTP client is REST-based; it does not depend on the OpenSearch transport client.

  • For Elasticsearch deployments, use the separate Elasticsearch plugin instead. Its field names differ (esUrl vs. openSearchUrl), and it adds API-key authentication.

  • Don’t check real credentials into source control — the password values in the examples above are placeholders.