Elasticsearch Plugin

The Elasticsearch plugin (tika-pipes-es) provides an emitter (writes parsed docs to an Elasticsearch index) and a reporter (writes per-document processing status to Elasticsearch).

It mirrors the OpenSearch plugin in structure. The field names differ — esUrl instead of openSearchUrl — and ES adds an apiKey field for ApiKey-based auth in addition to basic auth.

| Interface | Component name | Class |
|---|---|---|
| Emitter | es-emitter | ESEmitter |
| Reporter | es-pipes-reporter | ESPipesReporter |

Authentication

Two auth modes are supported, in this priority order:

  1. ApiKey: set the top-level apiKey field to the Base64-encoded id:api_key string that Elasticsearch generates. Sent as Authorization: ApiKey <value>.

  2. Basic: leave apiKey null or empty and set userName and password inside httpClientConfig. Sent as Authorization: Basic <credentials>.

The emitter overrides toString() to redact the apiKey value, so it does not leak into logs.
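
As a rough illustration of the Basic mode, the fragment below sketches an emitter configured without apiKey, so the credentials in httpClientConfig would be used instead. The user name and password are hypothetical placeholders, and the remaining fields simply repeat the values used in this page's emitter example; treat it as a sketch rather than a tested deployment.

```json
{
  "emitters": {
    "ese": {
      "es-emitter": {
        "esUrl": "https://es.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "httpClientConfig": {
          "userName": "tika-writer",
          "password": "REDACTED_PASSWORD",
          "authScheme": "basic",
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  }
}
```

If apiKey were also set in this block, the priority order above implies the ApiKey header would be sent and the Basic credentials ignored.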

Shared HTTP Client Settings

Both the emitter and the reporter accept a nested httpClientConfig block with these fields:

| Field | Default | Description |
|---|---|---|
| userName / password | optional | Basic-auth credentials. Used only when apiKey is unset. |
| authScheme | optional | Set to basic to send credentials preemptively. |
| connectionTimeoutMillis | no default | HTTP connect timeout, in milliseconds. |
| socketTimeoutMillis | no default | HTTP socket read timeout, in milliseconds. |
| proxyHost / proxyPort | optional | Optional outbound HTTP proxy. |
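
For deployments that reach Elasticsearch through a proxy, an httpClientConfig block might look like the sketch below. The proxy host and port are hypothetical, and writing proxyPort as a JSON number is an assumption not confirmed by this page. The block would be nested inside an es-emitter or es-pipes-reporter entry.

```json
{
  "httpClientConfig": {
    "connectionTimeoutMillis": 10000,
    "socketTimeoutMillis": 60000,
    "proxyHost": "proxy.example.com",
    "proxyPort": 3128
  }
}
```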

Elasticsearch Emitter (es-emitter)

Writes parsed documents to an Elasticsearch index.

{
  "emitters": {
    "ese": {
      "es-emitter": {
        "esUrl": "https://es.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "apiKey": "REDACTED_BASE64_ID_AND_KEY",
        "httpClientConfig": {
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  }
}

Configuration

| Field | Default | Description |
|---|---|---|
| esUrl | required | Full URL of the target Elasticsearch index, e.g., https://es.example.com:9200/tika-docs. |
| idField | required | Field in the emitted JSON document that holds the Elasticsearch _id. |
| attachmentStrategy | no default | How attached/embedded documents are indexed. One of: SEPARATE_DOCUMENTS (each attachment becomes its own top-level document) or PARENT_CHILD (attachments are nested under the parent in a parent/child relation). |
| updateStrategy | no default | How existing documents are handled. One of: OVERWRITE (replaces an existing document at _id) or UPSERT (merges into an existing document). |
| commitWithin | no default | Kept for API parity with the OpenSearch emitter. ES does not consume this value. |
| embeddedFileFieldName | no default | Name of the field used to hold embedded-file content (used by PARENT_CHILD). |
| apiKey | optional | Base64-encoded id:api_key. See Authentication. |
| httpClientConfig | optional | See Shared HTTP Client Settings. |
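
To contrast with the PARENT_CHILD and OVERWRITE settings used in the example above, the following sketch selects the other value of each strategy. It assumes this combination is accepted and that commitWithin and embeddedFileFieldName can be left out when PARENT_CHILD is not in use; neither point is confirmed on this page.

```json
{
  "emitters": {
    "ese": {
      "es-emitter": {
        "esUrl": "https://es.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "SEPARATE_DOCUMENTS",
        "updateStrategy": "UPSERT",
        "apiKey": "REDACTED_BASE64_ID_AND_KEY"
      }
    }
  }
}
```

With these settings each attachment would be indexed as its own top-level document, and a document whose _id already exists would be merged into rather than replaced.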

Elasticsearch Reporter (es-pipes-reporter)

Writes per-document processing status records to an Elasticsearch index. Useful for building dashboards over pipeline activity.

{
  "pipes-reporters": {
    "es-pipes-reporter": {
      "esUrl": "https://es.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "apiKey": "REDACTED_BASE64_ID_AND_KEY",
      "httpClientConfig": {
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  }
}

pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.

Configuration

| Field | Default | Description |
|---|---|---|
| esUrl | required | Full URL of the status index, e.g., https://es.example.com:9200/tika-status. |
| includes | optional | Set of RESULT_STATUS names to include (e.g., PARSE_SUCCESS, PARSE_EXCEPTION). If unset, all are reported. |
| excludes | optional | Set of RESULT_STATUS names to skip. Applied after includes. |
| keyPrefix | optional | Prefix prepended to status field names in the emitted documents. |
| includeRouting | false | If true, include ES routing info in each status record. |
| apiKey | optional | Base64-encoded id:api_key. See Authentication. |
| httpClientConfig | optional | See Shared HTTP Client Settings. |
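
The reporter example above filters with includes. As a complementary sketch, the fragment below leaves includes unset, so all statuses are candidates, and uses excludes to drop successful parses, which would leave only the non-success records in the status index. The choice of PARSE_SUCCESS as the excluded status is illustrative.

```json
{
  "pipes-reporters": {
    "es-pipes-reporter": {
      "esUrl": "https://es.example.com:9200/tika-status",
      "excludes": ["PARSE_SUCCESS"],
      "keyPrefix": "tika_",
      "apiKey": "REDACTED_BASE64_ID_AND_KEY"
    }
  }
}
```

includeRouting is omitted here, so it falls back to its documented default of false.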

Complete Pipeline Example

The example below combines a filesystem iterator/fetcher with the Elasticsearch emitter and reporter — a common pattern for ingesting a directory of documents into ES.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "ese": {
      "es-emitter": {
        "esUrl": "https://es.example.com:9200/tika-docs",
        "idField": "doc_id",
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "OVERWRITE",
        "commitWithin": 1000,
        "embeddedFileFieldName": "embedded",
        "apiKey": "REDACTED_BASE64_ID_AND_KEY",
        "httpClientConfig": {
          "connectionTimeoutMillis": 10000,
          "socketTimeoutMillis": 60000
        }
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "ese"
    }
  },
  "pipes-reporters": {
    "es-pipes-reporter": {
      "esUrl": "https://es.example.com:9200/tika-status",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "keyPrefix": "tika_",
      "includeRouting": true,
      "apiKey": "REDACTED_BASE64_ID_AND_KEY",
      "httpClientConfig": {
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}

Notes

  • The ES plugin’s HTTP client is REST-based; it does not depend on the Elasticsearch transport client.

  • For OpenSearch deployments, use the parallel OpenSearch plugin instead — the field names differ (openSearchUrl vs. esUrl).

  • Don’t check real credentials into source control. Credential values in the examples above, such as apiKey, are placeholders.