Emitters

Table of Contents

The Emitter Contract
Wiring Emitters Into a Pipeline
Available Emitters

An emitter writes parse results to a destination, such as a file on disk, a row in a database, a document in a search index, or a message on a queue.

The Emitter Contract

Each emitter implements Emitter#emit(EmitData emitData), where EmitData carries the emit key, the parsed Metadata, and (for content-emitting strategies) the extracted content.

The emit key is supplied by the iterator on each FetchEmitTuple and tells the emitter where to put the result. Its shape depends on the emitter:

file-system / S3 / GCS / Azure Blob — a key/path relative to basePath or prefix.
OpenSearch / Elasticsearch / Solr — the _id field value, taken from the metadata field named by the emitter’s idField.
JDBC — the value bound to the first ? placeholder in the insert template.
Kafka — the Kafka record key.
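For the file-system case, the mapping from emit key to output location can be sketched as below. This is an illustrative sketch only, not the emitter's actual implementation: the class `EmitKeyPaths`, the method `resolve`, the choice to append the extension as `key + "." + fileExtension`, and the guard against keys that escape `basePath` are all assumptions made for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of the emitter API described above.
class EmitKeyPaths {

    /**
     * Resolves an emit key against a base path and appends a file extension.
     * Assumption: keys that would land outside basePath (for example "../x"
     * or an absolute path) are rejected rather than written.
     */
    static Path resolve(String basePath, String emitKey, String fileExtension) {
        Path base = Paths.get(basePath).toAbsolutePath().normalize();
        Path target = base.resolve(emitKey + "." + fileExtension).normalize();
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("emit key escapes basePath: " + emitKey);
        }
        return target;
    }

    public static void main(String[] args) {
        // Key is relative to basePath, as in the list above.
        System.out.println(resolve("/data/output", "reports/2024/a.pdf", "json"));
    }
}
```

Normalizing before the `startsWith` check is what makes the guard meaningful; comparing un-normalized paths would let `a/../../x` slip through.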

Emitters are intended to be safe under concurrent use; the pipeline’s worker pool may call emit() from many threads.
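To make the concurrency expectation concrete, here is a minimal sketch of an emitter whose `emit` can be called from several worker threads at once. The types `SimpleEmitData`, `SimpleEmitter`, and `InMemoryEmitter` are simplified stand-ins invented for this example; they are not the real `EmitData` and `Emitter` classes, and the in-memory map stands in for whatever destination a real emitter writes to.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for EmitData: emit key, metadata, extracted content.
final class SimpleEmitData {
    final String emitKey;
    final Map<String, String> metadata;
    final String content;

    SimpleEmitData(String emitKey, Map<String, String> metadata, String content) {
        this.emitKey = emitKey;
        this.metadata = metadata;
        this.content = content;
    }
}

// Hypothetical stand-in for the emitter contract.
interface SimpleEmitter {
    void emit(SimpleEmitData emitData);
}

// Keeps results in a ConcurrentHashMap, so concurrent emit() calls need no extra locking.
class InMemoryEmitter implements SimpleEmitter {
    private final Map<String, SimpleEmitData> store = new ConcurrentHashMap<>();

    @Override
    public void emit(SimpleEmitData emitData) {
        store.put(emitData.emitKey, emitData);
    }

    int size() { return store.size(); }

    SimpleEmitData get(String emitKey) { return store.get(emitKey); }
}

class ConcurrentEmitDemo {
    // Emits `docs` results through a fixed pool of `workers` threads and returns the stored count.
    static int emitConcurrently(InMemoryEmitter emitter, int workers, int docs) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < docs; i++) {
            final String key = "doc-" + i;
            pool.submit(() -> emitter.emit(
                    new SimpleEmitData(key, Map.of("title", key), "text of " + key)));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return emitter.size();
    }

    public static void main(String[] args) {
        System.out.println(emitConcurrently(new InMemoryEmitter(), 4, 100));
    }
}
```

An emitter that holds shared mutable state, such as a bulk buffer or a pooled connection, would need the same care: either a thread-safe structure as here, or explicit synchronization around the shared state.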

Wiring Emitters Into a Pipeline

Emitters live under the top-level emitters key. Each emitter gets an ID (the outer map key) and a type-name (the inner map key); the iterator references the ID through its emitterId field.

{
  "emitters": {
    "output": {
      "file-system-emitter": {
        "basePath": "/data/output",
        "fileExtension": "json"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "fetcherId": "...",
      "emitterId": "output"
    }
  }
}

A pipeline may declare multiple emitters and choose between them at iterator-config time. Within a single iterator, each emitted FetchEmitTuple carries exactly one emitter ID.
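The relationship between emitter IDs and emitter instances can be pictured as a lookup table keyed by ID. The sketch below is a conceptual model only: `EmitterRegistry`, `KeyedEmitter`, and the ID `archive` are hypothetical names for this example and do not describe how the pipeline resolves `emitterId` internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "emitter ID -> emitter instance" resolution.
class EmitterRegistry {

    // Simplified stand-in for an emitter; not the real contract.
    interface KeyedEmitter {
        void emit(String emitKey, String content);
    }

    private final Map<String, KeyedEmitter> emittersById = new HashMap<>();

    // IDs play the role of the outer map keys under "emitters", so they must be unique.
    void register(String id, KeyedEmitter emitter) {
        if (emittersById.putIfAbsent(id, emitter) != null) {
            throw new IllegalArgumentException("duplicate emitter ID: " + id);
        }
    }

    // Each tuple names exactly one emitter ID; an unknown ID is treated as a configuration error.
    KeyedEmitter lookup(String emitterId) {
        KeyedEmitter emitter = emittersById.get(emitterId);
        if (emitter == null) {
            throw new IllegalArgumentException("no emitter configured with ID: " + emitterId);
        }
        return emitter;
    }

    public static void main(String[] args) {
        EmitterRegistry registry = new EmitterRegistry();
        registry.register("output", (key, content) -> System.out.println("output <- " + key));
        registry.register("archive", (key, content) -> System.out.println("archive <- " + key));

        // The iterator's emitterId selects which declared emitter receives the result.
        registry.lookup("output").emit("a.pdf", "extracted text");
    }
}
```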

Available Emitters

Plugin	Component name	Notes
File System	`file-system-emitter`	Local / mounted filesystem.
Amazon S3	`s3-emitter`	S3 or S3-compatible.
Google Cloud Storage	`gcs-emitter`	GCS via ADC.
Azure Blob Storage	`az-blob-emitter`	SAS-token auth.
OpenSearch	`opensearch-emitter`	REST-based bulk indexing.
Elasticsearch	`es-emitter`	REST-based bulk indexing; ApiKey or basic auth.
Apache Solr	`solr-emitter`	SolrCloud (URLs or ZooKeeper).
JDBC	`jdbc-emitter`	Any RDBMS with a JDBC driver.
Apache Kafka	`kafka-emitter`	Standard Kafka producer.
