Pipes Iterators

A pipes iterator enumerates the documents to be processed. It emits one FetchEmitTuple per document; the pipeline workers then call the bound fetcher (to get the bytes), the parser, and the bound emitter (to write the result).

The Iterator Contract

A PipesIterator produces a stream of FetchEmitTuple records. Each tuple carries:

  • the fetch key — passed to the fetcher to retrieve the document bytes

  • the emit key — passed to the emitter to decide where to write results

  • an optional id and arbitrary metadata fields

The iterator runs on its own thread; the pipeline reads tuples as fast as the worker pool can keep up.

Wiring an Iterator Into a Pipeline

The iterator lives under the singular top-level pipes-iterator key. The inner map key is the iterator’s component name. fetcherId and emitterId are flat fields on the iterator config, alongside the iterator-specific options:

{
  "fetchers": { "fsf": { "file-system-fetcher": { "basePath": "/data/in" } } },
  "emitters": { "fse": { "file-system-emitter": { "basePath": "/data/out" } } },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/in",
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  }
}

Only one iterator is active per pipeline. To process multiple sources in parallel, run multiple pipelines.

Available Iterators

Plugin Component name Notes

File System

file-system-pipes-iterator

Recursively walks a directory tree.

Amazon S3

s3-pipes-iterator

Lists S3 objects under a prefix.

Google Cloud Storage

gcs-pipes-iterator

Lists GCS objects under a prefix.

Azure Blob Storage

az-blob-pipes-iterator

Lists blobs under a prefix.

Apache Solr

solr-pipes-iterator

Queries a Solr collection (useful for re-parsing).

JDBC

jdbc-pipes-iterator

Walks rows from a SELECT query.

Apache Kafka

kafka-pipes-iterator

Consumes fetch-request messages from a topic.

CSV

csv-pipes-iterator

Reads work items from a CSV file.

JSON

json-pipes-iterator

Reads work items from a JSON-lines file.

For the full plugin / interface matrix, see Plugins.