Pipes Iterators
A pipes iterator enumerates the documents to be processed. It emits one FetchEmitTuple per document; the pipeline workers then call the bound fetcher (to get the bytes), the parser, and the bound emitter (to write the result).
The Iterator Contract
A PipesIterator produces a stream of FetchEmitTuple records. Each tuple carries:
-
the fetch key — passed to the fetcher to retrieve the document bytes
-
the emit key — passed to the emitter to decide where to write results
-
an optional
idand arbitrary metadata fields
The iterator runs on its own thread; the pipeline reads tuples as fast as the worker pool can keep up.
Wiring an Iterator Into a Pipeline
The iterator lives under the singular top-level pipes-iterator key. The inner map key is the iterator’s component name. fetcherId and emitterId are flat fields on the iterator config, alongside the iterator-specific options:
{
"fetchers": { "fsf": { "file-system-fetcher": { "basePath": "/data/in" } } },
"emitters": { "fse": { "file-system-emitter": { "basePath": "/data/out" } } },
"pipes-iterator": {
"file-system-pipes-iterator": {
"basePath": "/data/in",
"fetcherId": "fsf",
"emitterId": "fse"
}
}
}
Only one iterator is active per pipeline. To process multiple sources in parallel, run multiple pipelines.
Available Iterators
| Plugin | Component name | Notes |
|---|---|---|
|
Recursively walks a directory tree. |
|
|
Lists S3 objects under a prefix. |
|
|
Lists GCS objects under a prefix. |
|
|
Lists blobs under a prefix. |
|
|
Queries a Solr collection (useful for re-parsing). |
|
|
Walks rows from a SELECT query. |
|
|
Consumes fetch-request messages from a topic. |
|
|
Reads work items from a CSV file. |
|
|
Reads work items from a JSON-lines file. |
For the full plugin / interface matrix, see Plugins.