Pipes Iterators
- File System Iterator (file-system-pipes-iterator)
- S3 Iterator (s3-pipes-iterator)
- GCS Iterator (gcs-pipes-iterator)
- Azure Blob Iterator (az-blob-pipes-iterator)
- CSV Iterator (csv-pipes-iterator)
- JDBC Iterator (jdbc-pipes-iterator)
- Solr Iterator (solr-pipes-iterator)
- JSON Iterator (json-pipes-iterator)
- Kafka Iterator (kafka-pipes-iterator)
Pipes iterators enumerate the documents to be processed. Each iterator produces fetch/emit tuples that the pipeline consumes.
All iterators share a baseConfig block that specifies which fetcher and emitter
to use:
    "baseConfig": {
      "fetcherId": "my-fetcher-id",
      "emitterId": "my-emitter-id"
    }
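To make the relationship between baseConfig and the tuples an iterator produces more concrete, here is a minimal Python sketch. It is not Tika's implementation: the `FetchEmitTuple` class and the `make_tuple` helper are hypothetical names, and the sketch only assumes that each tuple pairs a fetch key with an emit key and carries the fetcher and emitter IDs taken from baseConfig.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchEmitTuple:
    """Illustrative stand-in for one unit of work (not Tika's real class)."""
    fetcher_id: str  # which fetcher retrieves the document
    fetch_key: str   # what the fetcher should retrieve
    emitter_id: str  # which emitter receives the result
    emit_key: str    # where the emitter should write it


def make_tuple(base_config: dict, fetch_key: str, emit_key: str) -> FetchEmitTuple:
    """Combine the shared baseConfig IDs with the keys for one document."""
    return FetchEmitTuple(
        fetcher_id=base_config["fetcherId"],
        fetch_key=fetch_key,
        emitter_id=base_config["emitterId"],
        emit_key=emit_key,
    )


if __name__ == "__main__":
    base_config = {"fetcherId": "my-fetcher-id", "emitterId": "my-emitter-id"}
    print(make_tuple(base_config, "docs/report.pdf", "docs/report.pdf.json"))
```

The keys in the usage lines are made up; in practice each iterator decides how fetch and emit keys are derived from its source.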
File System Iterator (file-system-pipes-iterator)
Recursively walks a directory tree.
Module: tika-pipes-file-system
| Field | Default | Description |
|---|---|---|
| | required | Directory to walk. |
| | | Count total files before processing (enables progress reporting). |
| baseConfig | required | Fetcher/emitter IDs. |
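The recursive walk and the optional up-front count can be sketched in a few lines of Python. This is an illustration of the idea only, not the iterator's code: `count_files` and `walk_tuples` are hypothetical names, and using the path relative to the root as both fetch key and emit key is an assumption made for the example.

```python
import os
from typing import Iterator, Tuple


def count_files(root: str) -> int:
    """Count every regular file under root (a full extra pass over the tree)."""
    return sum(len(filenames) for _, _, filenames in os.walk(root))


def walk_tuples(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (fetch_key, emit_key) pairs for each file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            # Assumption: keys are paths relative to the walked directory.
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            yield rel, rel
```

Counting first costs a second traversal of the tree, which is the trade-off for being able to report progress against a known total.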
S3 Iterator (s3-pipes-iterator)
Lists objects in an S3 bucket.
Module: tika-pipes-s3
| Field | Default | Description |
|---|---|---|
| | required | S3 bucket name. |
| | required | AWS region. |
| | none | Key prefix to filter objects. |
| | | Credentials type. |
| baseConfig | required | Fetcher/emitter IDs. |
GCS Iterator (gcs-pipes-iterator)
Lists objects in a Google Cloud Storage bucket.
Module: tika-pipes-gcs
Azure Blob Iterator (az-blob-pipes-iterator)
Lists blobs in an Azure Blob Storage container.
Module: tika-pipes-az-blob
CSV Iterator (csv-pipes-iterator)
Reads rows from a CSV file to generate fetch/emit tuples.
Module: tika-pipes-csv
| Field | Default | Description |
|---|---|---|
| | required | Path to the CSV file. |
| | required | Column name containing the fetch key (file path, S3 key, etc.). |
| | none | Column name for the emit key. If omitted, uses the fetch key. |
| baseConfig | required | Fetcher/emitter IDs. |
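The fallback from emit key to fetch key is easiest to see in code. The following Python sketch mirrors the behavior described in the table; the function name `csv_tuples` and its parameter names are hypothetical and do not correspond to the iterator's actual configuration fields.

```python
import csv
from typing import Iterator, Optional, TextIO, Tuple


def csv_tuples(
    handle: TextIO,
    fetch_key_column: str,
    emit_key_column: Optional[str] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (fetch_key, emit_key) for each row of a CSV with a header line."""
    for row in csv.DictReader(handle):
        fetch_key = row[fetch_key_column]
        # With no emit key column configured, reuse the fetch key.
        emit_key = row[emit_key_column] if emit_key_column else fetch_key
        yield fetch_key, emit_key
```

For a file whose header is `path,out`, calling `csv_tuples(handle, "path")` yields pairs in which both keys are the `path` value, while `csv_tuples(handle, "path", "out")` takes the emit key from the `out` column.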
JDBC Iterator (jdbc-pipes-iterator)
Executes a SQL query and uses each row as a fetch/emit tuple.
Module: tika-pipes-jdbc
| Field | Default | Description |
|---|---|---|
| | required | JDBC connection string. |
| | required | SQL SELECT query. |
| | required | Column containing the fetch key. |
| | none | Column containing the document ID. |
| baseConfig | required | Fetcher/emitter IDs. |
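As a rough analogue of turning query rows into units of work, the Python sketch below uses the standard-library `sqlite3` module in place of JDBC. The function `query_tuples`, its parameters, and the choice to return the document ID alongside the fetch key are assumptions for illustration; how the real iterator uses the ID column is not specified here.

```python
import sqlite3
from typing import Iterator, Optional, Tuple


def query_tuples(
    conn: sqlite3.Connection,
    select_sql: str,
    fetch_key_column: str,
    id_column: Optional[str] = None,
) -> Iterator[Tuple[str, Optional[object]]]:
    """Run a SELECT and yield (fetch_key, doc_id) for each result row."""
    cursor = conn.execute(select_sql)
    # cursor.description holds one 7-tuple per result column; item 0 is the name.
    columns = [description[0] for description in cursor.description]
    fetch_idx = columns.index(fetch_key_column)
    id_idx = columns.index(id_column) if id_column else None
    for row in cursor:
        doc_id = row[id_idx] if id_idx is not None else None
        yield row[fetch_idx], doc_id
```

Looking the columns up by name rather than position keeps the sketch independent of the column order in the SELECT list.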
Solr Iterator (solr-pipes-iterator)
Queries a Solr collection and uses each document as a fetch/emit tuple.
Module: tika-pipes-solr
JSON Iterator (json-pipes-iterator)
Reads an array of objects from a JSON file.
Module: tika-pipes-json
| Field | Default | Description |
|---|---|---|
| | required | Path to the JSON file. |
| baseConfig | required | Fetcher/emitter IDs. |
Kafka Iterator (kafka-pipes-iterator)
Consumes messages from a Kafka topic as fetch/emit tuples.
Module: tika-pipes-kafka
| Field | Default | Description |
|---|---|---|
| | required | Kafka topic. |
| | required | Kafka broker addresses. |
| | required | Consumer group ID. |
| | | Where to start reading: |
| baseConfig | required | Fetcher/emitter IDs. |