Pipes Iterators

Pipes iterators enumerate the documents to be processed. Each iterator produces fetch/emit tuples that the pipeline consumes.

All iterators share a baseConfig block that specifies which fetcher and emitter to use:

"baseConfig": {
  "fetcherId": "my-fetcher-id",
  "emitterId": "my-emitter-id"
}

File System Iterator (file-system-pipes-iterator)

Recursively walks a directory tree.

Module: tika-pipes-file-system

| Field | Default | Description |
|---|---|---|
| basePath | required | Directory to walk. |
| countTotal | false | Count total files before processing (enables progress reporting). |
| baseConfig | required | Fetcher/emitter IDs. |
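
As an illustration, the fields above might be combined as in the following sketch. Keying the object by the iterator name is an assumption about the surrounding config layout, and the path and the fetcher/emitter IDs are placeholder values.

```json
{
  "file-system-pipes-iterator": {
    "basePath": "/data/input",
    "countTotal": true,
    "baseConfig": {
      "fetcherId": "my-fetcher-id",
      "emitterId": "my-emitter-id"
    }
  }
}
```

Setting countTotal to true costs an extra pass over the directory tree before processing starts, so it is probably worth enabling only when progress reporting is needed.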

S3 Iterator (s3-pipes-iterator)

Lists objects in an S3 bucket.

Module: tika-pipes-s3

| Field | Default | Description |
|---|---|---|
| bucket | required | S3 bucket name. |
| region | required | AWS region. |
| prefix | none | Key prefix to filter objects. |
| credentialsProvider | profile | Credentials type. |
| baseConfig | required | Fetcher/emitter IDs. |
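
A possible configuration using the fields above is sketched below. The outer key, bucket name, region, prefix, and IDs are illustrative assumptions; only the field names and the profile default come from the table.

```json
{
  "s3-pipes-iterator": {
    "bucket": "my-bucket",
    "region": "us-east-1",
    "prefix": "incoming/",
    "credentialsProvider": "profile",
    "baseConfig": {
      "fetcherId": "my-fetcher-id",
      "emitterId": "my-emitter-id"
    }
  }
}
```

Omitting prefix would list every object in the bucket rather than only those under incoming/.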

GCS Iterator (gcs-pipes-iterator)

Lists objects in a Google Cloud Storage bucket.

Module: tika-pipes-gcs

Azure Blob Iterator (az-blob-pipes-iterator)

Lists blobs in an Azure Blob Storage container.

Module: tika-pipes-az-blob

CSV Iterator (csv-pipes-iterator)

Reads rows from a CSV file to generate fetch/emit tuples.

Module: tika-pipes-csv

| Field | Default | Description |
|---|---|---|
| csvPath | required | Path to the CSV file. |
| fetchKeyColumn | required | Column name containing the fetch key (file path, S3 key, etc.). |
| emitKeyColumn | none | Column name for the emit key. If omitted, uses the fetch key. |
| baseConfig | required | Fetcher/emitter IDs. |
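
The following sketch shows how these fields might fit together. It assumes a hypothetical CSV file with header columns named source_path and output_path; the outer key, file path, column names, and IDs are all placeholders.

```json
{
  "csv-pipes-iterator": {
    "csvPath": "/data/manifest.csv",
    "fetchKeyColumn": "source_path",
    "emitKeyColumn": "output_path",
    "baseConfig": {
      "fetcherId": "my-fetcher-id",
      "emitterId": "my-emitter-id"
    }
  }
}
```

If emitKeyColumn were left out, each row's source_path value would serve as both the fetch key and the emit key.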

JDBC Iterator (jdbc-pipes-iterator)

Executes a SQL query and uses each row as a fetch/emit tuple.

Module: tika-pipes-jdbc

| Field | Default | Description |
|---|---|---|
| connection | required | JDBC connection string. |
| select | required | SQL SELECT query. |
| fetchKeyColumn | required | Column containing the fetch key. |
| idColumn | none | Column containing the document ID. |
| baseConfig | required | Fetcher/emitter IDs. |
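
As a rough sketch, a configuration built from these fields could look like the following. The outer key, the PostgreSQL connection string, the documents table, and its id and path columns are hypothetical; the columns named in fetchKeyColumn and idColumn are assumed to appear in the select result.

```json
{
  "jdbc-pipes-iterator": {
    "connection": "jdbc:postgresql://localhost:5432/mydb",
    "select": "SELECT id, path FROM documents",
    "fetchKeyColumn": "path",
    "idColumn": "id",
    "baseConfig": {
      "fetcherId": "my-fetcher-id",
      "emitterId": "my-emitter-id"
    }
  }
}
```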

Solr Iterator (solr-pipes-iterator)

Queries a Solr collection and uses each document as a fetch/emit tuple.

Module: tika-pipes-solr

JSON Iterator (json-pipes-iterator)

Reads an array of objects from a JSON file.

Module: tika-pipes-json

| Field | Default | Description |
|---|---|---|
| jsonPath | required | Path to the JSON file. |
| baseConfig | required | Fetcher/emitter IDs. |
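
A minimal sketch of this iterator's configuration follows. The outer key, file path, and IDs are placeholders, and the layout of the objects inside the JSON file itself is not shown because it is not specified here.

```json
{
  "json-pipes-iterator": {
    "jsonPath": "/data/manifest.json",
    "baseConfig": {
      "fetcherId": "my-fetcher-id",
      "emitterId": "my-emitter-id"
    }
  }
}
```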

Kafka Iterator (kafka-pipes-iterator)

Consumes messages from a Kafka topic as fetch/emit tuples.

Module: tika-pipes-kafka

| Field | Default | Description |
|---|---|---|
| topic | required | Kafka topic. |
| bootstrapServers | required | Kafka broker addresses. |
| groupId | required | Consumer group ID. |
| autoOffsetReset | earliest | Where to start reading: earliest or latest. |
| baseConfig | required | Fetcher/emitter IDs. |
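
The fields above might be combined as in this sketch. The outer key, topic name, broker address, group ID, and fetcher/emitter IDs are placeholders, and writing bootstrapServers as a single comma-separated string is an assumption borrowed from the usual Kafka client convention.

```json
{
  "kafka-pipes-iterator": {
    "topic": "documents-to-parse",
    "bootstrapServers": "localhost:9092",
    "groupId": "tika-pipes",
    "autoOffsetReset": "earliest",
    "baseConfig": {
      "fetcherId": "my-fetcher-id",
      "emitterId": "my-emitter-id"
    }
  }
}
```

In standard Kafka consumer semantics the offset-reset policy applies only when the consumer group has no committed offset, so changing autoOffsetReset for an existing groupId would typically not replay messages already consumed.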