Fetchers

Table of Contents

The Fetcher Contract
Wiring Fetchers Into a Pipeline
Available Fetchers

A fetcher retrieves the bytes of a document from a source — a local filesystem, an S3 bucket, an HTTP URL, etc. — and returns them as an InputStream to the parser.

The Fetcher Contract

Each fetcher implements Fetcher#fetch(String fetchKey, Metadata metadata, ParseContext parseContext) and returns an InputStream for the named document. The shape of the fetch key depends on the fetcher: for the file-system fetcher it is a path relative to basePath; for the S3 fetcher it is an object key relative to prefix; for the HTTP fetcher it is the URL itself.
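As a rough illustration of how a fetch key might be resolved against a base path, here is a minimal sketch. It does not use the real Fetcher interface: `SimpleFetcher`, `BasePathFetcher`, and `readAll` are hypothetical names, the `Metadata` and `ParseContext` parameters are omitted to keep the example self-contained, and the check that rejects keys escaping the base path is an assumption rather than documented behavior.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FetchKeySketch {

    /** Hypothetical, simplified stand-in for the fetcher contract (no Metadata or ParseContext). */
    interface SimpleFetcher {
        InputStream fetch(String fetchKey) throws IOException;
    }

    /** Treats each fetch key as a path relative to basePath, as described for the file-system fetcher. */
    static final class BasePathFetcher implements SimpleFetcher {
        private final Path basePath;

        BasePathFetcher(Path basePath) {
            this.basePath = basePath.toAbsolutePath().normalize();
        }

        @Override
        public InputStream fetch(String fetchKey) throws IOException {
            Path target = basePath.resolve(fetchKey).normalize();
            // Assumption for this sketch: keys that resolve outside basePath are rejected.
            if (!target.startsWith(basePath)) {
                throw new IOException("fetch key escapes basePath: " + fetchKey);
            }
            return Files.newInputStream(target);
        }
    }

    /** Reads the fetched stream fully and decodes it as UTF-8 text. */
    static String readAll(SimpleFetcher fetcher, String fetchKey) throws IOException {
        try (InputStream in = fetcher.fetch(fetchKey)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway base directory with one document in it.
        Path base = Files.createTempDirectory("fetch-sketch");
        Files.writeString(base.resolve("report.txt"), "hello");

        SimpleFetcher fetcher = new BasePathFetcher(base);
        System.out.println(readAll(fetcher, "report.txt"));
    }
}
```

The same key string would mean something different to another fetcher (an object key or a full URL), which is why the key is passed through as an opaque string and interpreted only inside `fetch`.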

Fetchers are stateless from the pipeline’s perspective — every fetch() call resolves the key independently, so iterators are free to parallelize fetches.
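Because each call resolves its key independently, a caller can issue fetches from several threads against one fetcher instance. The sketch below illustrates that idea only; `SimpleFetcher` and `fetchSizes` are hypothetical names, the in-memory fetcher stands in for a real source, and nothing here reflects how the pipeline actually schedules work.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchSketch {

    /** Hypothetical, simplified stand-in for the fetcher contract. */
    interface SimpleFetcher {
        InputStream fetch(String fetchKey) throws IOException;
    }

    /** Fetches every key on a fixed-size thread pool and returns the byte counts in key order. */
    static List<Integer> fetchSizes(SimpleFetcher fetcher, List<String> keys, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String key : keys) {
                // Each task shares the fetcher but carries no state between calls.
                pending.add(pool.submit(() -> {
                    try (InputStream in = fetcher.fetch(key)) {
                        return in.readAllBytes().length;
                    }
                }));
            }
            List<Integer> sizes = new ArrayList<>();
            for (Future<Integer> result : pending) {
                sizes.add(result.get());
            }
            return sizes;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory documents standing in for a real source.
        Map<String, String> docs = Map.of("a.txt", "alpha", "b.txt", "be");
        SimpleFetcher fetcher = key -> {
            String body = docs.get(key);
            if (body == null) {
                throw new IOException("no such key: " + key);
            }
            return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        };
        System.out.println(fetchSizes(fetcher, List.of("a.txt", "b.txt"), 2));
    }
}
```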

Wiring Fetchers Into a Pipeline

Fetchers live under the top-level fetchers key. Each fetcher gets an ID (the outer map key) and a type-name (the inner map key); the iterator then references the ID through its fetcherId field.

{
  "fetchers": {
    "primary": {
      "file-system-fetcher": {
        "basePath": "/data/input"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "fetcherId": "primary",
      "emitterId": "..."
    }
  }
}

A single pipes config may declare multiple fetchers with different IDs and use them in different iterators or pipelines.
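A config with two fetchers might look like the fragment below. This is a sketch under assumptions: the `archive` ID and the `/data/archive` path are made up for illustration, and only the `basePath` option shown earlier is used. The iterator picks one fetcher by ID; the other stays available to any iterator or pipeline that references `archive`.

```json
{
  "fetchers": {
    "primary": {
      "file-system-fetcher": {
        "basePath": "/data/input"
      }
    },
    "archive": {
      "file-system-fetcher": {
        "basePath": "/data/archive"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "fetcherId": "primary",
      "emitterId": "..."
    }
  }
}
```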

Available Fetchers

Plugin	Component name	Notes
File System	`file-system-fetcher`	Local / mounted filesystem.
Amazon S3	`s3-fetcher`	S3 or S3-compatible (MinIO, LocalStack).
Google Cloud Storage	`gcs-fetcher`	GCS via Application Default Credentials.
Azure Blob Storage	`az-blob-fetcher`	SAS-token auth.
HTTP	`http-fetcher`	HTTP(S) with basic / JWT auth.
Google Drive	`google-drive-fetcher`	Drive API with service-account auth.
Microsoft Graph	`microsoft-graph-fetcher`	OneDrive / SharePoint via Graph.
Atlassian JWT	`atlassian-jwt-fetcher`	Atlassian Connect (Jira/Confluence Cloud).
