Fetchers

A fetcher retrieves the bytes of a document from a source, such as a local filesystem, an S3 bucket, or an HTTP URL, and returns them as an InputStream to the parser.

The Fetcher Contract

Each fetcher implements Fetcher#fetch(String fetchKey, Metadata metadata, ParseContext parseContext) and returns an InputStream for the named document. The shape of the fetch key depends on the fetcher: for the file-system fetcher it is a path relative to basePath; for the S3 fetcher it is an object key relative to prefix; for the HTTP fetcher it is the URL itself.

Fetchers are stateless from the pipeline’s perspective — every fetch() call resolves the key independently, so iterators are free to parallelize fetches.

Wiring Fetchers Into a Pipeline

Fetchers live under the top-level fetchers key. Each fetcher gets an ID (the outer map key) and a type-name (the inner map key); the iterator then references the ID through its fetcherId field.

{
  "fetchers": {
    "primary": {
      "file-system-fetcher": {
        "basePath": "/data/input"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "fetcherId": "primary",
      "emitterId": "..."
    }
  }
}

A single pipes config may declare multiple fetchers with different IDs and use them in different iterators or pipelines.
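
As a minimal sketch of what declaring multiple fetchers might look like, the fragment below registers two file-system fetchers under separate IDs. The second ID ("archive") and its basePath ("/data/archive") are hypothetical values chosen for illustration; only the "fetchers" layout, the "file-system-fetcher" type name, and the "basePath" field come from the example above.

```json
{
  "fetchers": {
    "primary": {
      "file-system-fetcher": {
        "basePath": "/data/input"
      }
    },
    "archive": {
      "file-system-fetcher": {
        "basePath": "/data/archive"
      }
    }
  }
}
```

Under this layout, an iterator would select one of the two by setting its fetcherId to "primary" or "archive".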

Available Fetchers

Plugin                 Component name            Notes
File System            file-system-fetcher       Local / mounted filesystem.
Amazon S3              s3-fetcher                S3 or S3-compatible (MinIO, LocalStack).
Google Cloud Storage   gcs-fetcher               GCS via Application Default Credentials.
Azure Blob Storage     az-blob-fetcher           SAS-token auth.
HTTP                   http-fetcher              HTTP(S) with basic / JWT auth.
Google Drive           google-drive-fetcher      Drive API with service-account auth.
Microsoft Graph        microsoft-graph-fetcher   OneDrive / SharePoint via Graph.
Atlassian JWT          atlassian-jwt-fetcher     Atlassian Connect (Jira/Confluence Cloud).

For the full plugin / interface matrix, see Plugins.