Fetchers
A fetcher retrieves the bytes of a document from a source — a local filesystem, an S3 bucket, an HTTP URL, etc. — and returns them as an InputStream to the parser.
The Fetcher Contract
Each fetcher implements Fetcher#fetch(String fetchKey, Metadata metadata, ParseContext parseContext) and returns an InputStream for the named document. The shape of the fetch key depends on the fetcher: for the file-system fetcher it is a path relative to basePath; for the S3 fetcher it is an object key relative to prefix; for the HTTP fetcher it is the URL itself.
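To make the key-resolution rule concrete, here is a minimal sketch of a filesystem-style fetcher. It is not the project's implementation: `SimpleFetcher` and `BasePathFetcher` are hypothetical names, the interface drops the `Metadata` and `ParseContext` arguments so the example stays self-contained, and rejecting keys that escape `basePath` is an assumption rather than documented behaviour.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical, simplified stand-in for the fetcher contract. The real
// Fetcher#fetch also takes Metadata and ParseContext arguments.
interface SimpleFetcher {
    InputStream fetch(String fetchKey) throws IOException;
}

// Resolves each fetch key as a path relative to basePath.
final class BasePathFetcher implements SimpleFetcher {
    private final Path basePath;

    BasePathFetcher(Path basePath) {
        this.basePath = basePath.toAbsolutePath().normalize();
    }

    @Override
    public InputStream fetch(String fetchKey) throws IOException {
        Path target = basePath.resolve(fetchKey).normalize();
        // Assumption: keys that resolve outside basePath (for example
        // "../secret" or an absolute path) are rejected.
        if (!target.startsWith(basePath)) {
            throw new IOException("fetch key escapes basePath: " + fetchKey);
        }
        return Files.newInputStream(target);
    }
}
```

With `basePath` set to `/data/input`, the key `reports/a.pdf` would resolve to `/data/input/reports/a.pdf`. The caller is responsible for closing the returned stream.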
Fetchers are stateless from the pipeline’s perspective — every fetch() call resolves the key independently, so iterators are free to parallelize fetches.
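Because each call resolves its key independently, a caller can fan fetches out across a thread pool. The sketch below illustrates the idea under stated assumptions: `KeyFetcher` and `ParallelFetch` are hypothetical names, the interface is a simplified stand-in for the real one, and the fixed-size pool is one possible strategy, not a description of how the iterators actually schedule work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical simplified fetcher; the real fetch() also takes
// Metadata and ParseContext arguments.
interface KeyFetcher {
    InputStream fetch(String fetchKey) throws IOException;
}

final class ParallelFetch {
    // Fetches every key on a fixed-size pool and returns the byte length
    // of each document, in the same order as the input keys.
    static List<Integer> fetchSizes(KeyFetcher fetcher, List<String> keys, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String key : keys) {
                // No state is shared between tasks: each one resolves its own key.
                futures.add(pool.submit(() -> {
                    try (InputStream in = fetcher.fetch(key)) {
                        return in.readAllBytes().length;
                    }
                }));
            }
            List<Integer> sizes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                sizes.add(future.get());
            }
            return sizes;
        } finally {
            pool.shutdown();
        }
    }
}
```

A failed fetch surfaces as an `ExecutionException` from `future.get()`, so one bad key does not silently drop out of the result list.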
Wiring Fetchers Into a Pipeline
Fetchers live under the top-level fetchers key. Each fetcher gets an ID (the outer map key) and a type-name (the inner map key); the iterator then references the ID through its fetcherId field.
{
  "fetchers": {
    "primary": {
      "file-system-fetcher": {
        "basePath": "/data/input"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "fetcherId": "primary",
      "emitterId": "..."
    }
  }
}
A single pipes config may declare multiple fetchers with different IDs and use them in different iterators or pipelines.
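The ID-based wiring described above amounts to a lookup table from fetcher ID to a declared type-name and its settings. The following sketch models that lookup only; `FetcherDecl` and `FetcherRegistry` are hypothetical names, and the error handling for duplicate or unknown IDs is an assumption, not the documented behaviour of the config loader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of one declared fetcher: the inner map key
// (type-name) plus the settings nested under it.
final class FetcherDecl {
    private final String typeName;
    private final Map<String, String> settings;

    FetcherDecl(String typeName, Map<String, String> settings) {
        this.typeName = typeName;
        this.settings = Map.copyOf(settings);
    }

    String typeName() { return typeName; }
    Map<String, String> settings() { return settings; }
}

// Maps fetcher IDs (the outer map keys) to their declarations.
final class FetcherRegistry {
    private final Map<String, FetcherDecl> byId = new LinkedHashMap<>();

    void declare(String id, FetcherDecl decl) {
        // Assumption: declaring the same ID twice is treated as an error.
        if (byId.putIfAbsent(id, decl) != null) {
            throw new IllegalArgumentException("duplicate fetcher ID: " + id);
        }
    }

    // Resolves the value of an iterator's fetcherId field.
    FetcherDecl resolve(String fetcherId) {
        FetcherDecl decl = byId.get(fetcherId);
        if (decl == null) {
            throw new IllegalArgumentException("unknown fetcherId: " + fetcherId);
        }
        return decl;
    }
}
```

For example, declaring `primary` and a second, hypothetical `archive` ID, both of type `file-system-fetcher` but with different `basePath` values, lets two iterators each resolve their own fetcher by ID.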
Available Fetchers
| Plugin | Component name | Notes |
|---|---|---|
| | | Local / mounted filesystem. |
| | | S3 or S3-compatible (MinIO, LocalStack). |
| | | GCS via Application Default Credentials. |
| | | SAS-token auth. |
| | | HTTP(S) with basic / JWT auth. |
| | | Drive API with service-account auth. |
| | | OneDrive / SharePoint via Graph. |
| | | Atlassian Connect (Jira/Confluence Cloud). |
For the full plugin / interface matrix, see Plugins.