Fetchers
Fetchers retrieve document bytes from a source. Each fetcher is identified by
its component name and an id that is referenced by the pipes iterator.
File System Fetcher (file-system-fetcher)
Reads files from a local or mounted filesystem.
Module: tika-pipes-file-system
{
  "fetchers": [
    {
      "file-system-fetcher": {
        "id": "my-fetcher",
        "basePath": "/data/documents",
        "extractFileSystemMetadata": true
      }
    }
  ]
}
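The same structure can also be generated from a script, which helps when several fetchers are configured at once. The sketch below is only an illustration: `fetcher_entry` and `build_config` are hypothetical helper names, and the uniqueness check on `id` is an assumption (the page only says the id is what the pipes iterator references).

```python
import json


def fetcher_entry(component, fetcher_id, **fields):
    """Build one item of the "fetchers" array: {component: {"id": ..., fields}}."""
    if not fetcher_id:
        raise ValueError("every fetcher needs a non-empty id")
    return {component: {"id": fetcher_id, **fields}}


def build_config(*entries):
    """Wrap entries in the top-level object; reject duplicate ids (assumed rule)."""
    ids = [next(iter(entry.values()))["id"] for entry in entries]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate fetcher ids: {duplicates}")
    return {"fetchers": list(entries)}


# Reproduces the file-system-fetcher example shown above.
config = build_config(
    fetcher_entry(
        "file-system-fetcher",
        "my-fetcher",
        basePath="/data/documents",
        extractFileSystemMetadata=True,
    )
)
config_json = json.dumps(config, indent=2)
```

Writing `config_json` to a file yields the same document as the hand-written example.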
| Field | Default | Description |
|---|---|---|
| basePath | required | Base directory. Fetch keys are resolved relative to this path. |
| extractFileSystemMetadata | | Extract file created/modified timestamps and size into metadata. |
| | | Allow absolute fetch keys when |
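To make "resolved relative to this path" concrete, here is a minimal sketch of how a fetch key could be joined to a base directory while refusing keys that escape it. This is not the fetcher's actual code: `resolve_fetch_key` is a hypothetical name, and rejecting every absolute or `..`-escaping key is an assumption made for the illustration.

```python
from pathlib import Path


def resolve_fetch_key(base_path, fetch_key):
    """Resolve fetch_key under base_path; raise if the result leaves the base."""
    base = Path(base_path).resolve()
    # Joining with an absolute key discards the base, so the check below
    # also catches keys such as "/etc/passwd".
    candidate = (base / fetch_key).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"fetch key {fetch_key!r} resolves outside {base}")
    return candidate


# With basePath "/data/documents", the key "reports/a.pdf" points at
# /data/documents/reports/a.pdf (assuming no symlinks on the way).
resolved = resolve_fetch_key("/data/documents", "reports/a.pdf")
```

Checking the resolved path rather than the raw string is a deliberate choice here: it catches `..` segments hidden in the middle of a key.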
S3 Fetcher (s3-fetcher)
Fetches objects from Amazon S3.
Module: tika-pipes-s3
| Field | Default | Description |
|---|---|---|
| | required | S3 bucket name. |
| | required | AWS region (e.g., |
| | | Credentials type: |
| | | AWS profile name (when using |
| | none | Static credentials (when using |
| | none | S3 key prefix. |
| | | Spool object to a temp file before parsing. |
| | | Extract S3 user metadata. |
| | unlimited | Maximum object size to fetch. |
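The spooling and size-limit options can be pictured with a small, library-free sketch: copy a byte stream to a temporary file in chunks and abort once a limit is exceeded. The function name `spool_to_temp`, the chunk size, and the choice to raise `ValueError` are all assumptions for illustration; the S3 fetcher's real behavior is not described on this page.

```python
import io
import os
import tempfile


def spool_to_temp(stream, max_bytes=None, chunk_size=64 * 1024):
    """Copy stream to a temp file and return its path; max_bytes=None means unlimited."""
    tmp = tempfile.NamedTemporaryFile(prefix="fetch-", suffix=".tmp", delete=False)
    written = 0
    try:
        with tmp:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                written += len(chunk)
                if max_bytes is not None and written > max_bytes:
                    raise ValueError(f"object exceeds limit of {max_bytes} bytes")
                tmp.write(chunk)
    except Exception:
        # Do not leave a partial file behind when the copy fails.
        os.unlink(tmp.name)
        raise
    return tmp.name


# Usage with an in-memory stream standing in for an object body.
spooled_path = spool_to_temp(io.BytesIO(b"example bytes"), max_bytes=1024)
```

The caller owns the returned file and should delete it after parsing.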
HTTP Fetcher (http-fetcher)
Fetches documents from HTTP/HTTPS URLs.
Module: tika-pipes-http
| Field | Default | Description |
|---|---|---|
| | none | Basic auth username. |
| | none | Basic auth password. |
| | | Connection timeout. |
| | | Socket read timeout. |
| | | Maximum concurrent connections. |
| | default | HTTP User-Agent header. |
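For readers unfamiliar with what the auth, timeout, and User-Agent options translate to on the wire, the following sketch builds an equivalent request with Python's standard library. It is not how the fetcher is implemented; `build_request` and `fetch` are hypothetical names, and `urllib` exposes a single socket timeout rather than separate connection and read timeouts.

```python
import base64
import urllib.request


def build_request(url, username=None, password=None, user_agent=None):
    """Create a request carrying optional Basic auth and User-Agent headers."""
    request = urllib.request.Request(url)
    if username is not None and password is not None:
        # Basic auth is base64("username:password") in the Authorization header.
        token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    if user_agent is not None:
        request.add_header("User-Agent", user_agent)
    return request


def fetch(url, timeout_seconds=30.0, **options):
    """Fetch the body bytes; the timeout applies to blocking socket operations."""
    request = build_request(url, **options)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read()
```

A call such as `fetch("https://example.com/doc.pdf", username="user", password="pass")` would return the document bytes; the URL and credentials are placeholders.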
GCS Fetcher (gcs-fetcher)
Fetches objects from Google Cloud Storage.
Module: tika-pipes-gcs
| Field | Default | Description |
|---|---|---|
| | required | GCP project ID. |
| | required | GCS bucket name. |
| | none | Key prefix. |
| | | Spool to temp file before parsing. |
| | | Extract GCS user metadata. |
Azure Blob Fetcher (az-blob-fetcher)
Fetches blobs from Azure Blob Storage.
Module: tika-pipes-az-blob
| Field | Default | Description |
|---|---|---|
| | required | Shared Access Signature token. |
| | required | Azure storage endpoint URL. |
| | required | Container name. |
| | none | Blob prefix. |
| | | Extract Azure user metadata. |
Google Drive Fetcher (google-drive-fetcher)
Fetches files from Google Drive via the Drive API.
Module: tika-pipes-google-drive
| Field | Default | Description |
|---|---|---|
| | required | Path to GCP service account JSON key file. |
| | none | User email to impersonate (for domain-wide delegation). |
Microsoft Graph Fetcher (microsoft-graph-fetcher)
Fetches files from Microsoft 365 (OneDrive, SharePoint) via the Graph API.
Module: tika-pipes-microsoft-graph
Atlassian JWT Fetcher (atlassian-jwt-fetcher)
Fetches content from Atlassian products using JWT authentication.
Module: tika-pipes-atlassian-jwt
| Field | Default | Description |
|---|---|---|
| | required | JWT shared secret. |
| | required | JWT issuer / app key. |
| | | Connection timeout. |
| | | Socket read timeout. |
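The shared secret and issuer are the two inputs to a signed JWT. As a rough illustration, the sketch below signs a token with HMAC-SHA256 (HS256) using only the standard library. The claim set (`iss`, `iat`, `exp`), the 180-second lifetime, and the function names are assumptions; real Atlassian endpoints may require additional claims that this sketch omits, and it does not reflect the fetcher's internals.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data):
    """Base64url-encode bytes without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(shared_secret, signing_input):
    digest = hmac.new(
        shared_secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return _b64url(digest)


def sign_jwt(shared_secret, issuer, lifetime_seconds=180, now=None):
    """Return a compact HS256 JWT with iss, iat, and exp claims."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"iss": issuer, "iat": issued_at, "exp": issued_at + lifetime_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _sign(shared_secret, signing_input)


def verify_jwt(token, shared_secret):
    """Check only the signature; expiry and claims are not validated here."""
    signing_input, _, signature = token.rpartition(".")
    return hmac.compare_digest(signature, _sign(shared_secret, signing_input))
```

The resulting token is typically sent in an `Authorization` header; the exact header format expected by a given Atlassian product should be checked against its documentation.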