Fetchers

Fetchers retrieve document bytes from a source. Each fetcher is configured under its component name (for example, file-system-fetcher) and carries an id, which the pipes iterator references to select that fetcher.

File System Fetcher (file-system-fetcher)

Reads files from a local or mounted filesystem.

Module: tika-pipes-file-system

{
  "fetchers": [
    {
      "file-system-fetcher": {
        "id": "my-fetcher",
        "basePath": "/data/documents",
        "extractFileSystemMetadata": true
      }
    }
  ]
}

Field                      Default   Description
basePath                   required  Base directory. Fetch keys are resolved relative to this path.
extractFileSystemMetadata  false     Extract file created/modified timestamps and size into metadata.
allowAbsolutePaths         false     Allow absolute fetch keys when basePath is not set.

S3 Fetcher (s3-fetcher)

Fetches objects from Amazon S3.

Module: tika-pipes-s3

Field                  Default    Description
bucket                 required   S3 bucket name.
region                 required   AWS region (e.g., us-east-1).
credentialsProvider    profile    Credentials type: profile, static, instance.
profile                default    AWS profile name (when using profile credentials).
accessKey / secretKey  none       Static credentials (when using static credentials).
prefix                 none       S3 key prefix.
spoolToTemp            false      Spool object to a temp file before parsing.
extractUserMetadata    false      Extract S3 user metadata.
maxLength              unlimited  Maximum object size to fetch.
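
As a minimal sketch, an S3 fetcher entry might look like the following. It assumes the S3 fetcher uses the same layout as the file-system-fetcher example above; the id, bucket, and prefix values are placeholders, not values from this documentation.

```json
{
  "fetchers": [
    {
      "s3-fetcher": {
        "id": "my-s3-fetcher",
        "bucket": "my-bucket",
        "region": "us-east-1",
        "credentialsProvider": "profile",
        "profile": "default",
        "prefix": "incoming",
        "spoolToTemp": true
      }
    }
  ]
}
```

For static credentials, the table above suggests setting credentialsProvider to static and supplying accessKey and secretKey in place of profile.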

HTTP Fetcher (http-fetcher)

Fetches documents from HTTP/HTTPS URLs.

Module: tika-pipes-http

Field                 Default  Description
userName              none     Basic auth username.
password              none     Basic auth password.
connectTimeoutMillis  30000    Connection timeout, in milliseconds.
socketTimeoutMillis   120000   Socket read timeout, in milliseconds.
maxConnections        200      Maximum concurrent connections.
userAgent             default  HTTP User-Agent header.
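
The following sketch shows an HTTP fetcher with tighter timeouts and a custom User-Agent. It assumes the same layout as the file-system-fetcher example above; the id and the specific values are illustrative only.

```json
{
  "fetchers": [
    {
      "http-fetcher": {
        "id": "my-http-fetcher",
        "connectTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000,
        "maxConnections": 50,
        "userAgent": "my-crawler/1.0"
      }
    }
  ]
}
```

Since userName and password default to none, this sketch sends no basic auth credentials.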

GCS Fetcher (gcs-fetcher)

Fetches objects from Google Cloud Storage.

Module: tika-pipes-gcs

Field                Default   Description
projectId            required  GCP project ID.
bucket               required  GCS bucket name.
prefix               none      Key prefix.
spoolToTemp          false     Spool to temp file before parsing.
extractUserMetadata  false     Extract GCS user metadata.
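
A GCS fetcher entry could be sketched as follows, again assuming the layout of the file-system-fetcher example above. The id, project ID, and bucket name are placeholders.

```json
{
  "fetchers": [
    {
      "gcs-fetcher": {
        "id": "my-gcs-fetcher",
        "projectId": "my-gcp-project",
        "bucket": "my-bucket",
        "extractUserMetadata": true
      }
    }
  ]
}
```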

Azure Blob Fetcher (az-blob-fetcher)

Fetches blobs from Azure Blob Storage.

Module: tika-pipes-az-blob

Field                Default   Description
sasToken             required  Shared Access Signature token.
endpoint             required  Azure storage endpoint URL.
container            required  Container name.
prefix               none      Blob prefix.
extractUserMetadata  false     Extract Azure user metadata.
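
As an illustration, an Azure Blob fetcher entry with the three required fields might look like this. The layout is assumed to match the file-system-fetcher example above, and the id, token, account name, and container are placeholders.

```json
{
  "fetchers": [
    {
      "az-blob-fetcher": {
        "id": "my-az-fetcher",
        "sasToken": "<your-sas-token>",
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "my-container"
      }
    }
  ]
}
```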

Google Drive Fetcher (google-drive-fetcher)

Fetches files from Google Drive via the Drive API.

Module: tika-pipes-google-drive

Field                          Default   Description
serviceAccountCredentialsPath  required  Path to GCP service account JSON key file.
impersonatedUser               none      User email to impersonate (for domain-wide delegation).
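
A Google Drive fetcher entry could be sketched as below, assuming the same layout as the file-system-fetcher example above. The id, key file path, and email address are hypothetical.

```json
{
  "fetchers": [
    {
      "google-drive-fetcher": {
        "id": "my-drive-fetcher",
        "serviceAccountCredentialsPath": "/etc/tika/service-account.json",
        "impersonatedUser": "user@example.com"
      }
    }
  ]
}
```

impersonatedUser can be left out when domain-wide delegation is not in use.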

Microsoft Graph Fetcher (microsoft-graph-fetcher)

Fetches files from Microsoft 365 (OneDrive, SharePoint) via the Graph API.

Module: tika-pipes-microsoft-graph

Atlassian JWT Fetcher (atlassian-jwt-fetcher)

Fetches content from Atlassian products using JWT authentication.

Module: tika-pipes-atlassian-jwt

Field                 Default   Description
sharedSecret          required  JWT shared secret.
issuer                required  JWT issuer / app key.
connectTimeoutMillis  30000     Connection timeout, in milliseconds.
socketTimeoutMillis   120000    Socket read timeout, in milliseconds.
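
An Atlassian JWT fetcher entry might be sketched as follows, assuming the layout of the file-system-fetcher example above. The id, secret, and issuer values are placeholders; the timeouts shown are the documented defaults, written out explicitly.

```json
{
  "fetchers": [
    {
      "atlassian-jwt-fetcher": {
        "id": "my-atlassian-fetcher",
        "sharedSecret": "<your-shared-secret>",
        "issuer": "my-app-key",
        "connectTimeoutMillis": 30000,
        "socketTimeoutMillis": 120000
      }
    }
  ]
}
```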