Amazon S3 Plugin
The Amazon S3 plugin (tika-pipes-s3) provides fetcher, emitter, and iterator interfaces for objects in S3 (or any S3-compatible service such as MinIO).
| Interface | Component name | Class |
|---|---|---|
| Fetcher | s3-fetcher | |
| Emitter | s3-emitter | |
| Iterator | s3-pipes-iterator | |
Credentials
All three components share the same credentialsProvider selector:
- profile: reads credentials from the local AWS profile named by the profile field (e.g., default).
- instance: uses the instance/container role attached to the host (EC2 IAM role, ECS task role, etc.). No additional fields needed.
- key_secret: reads accessKey and secretKey from the config. Avoid checking these into source control; prefer environment-variable substitution or one of the other providers.
The emitter’s validate() enforces these values, but the fetcher and iterator do not — they fail later when the AWS SDK tries to resolve credentials.
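As a rough illustration of the selector, the fragment below defines two fetchers that differ only in how they authenticate: one relies on the instance role, the other on key_secret. The fetcher IDs (s3-instance, s3-keys) and the accessKey/secretKey values are placeholders invented for this sketch, not values taken from the plugin, and real keys should not be committed.

```json
{
  "fetchers": {
    "s3-instance": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "credentialsProvider": "instance"
      }
    },
    "s3-keys": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "credentialsProvider": "key_secret",
        "accessKey": "REPLACE_WITH_ACCESS_KEY",
        "secretKey": "REPLACE_WITH_SECRET_KEY"
      }
    }
  }
}
```

The same three fields apply to the emitter and the iterator, since all components share the selector.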
S3 Fetcher (s3-fetcher)
Reads objects from an S3 bucket. If prefix is set, the fetch key is resolved relative to it.
{
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "prefix": "incoming/",
        "credentialsProvider": "profile",
        "profile": "default",
        "extractUserMetadata": true,
        "spoolToTemp": true
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| bucket | required | S3 bucket name. |
| region | required | AWS region (e.g., us-east-1). |
| prefix | no default | Optional key prefix. Fetch keys are resolved underneath this prefix. |
| credentialsProvider | required | One of profile, instance, or key_secret. |
| profile / accessKey / secretKey | conditional | Required by the matching credentialsProvider. |
| | | If |
| | | If |
| | | Maximum HTTP connections in the S3 client pool. |
| | | Maximum object size, in bytes. |
| endpointConfigurationService | no default | Custom S3 endpoint, for S3-compatible services such as MinIO or LocalStack. |
| pathStyleAccessEnabled | | Force path-style URLs. |
| | no default | Optional rate-limit array; consecutive failures sleep for the corresponding number of seconds. |
S3 Emitter (s3-emitter)
Writes parsed results back to an S3 bucket. The emit key (relative to prefix) is derived from the FetchEmitTuple.
{
  "emitters": {
    "s3e": {
      "s3-emitter": {
        "bucket": "my-tika-output",
        "region": "us-east-1",
        "prefix": "results/",
        "fileExtension": "json",
        "credentialsProvider": "profile",
        "profile": "default"
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| bucket | required | Destination S3 bucket name (validated non-blank). |
| region | required | AWS region (validated non-blank). |
| credentialsProvider | required | One of profile, instance, or key_secret. |
| profile / accessKey / secretKey | conditional | Required by the matching credentialsProvider. |
| prefix | no default | Optional key prefix. A trailing |
| fileExtension | | Extension appended to each emitted key. |
| | | If |
| | | Maximum HTTP connections in the S3 client pool. |
| endpointConfigurationService | no default | Custom S3 endpoint, for S3-compatible services. |
| pathStyleAccessEnabled | | Force path-style URLs. |
S3 Iterator (s3-pipes-iterator)
Lists objects under a bucket/prefix and emits one FetchEmitTuple per object found.
{
  "pipes-iterator": {
    "s3-pipes-iterator": {
      "bucket": "my-tika-input",
      "region": "us-east-1",
      "prefix": "incoming/",
      "credentialsProvider": "profile",
      "profile": "default",
      "fetcherId": "s3f",
      "emitterId": "s3e"
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| bucket | required | S3 bucket to enumerate. |
| region | required | AWS region. |
| prefix | | Key prefix to scope the listing. |
| credentialsProvider | optional | One of profile, instance, or key_secret. |
| profile / accessKey / secretKey | conditional | Auth fields, mirroring the fetcher and emitter. |
| | no default | Optional regex; only keys whose name matches are emitted. |
| | | Maximum HTTP connections in the S3 client pool. |
| pathStyleAccessEnabled | | Force path-style URLs. |
| fetcherId / emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |
Complete Pipeline Example
The example below wires the S3 fetcher, emitter, and iterator into a complete pipeline that lists s3://my-tika-input/incoming/ and writes results to s3://my-tika-output/results/.
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "prefix": "incoming/",
        "credentialsProvider": "profile",
        "profile": "default",
        "extractUserMetadata": true
      }
    }
  },
  "emitters": {
    "s3e": {
      "s3-emitter": {
        "bucket": "my-tika-output",
        "region": "us-east-1",
        "prefix": "results/",
        "fileExtension": "json",
        "credentialsProvider": "profile",
        "profile": "default"
      }
    }
  },
  "pipes-iterator": {
    "s3-pipes-iterator": {
      "bucket": "my-tika-input",
      "region": "us-east-1",
      "prefix": "incoming/",
      "credentialsProvider": "profile",
      "profile": "default",
      "fetcherId": "s3f",
      "emitterId": "s3e"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
Notes
- The fetcher, emitter, and iterator each maintain their own S3 client. Auth and endpoint settings need to be configured per component, not globally.
- The S3 SDK enforces TLS 1.2+ by default; in-flight encryption is on. For at-rest encryption, configure bucket-level SSE on the AWS side.
- When using endpointConfigurationService against MinIO or LocalStack, you almost always need pathStyleAccessEnabled: true.
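A minimal sketch of the per-component and path-style notes, assuming a MinIO server listening on http://localhost:9000 (MinIO's default API port) and assuming endpointConfigurationService accepts the endpoint URL as a string. The bucket names and key values are placeholders. Because each component keeps its own client, the endpoint and path-style settings are repeated on both the fetcher and the emitter.

```json
{
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "credentialsProvider": "key_secret",
        "accessKey": "REPLACE_WITH_MINIO_ACCESS_KEY",
        "secretKey": "REPLACE_WITH_MINIO_SECRET_KEY",
        "endpointConfigurationService": "http://localhost:9000",
        "pathStyleAccessEnabled": true
      }
    }
  },
  "emitters": {
    "s3e": {
      "s3-emitter": {
        "bucket": "my-tika-output",
        "region": "us-east-1",
        "credentialsProvider": "key_secret",
        "accessKey": "REPLACE_WITH_MINIO_ACCESS_KEY",
        "secretKey": "REPLACE_WITH_MINIO_SECRET_KEY",
        "endpointConfigurationService": "http://localhost:9000",
        "pathStyleAccessEnabled": true
      }
    }
  }
}
```

The region field is kept because the tables above list it as required, even though an S3-compatible service may ignore its value.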