Amazon S3 Plugin

The Amazon S3 plugin (tika-pipes-s3) provides fetcher, emitter, and iterator interfaces for objects in S3 (or any S3-compatible service such as MinIO).

Interface  Component name     Class
Fetcher    s3-fetcher         S3Fetcher
Emitter    s3-emitter         S3Emitter
Iterator   s3-pipes-iterator  S3PipesIterator

Credentials

All three components share the same credentialsProvider selector:

  • profile — reads credentials from the local AWS profile named by profile (e.g., default).

  • instance — uses the instance/container role attached to the host (EC2 IAM role, ECS task role, etc.). No additional fields needed.

  • key_secret — reads accessKey and secretKey from the config. Avoid checking these into source control; prefer environment-variable substitution or one of the other providers.

The emitter’s validate() enforces these values. The fetcher and iterator do not, so a misconfiguration in either one surfaces only later, when the AWS SDK tries to resolve credentials.
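As a rough illustration of the selector, the fragment below configures a fetcher with the instance provider and an emitter with key_secret. It uses only fields described on this page. The bucket names are reused from the other examples, and the REPLACE_WITH_... values are placeholders rather than a recommended way to store secrets.

```json
{
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "credentialsProvider": "instance"
      }
    }
  },
  "emitters": {
    "s3e": {
      "s3-emitter": {
        "bucket": "my-tika-output",
        "region": "us-east-1",
        "credentialsProvider": "key_secret",
        "accessKey": "REPLACE_WITH_ACCESS_KEY",
        "secretKey": "REPLACE_WITH_SECRET_KEY"
      }
    }
  }
}
```

The instance variant carries no credential fields at all, which is usually the simpler choice when the host already has a role attached.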

S3 Fetcher (s3-fetcher)

Reads objects from an S3 bucket. If prefix is set, each fetch key is resolved underneath it; otherwise the fetch key is used as the S3 key as-is.

{
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "prefix": "incoming/",
        "credentialsProvider": "profile",
        "profile": "default",
        "extractUserMetadata": true,
        "spoolToTemp": true
      }
    }
  }
}

Configuration

Field                            Default      Description
bucket                           required     S3 bucket name.
region                           required     AWS region (e.g., us-east-1).
prefix                           no default   Optional key prefix. Fetch keys are resolved underneath this prefix.
credentialsProvider              required     One of profile, instance, key_secret. See Credentials.
profile / accessKey / secretKey  conditional  Required by the matching credentialsProvider.
spoolToTemp                      true         If true, the fetched object is spooled to a temp file before being parsed.
extractUserMetadata              true         If true, S3 user-metadata is copied into the parsed Metadata.
maxConnections                   0            Maximum HTTP connections in the S3 client pool. 0 lets the SDK pick a default.
maxLength                        -1           Maximum object size, in bytes. -1 means no limit.
endpointConfigurationService     no default   Custom S3 endpoint, for S3-compatible services such as MinIO or LocalStack.
pathStyleAccessEnabled           false        Force path-style URLs (e.g., https://endpoint/bucket/key). Required by some S3-compatible services.
throttleSeconds                  no default   Optional rate-limit array; consecutive failures sleep for the corresponding number of seconds.
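A minimal sketch of the tuning fields above, assuming throttleSeconds accepts a JSON array of whole seconds. The connection count, size limit, and delays are illustrative values, not recommendations.

```json
{
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "credentialsProvider": "instance",
        "maxConnections": 20,
        "maxLength": 104857600,
        "throttleSeconds": [30, 120, 600]
      }
    }
  }
}
```

Here maxLength is set to 100 MiB (104857600 bytes), and the three throttleSeconds entries would correspond to the sleep after the first, second, and third consecutive failure.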

S3 Emitter (s3-emitter)

Writes parsed results back to an S3 bucket. The emit key (relative to prefix) is derived from the FetchEmitTuple.

{
  "emitters": {
    "s3e": {
      "s3-emitter": {
        "bucket": "my-tika-output",
        "region": "us-east-1",
        "prefix": "results/",
        "fileExtension": "json",
        "credentialsProvider": "profile",
        "profile": "default"
      }
    }
  }
}

Configuration

Field                            Default      Description
bucket                           required     Destination S3 bucket name (validated non-blank).
region                           required     AWS region (validated non-blank).
credentialsProvider              required     One of profile, instance, key_secret (validated). See Credentials.
profile / accessKey / secretKey  conditional  Required by the matching credentialsProvider (validated).
prefix                           no default   Optional key prefix. A trailing / is stripped automatically.
fileExtension                    json         Extension appended to each emitted key.
spoolToTemp                      true         If true, output is spooled locally before being uploaded.
maxConnections                   50           Maximum HTTP connections in the S3 client pool.
endpointConfigurationService     no default   Custom S3 endpoint, for S3-compatible services.
pathStyleAccessEnabled           false        Force path-style URLs.

S3 Iterator (s3-pipes-iterator)

Lists objects under a bucket/prefix and emits one FetchEmitTuple per object found.

{
  "pipes-iterator": {
    "s3-pipes-iterator": {
      "bucket": "my-tika-input",
      "region": "us-east-1",
      "prefix": "incoming/",
      "credentialsProvider": "profile",
      "profile": "default",
      "fetcherId": "s3f",
      "emitterId": "s3e"
    }
  }
}

Configuration

Field                                                           Default      Description
bucket                                                          required     S3 bucket to enumerate.
region                                                          required     AWS region.
prefix                                                          ""           Key prefix to scope the listing.
credentialsProvider                                             optional     One of profile, instance, key_secret. See Credentials.
profile / accessKey / secretKey / endpointConfigurationService  conditional  Auth fields, mirroring the fetcher and emitter.
fileNamePattern                                                 no default   Optional regex; only keys whose name matches are emitted.
maxConnections                                                  50           Maximum HTTP connections in the S3 client pool.
pathStyleAccessEnabled                                          false        Force path-style URLs.
fetcherId / emitterId                                           required     IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract.
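A sketch of restricting the listing with fileNamePattern. The regex is an illustrative choice that keeps only keys ending in .pdf. This page does not say whether the pattern is applied to the full key or only to the file name, so the pattern is written to give the same result in both cases.

```json
{
  "pipes-iterator": {
    "s3-pipes-iterator": {
      "bucket": "my-tika-input",
      "region": "us-east-1",
      "prefix": "incoming/",
      "credentialsProvider": "profile",
      "profile": "default",
      "fileNamePattern": ".*\\.pdf$",
      "fetcherId": "s3f",
      "emitterId": "s3e"
    }
  }
}
```

The backslash is doubled because the regex is embedded in a JSON string; the pattern the iterator receives is .*\.pdf$.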

Complete Pipeline Example

The example below wires the S3 fetcher, emitter, and iterator into a complete pipeline that lists s3://my-tika-input/incoming/ and writes results to s3://my-tika-output/results/.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "prefix": "incoming/",
        "credentialsProvider": "profile",
        "profile": "default",
        "extractUserMetadata": true
      }
    }
  },
  "emitters": {
    "s3e": {
      "s3-emitter": {
        "bucket": "my-tika-output",
        "region": "us-east-1",
        "prefix": "results/",
        "fileExtension": "json",
        "credentialsProvider": "profile",
        "profile": "default"
      }
    }
  },
  "pipes-iterator": {
    "s3-pipes-iterator": {
      "bucket": "my-tika-input",
      "region": "us-east-1",
      "prefix": "incoming/",
      "credentialsProvider": "profile",
      "profile": "default",
      "fetcherId": "s3f",
      "emitterId": "s3e"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}

Notes

  • The fetcher, emitter, and iterator each maintain their own S3 client. Auth and endpoint settings need to be configured per component, not globally.

  • The AWS SDK uses TLS 1.2+ by default, so data in transit is encrypted. For encryption at rest, configure bucket-level server-side encryption (SSE) on the AWS side.

  • When using endpointConfigurationService against MinIO or LocalStack, you almost always need pathStyleAccessEnabled: true.
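To illustrate the first and last notes together, the sketch below points all three components at one S3-compatible endpoint. The endpoint URL http://localhost:9000, the bucket names, and the placeholder credentials are assumptions for a local MinIO-style setup. The endpoint and path-style settings are repeated in every component because none of them is inherited.

```json
{
  "fetchers": {
    "s3f": {
      "s3-fetcher": {
        "bucket": "my-tika-input",
        "region": "us-east-1",
        "credentialsProvider": "key_secret",
        "accessKey": "REPLACE_WITH_ACCESS_KEY",
        "secretKey": "REPLACE_WITH_SECRET_KEY",
        "endpointConfigurationService": "http://localhost:9000",
        "pathStyleAccessEnabled": true
      }
    }
  },
  "emitters": {
    "s3e": {
      "s3-emitter": {
        "bucket": "my-tika-output",
        "region": "us-east-1",
        "credentialsProvider": "key_secret",
        "accessKey": "REPLACE_WITH_ACCESS_KEY",
        "secretKey": "REPLACE_WITH_SECRET_KEY",
        "endpointConfigurationService": "http://localhost:9000",
        "pathStyleAccessEnabled": true
      }
    }
  },
  "pipes-iterator": {
    "s3-pipes-iterator": {
      "bucket": "my-tika-input",
      "region": "us-east-1",
      "credentialsProvider": "key_secret",
      "accessKey": "REPLACE_WITH_ACCESS_KEY",
      "secretKey": "REPLACE_WITH_SECRET_KEY",
      "endpointConfigurationService": "http://localhost:9000",
      "pathStyleAccessEnabled": true,
      "fetcherId": "s3f",
      "emitterId": "s3e"
    }
  }
}
```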