Google Cloud Storage Plugin

Table of Contents

Credentials
GCS Fetcher (gcs-fetcher)
- Configuration
GCS Emitter (gcs-emitter)
- Configuration
GCS Iterator (gcs-pipes-iterator)
- Configuration
Complete Pipeline Example
Notes

The Google Cloud Storage plugin (tika-pipes-gcs) provides fetcher, emitter, and iterator interfaces for objects in GCS buckets.

Interface	Component name	Class
Fetcher	`gcs-fetcher`	`GCSFetcher`
Emitter	`gcs-emitter`	`GCSEmitter`
Iterator	`gcs-pipes-iterator`	`GCSPipesIterator`

Credentials

The GCS plugin relies on Google’s Application Default Credentials (ADC) chain; the JSON config itself has no credential fields. Supply credentials in one of these ways:

- Run on a GCP service (GCE, GKE, or Cloud Run), which uses the attached service account automatically.
- Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of a service-account JSON key.
- Run `gcloud auth application-default login` for local development.

The projectId field in each component selects which GCP project to bill the API calls against; the service account or user must have storage access to the named bucket.
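
To make the credential lookup concrete, the sketch below mimics the order in which Application Default Credentials are usually resolved: the environment variable first, then the file written by `gcloud auth application-default login`, then the attached service account. It is an illustration only and does not call the Google SDK; the function name `adc_source` is hypothetical, and the file location shown is the Linux/macOS default.

```python
import os
from pathlib import Path
from typing import Callable, Mapping


def adc_source(
    env: Mapping[str, str],
    home: Path,
    exists: Callable[[Path], bool] = Path.exists,
) -> str:
    """Return a description of the credential source ADC would likely pick.

    Hypothetical helper: it only inspects the environment and the file system,
    it does not load or validate any credentials.
    """
    # 1. An explicit key file always wins.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        return f"service-account key file: {key_path}"

    # 2. User credentials created by `gcloud auth application-default login`
    #    (Linux/macOS default location; Windows uses a path under %APPDATA%).
    user_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    if exists(user_file):
        return f"gcloud user credentials: {user_file}"

    # 3. Otherwise fall back to the service account attached to the workload.
    return "attached service account (metadata server)"


if __name__ == "__main__":
    print(adc_source(os.environ, Path.home()))
```

Running the script on the machine that hosts Tika gives a quick hint about which identity needs storage access to the bucket.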

GCS Fetcher (`gcs-fetcher`)

Reads objects from a GCS bucket. The fetch key is the object name.

{
  "fetchers": {
    "gcsf": {
      "gcs-fetcher": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-input",
        "extractUserMetadata": true,
        "spoolToTemp": true
      }
    }
  }
}

Configuration

Field	Default	Description
`projectId`	required	GCP project ID for billing/authentication.
`bucket`	required	GCS bucket name.
`spoolToTemp`	`true`	If `true`, the fetched object is spooled to a temp file before parsing.
`extractUserMetadata`	`true`	If `true`, GCS custom metadata is copied into the parsed `Metadata`.
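
The table above can be read as "two required fields plus two defaults". The sketch below applies that reading to a raw `gcs-fetcher` block before it is handed to Tika. It is a pre-flight check written for illustration, not the plugin's own validation; `resolve_fetcher_config` and the constant names are hypothetical.

```python
from typing import Any, Dict

# Defaults and required fields as listed in the fetcher configuration table.
FETCHER_DEFAULTS: Dict[str, Any] = {"spoolToTemp": True, "extractUserMetadata": True}
FETCHER_REQUIRED = ("projectId", "bucket")


def resolve_fetcher_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Check required fields and fill in defaults for a gcs-fetcher block."""
    missing = [
        field
        for field in FETCHER_REQUIRED
        if not isinstance(raw.get(field), str) or not raw[field].strip()
    ]
    if missing:
        raise ValueError(f"gcs-fetcher is missing required field(s): {', '.join(missing)}")
    # Explicit settings override the defaults.
    return {**FETCHER_DEFAULTS, **raw}


if __name__ == "__main__":
    print(resolve_fetcher_config({"projectId": "my-gcp-project", "bucket": "my-tika-input"}))
```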

GCS Emitter (`gcs-emitter`)

Writes parsed results to a GCS bucket. The emit key (relative to prefix) is derived from the FetchEmitTuple.

{
  "emitters": {
    "gcse": {
      "gcs-emitter": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-output",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  }
}

Configuration

Field	Default	Description
`projectId`	required	GCP project ID (validated non-blank).
`bucket`	required	Destination GCS bucket (validated non-blank).
`prefix`	no default	Optional object-name prefix. A trailing `/` is stripped automatically.
`fileExtension`	`json`	Extension appended to each emitted object name.
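
As a rough illustration of how `prefix` and `fileExtension` shape the destination object name, the sketch below composes a name from an emit key. It assumes the stripped prefix is rejoined to the key with a single `/` and the extension is appended after a `.`; that composition is an assumption, and `emitted_object_name` is a hypothetical helper rather than plugin code.

```python
def emitted_object_name(emit_key: str, prefix: str = "", file_extension: str = "json") -> str:
    """Compose the destination object name for one emitted result (assumed layout)."""
    # A trailing "/" on the prefix is stripped, as the table describes.
    if prefix.endswith("/"):
        prefix = prefix[:-1]

    name = emit_key
    if prefix:
        # Assumption: prefix and emit key are joined with exactly one "/".
        name = f"{prefix}/{name}"
    if file_extension:
        # Assumption: the extension is appended after a "." separator.
        name = f"{name}.{file_extension}"
    return name


if __name__ == "__main__":
    print(emitted_object_name("incoming/report.pdf", prefix="results/"))
```

Under these assumptions, `"results"` and `"results/"` produce the same object names, so either spelling of the prefix is safe in the config.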

GCS Iterator (`gcs-pipes-iterator`)

Lists objects under a bucket/prefix and emits one FetchEmitTuple per object.

{
  "pipes-iterator": {
    "gcs-pipes-iterator": {
      "projectId": "my-gcp-project",
      "bucket": "my-tika-input",
      "prefix": "incoming/",
      "fetcherId": "gcsf",
      "emitterId": "gcse"
    }
  }
}

Configuration

Field	Default	Description
`bucket`	required	GCS bucket to enumerate.
`projectId`	`""`	GCP project ID for the listing API call.
`prefix`	`""`	Object-name prefix to scope the listing.
`fetcherId` / `emitterId`	required	IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract.
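
The iterator's behavior can be pictured as a prefix filter over the bucket listing that tags every match with the configured fetcher and emitter IDs. The sketch below works on an in-memory list of object names instead of a real bucket. `FetchEmitPair` is a hypothetical stand-in for Tika's FetchEmitTuple, and reusing the object name as the emit key is an assumption.

```python
from typing import Iterable, Iterator, NamedTuple


class FetchEmitPair(NamedTuple):
    """Hypothetical stand-in for a FetchEmitTuple."""

    fetcher_id: str
    fetch_key: str
    emitter_id: str
    emit_key: str


def tuples_for(
    object_names: Iterable[str],
    fetcher_id: str,
    emitter_id: str,
    prefix: str = "",
) -> Iterator[FetchEmitPair]:
    """Yield one pair per object whose name starts with the prefix."""
    for name in object_names:
        if name.startswith(prefix):
            # Assumption: the object name doubles as the emit key.
            yield FetchEmitPair(fetcher_id, name, emitter_id, name)


if __name__ == "__main__":
    listing = ["incoming/a.pdf", "archive/b.pdf", "incoming/c.docx"]
    for pair in tuples_for(listing, "gcsf", "gcse", prefix="incoming/"):
        print(pair)
```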

Complete Pipeline Example

The example below wires the GCS fetcher, emitter, and iterator together for a bucket-to-bucket pipeline.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "gcsf": {
      "gcs-fetcher": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-input",
        "extractUserMetadata": true
      }
    }
  },
  "emitters": {
    "gcse": {
      "gcs-emitter": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-output",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  },
  "pipes-iterator": {
    "gcs-pipes-iterator": {
      "projectId": "my-gcp-project",
      "bucket": "my-tika-input",
      "prefix": "incoming/",
      "fetcherId": "gcsf",
      "emitterId": "gcse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
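
A common mistake in a config like the one above is an iterator whose `fetcherId` or `emitterId` does not match any key under `fetchers` or `emitters`. The sketch below is a small pre-flight check for that wiring. It is not part of Tika; `check_wiring` and the trimmed `EXAMPLE` config are illustrative.

```python
import json
from typing import List

# Trimmed version of the pipeline config, keeping only the parts that are cross-referenced.
EXAMPLE = json.loads("""
{
  "fetchers": {"gcsf": {"gcs-fetcher": {"projectId": "my-gcp-project", "bucket": "my-tika-input"}}},
  "emitters": {"gcse": {"gcs-emitter": {"projectId": "my-gcp-project", "bucket": "my-tika-output"}}},
  "pipes-iterator": {
    "gcs-pipes-iterator": {"bucket": "my-tika-input", "fetcherId": "gcsf", "emitterId": "gcse"}
  }
}
""")


def check_wiring(config: dict) -> List[str]:
    """Return a list of problems where the iterator points at undefined components."""
    problems = []
    for name, settings in config.get("pipes-iterator", {}).items():
        for id_field, section in (("fetcherId", "fetchers"), ("emitterId", "emitters")):
            target = settings.get(id_field)
            if target not in config.get(section, {}):
                problems.append(f"{name}: {id_field} {target!r} is not defined under '{section}'")
    return problems


if __name__ == "__main__":
    print(check_wiring(EXAMPLE) or "wiring looks consistent")
```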

Notes

- The GCS plugin uses the official google-cloud-storage SDK. Set `GOOGLE_APPLICATION_CREDENTIALS` (or rely on workload identity or the metadata server) to authenticate.
- Each component creates its own Storage client. When running at high throughput, keep the combined request rate within your project’s per-second request quota.
- Unlike S3, there is no path-style toggle; GCS uses a single global endpoint.
