Google Cloud Storage Plugin

The Google Cloud Storage plugin (tika-pipes-gcs) provides fetcher, emitter, and iterator interfaces for objects in GCS buckets.

Interface   Component name       Class
Fetcher     gcs-fetcher          GCSFetcher
Emitter     gcs-emitter          GCSEmitter
Iterator    gcs-pipes-iterator   GCSPipesIterator

Credentials

The GCS plugin relies on Google’s Application Default Credentials (ADC) chain; the JSON config itself has no credential fields. Supply credentials in one of these ways:

  • Running on a GCP service (GCE/GKE/Cloud Run), which uses the attached service account automatically.

  • Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a service-account JSON key.

  • Running gcloud auth application-default login for local development.

The projectId field in each component selects which GCP project the API calls are billed against. The service account or user must also have access to the named bucket: read and list permission for the fetcher and iterator, write permission for the emitter.
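
The credential sources listed above are consulted in a fixed order, so it can help to check which one a given machine is likely to use before starting a pipeline. The following is a minimal sketch, not part of the plugin: describe_adc_source is a hypothetical helper, and the well-known file path shown is the Linux/macOS location written by gcloud. It only inspects the local environment and does not contact GCP.

```python
import os
from pathlib import Path


def describe_adc_source(env=None, home=None):
    """Report which ADC source is likely to be picked up, in lookup order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # 1. Explicit service-account key file named by the environment variable.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        if Path(key_path).is_file():
            return f"key file from GOOGLE_APPLICATION_CREDENTIALS: {key_path}"
        return f"GOOGLE_APPLICATION_CREDENTIALS is set but not a file: {key_path}"

    # 2. User credentials written by `gcloud auth application-default login`
    #    (Linux/macOS location; Windows keeps the file under %APPDATA%/gcloud).
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    if well_known.is_file():
        return f"gcloud user credentials: {well_known}"

    # 3. Otherwise ADC falls back to the attached service account, if any.
    return "no local credentials; relies on the attached service account"


if __name__ == "__main__":
    print(describe_adc_source())
```

A "set but not a file" result usually points to a typo in the path, which is easier to spot here than in a failed API call.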

GCS Fetcher (gcs-fetcher)

Reads objects from a GCS bucket. The fetch key is the object name.

{
  "fetchers": {
    "gcsf": {
      "gcs-fetcher": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-input",
        "extractUserMetadata": true,
        "spoolToTemp": true
      }
    }
  }
}

Configuration

Field                 Default    Description
projectId             required   GCP project ID for billing/authentication.
bucket                required   GCS bucket name.
spoolToTemp           true       If true, the fetched object is spooled to a temp file before parsing.
extractUserMetadata   true       If true, GCS custom metadata is copied into the parsed Metadata.
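
To see how the required fields and defaults in the table combine, the sketch below fills in the documented defaults for a gcs-fetcher block and rejects blank required values. It is an illustration only: effective_fetcher_config is a hypothetical helper, and the plugin's own validation may differ in wording and strictness.

```python
import json

# Field names and defaults are taken from the configuration table above.
REQUIRED = ("projectId", "bucket")
DEFAULTS = {"spoolToTemp": True, "extractUserMetadata": True}


def effective_fetcher_config(raw):
    """Return the gcs-fetcher settings with documented defaults filled in."""
    for name in REQUIRED:
        value = raw.get(name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"gcs-fetcher: '{name}' is required")
    # Explicit settings win over defaults.
    return {**DEFAULTS, **raw}


if __name__ == "__main__":
    cfg = effective_fetcher_config(
        {"projectId": "my-gcp-project", "bucket": "my-tika-input"}
    )
    # Wrap the result under the fetcher ID used in the examples ("gcsf").
    print(json.dumps({"fetchers": {"gcsf": {"gcs-fetcher": cfg}}}, indent=2))
```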

GCS Emitter (gcs-emitter)

Writes parsed results to a GCS bucket. The emit key (relative to prefix) is derived from the FetchEmitTuple.

{
  "emitters": {
    "gcse": {
      "gcs-emitter": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-output",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  }
}

Configuration

Field           Default      Description
projectId       required     GCP project ID (validated non-blank).
bucket          required     Destination GCS bucket (validated non-blank).
prefix          no default   Optional object-name prefix. A trailing / is stripped automatically.
fileExtension   json         Extension appended to each emitted object name.
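
The prefix and fileExtension rules above determine where each result lands in the output bucket. The sketch below shows one plausible composition, assuming the prefix and emit key are joined with a single / and the extension is appended after a dot. The function emitted_object_name is hypothetical; verify against actual emitter output before depending on exact names.

```python
def emitted_object_name(emit_key, prefix=None, file_extension="json"):
    """Compose an output object name from prefix, emit key, and extension.

    Assumed rule (not confirmed by the plugin source): strip one trailing
    '/' from the prefix, join with '/', then append '.<extension>'.
    """
    name = emit_key
    if prefix:
        if prefix.endswith("/"):
            prefix = prefix[:-1]  # a trailing / is stripped
        if prefix:
            name = prefix + "/" + emit_key
    if file_extension:
        name = name + "." + file_extension
    return name


if __name__ == "__main__":
    print(emitted_object_name("incoming/report.pdf", "results/", "json"))
    # prints: results/incoming/report.pdf.json
```

Under this assumption, "results" and "results/" produce identical object names, which matches the table's note that the trailing slash is stripped.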

GCS Iterator (gcs-pipes-iterator)

Lists objects under a bucket/prefix and emits one FetchEmitTuple per object.

{
  "pipes-iterator": {
    "gcs-pipes-iterator": {
      "projectId": "my-gcp-project",
      "bucket": "my-tika-input",
      "prefix": "incoming/",
      "fetcherId": "gcsf",
      "emitterId": "gcse"
    }
  }
}

Configuration

Field                   Default    Description
bucket                  required   GCS bucket to enumerate.
projectId               ""         GCP project ID for the listing API call.
prefix                  ""         Object-name prefix to scope the listing.
fetcherId / emitterId   required   IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract.
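
The iterator's behavior (one tuple per listed object, scoped by prefix, bound to a fetcher and emitter ID) can be modeled without touching GCS. The sketch below is a simulation over an in-memory list of object names; PipeTuple is a hypothetical stand-in for FetchEmitTuple, and using the object name as both fetch key and emit key is an assumption.

```python
from typing import Iterable, Iterator, NamedTuple


class PipeTuple(NamedTuple):
    """Hypothetical, simplified stand-in for Tika's FetchEmitTuple."""
    fetcher_id: str
    fetch_key: str
    emitter_id: str
    emit_key: str


def iterate_objects(
    object_names: Iterable[str],
    prefix: str = "",
    fetcher_id: str = "gcsf",
    emitter_id: str = "gcse",
) -> Iterator[PipeTuple]:
    """Yield one tuple per object whose name starts with the prefix."""
    for name in object_names:
        if name.startswith(prefix):
            # Assumption: the object name serves as both fetch and emit key.
            yield PipeTuple(fetcher_id, name, emitter_id, name)


if __name__ == "__main__":
    bucket_listing = ["incoming/a.pdf", "incoming/sub/b.docx", "archive/c.pdf"]
    for t in iterate_objects(bucket_listing, prefix="incoming/"):
        print(t)
```

With the default empty prefix every object in the listing is emitted, which mirrors the "" default in the table.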

Complete Pipeline Example

The example below wires the GCS fetcher, emitter, and iterator together for a bucket-to-bucket pipeline.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "gcsf": {
      "gcs-fetcher": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-input",
        "extractUserMetadata": true
      }
    }
  },
  "emitters": {
    "gcse": {
      "gcs-emitter": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-output",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  },
  "pipes-iterator": {
    "gcs-pipes-iterator": {
      "projectId": "my-gcp-project",
      "bucket": "my-tika-input",
      "prefix": "incoming/",
      "fetcherId": "gcsf",
      "emitterId": "gcse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
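
In a combined config like this one, the iterator's fetcherId and emitterId must match keys under fetchers and emitters. The sketch below is a small pre-flight check for that wiring. It is not a Tika tool: check_wiring is a hypothetical helper that only inspects the JSON structure shown in this section.

```python
import json


def check_wiring(config):
    """Return a list of problems found in fetcher/emitter references."""
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    for component, settings in config.get("pipes-iterator", {}).items():
        for field, known in (("fetcherId", fetchers), ("emitterId", emitters)):
            ref = settings.get(field)
            if ref not in known:
                problems.append(f"{component}: {field} '{ref}' is not defined")
    return problems


if __name__ == "__main__":
    # Trimmed version of the pipeline config above; "gcsx" is a deliberate typo.
    text = """
    {
      "fetchers": {"gcsf": {"gcs-fetcher": {"projectId": "p", "bucket": "in"}}},
      "emitters": {"gcse": {"gcs-emitter": {"projectId": "p", "bucket": "out"}}},
      "pipes-iterator": {
        "gcs-pipes-iterator": {"bucket": "in", "fetcherId": "gcsx", "emitterId": "gcse"}
      }
    }
    """
    for problem in check_wiring(json.loads(text)):
        print(problem)
```

An empty result means every ID referenced by the iterator resolves to a defined component.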

Notes

  • The GCS plugin uses the official google-cloud-storage SDK. Set GOOGLE_APPLICATION_CREDENTIALS (or rely on workload identity / metadata server) to authenticate.

  • Each component creates its own Storage client. When raising parallelism (for example, numClients), keep the combined request rate within your project’s per-second request quota.

  • Unlike S3, there is no path-style toggle: GCS uses a single global endpoint.