Google Cloud Storage Plugin
The Google Cloud Storage plugin (tika-pipes-gcs) provides fetcher, emitter, and iterator interfaces for objects in GCS buckets.
| Interface | Component name | Class |
|---|---|---|
| Fetcher | gcs-fetcher | |
| Emitter | gcs-emitter | |
| Iterator | gcs-pipes-iterator | |
Credentials
The GCS plugin relies on Google’s Application Default Credentials chain — there are no credential fields in the JSON config itself. Set credentials by:
- Running on a GCP service (GCE, GKE, or Cloud Run), which uses the attached service account automatically.
- Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a service-account JSON key.
- Running gcloud auth application-default login for local development.
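The order in which these three sources are consulted can be sketched in Python. This is an illustrative troubleshooting aid based on the documented Application Default Credentials search order, not code from the plugin. The function name describe_adc_source and its parameters are hypothetical, and the well-known file path shown is the Linux/macOS location.

```python
import os
from pathlib import Path

# Well-known file written by `gcloud auth application-default login`
# on Linux/macOS (Windows keeps it under the APPDATA gcloud folder).
GCLOUD_ADC_FILE = Path.home() / ".config" / "gcloud" / "application_default_credentials.json"


def describe_adc_source(env=None, file_exists=os.path.exists):
    """Report which credential source ADC would most likely pick.

    Hypothetical helper: it only inspects the environment and the
    local file system, and never contacts Google.
    """
    env = os.environ if env is None else env

    # 1. An explicit key file named by the environment variable wins.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        return f"service-account key file: {key_path}"

    # 2. Next come user credentials created by the gcloud CLI.
    if file_exists(str(GCLOUD_ADC_FILE)):
        return f"gcloud user credentials: {GCLOUD_ADC_FILE}"

    # 3. Otherwise ADC falls back to the attached service account.
    return "attached service account (metadata server), if running on GCP"


if __name__ == "__main__":
    print(describe_adc_source())
```

Running the script on the machine that hosts Tika gives a quick hint about which identity needs storage access to the bucket.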
The projectId field in each component selects which GCP project to bill the API calls against; the service account or user must have storage access to the named bucket.
GCS Fetcher (gcs-fetcher)
Reads objects from a GCS bucket. The fetch key is the object name.
{
  "fetchers": {
    "gcsf": {
      "gcs-fetcher": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-input",
        "extractUserMetadata": true,
        "spoolToTemp": true
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| projectId | required | GCP project ID for billing/authentication. |
| bucket | required | GCS bucket name. |
| extractUserMetadata | | If |
| spoolToTemp | | If |
GCS Emitter (gcs-emitter)
Writes parsed results to a GCS bucket. The emit key (relative to prefix) is derived from the FetchEmitTuple.
{
  "emitters": {
    "gcse": {
      "gcs-emitter": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-output",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| projectId | required | GCP project ID (validated non-blank). |
| bucket | required | Destination GCS bucket (validated non-blank). |
| prefix | no default | Optional object-name prefix. A trailing |
| fileExtension | | Extension appended to each emitted object name. |
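How the prefix, emit key, and file extension might combine into a destination object name can be sketched in Python. The joining rules here are assumptions and have not been checked against the plugin source: the sketch assumes trailing slashes on the prefix are dropped before joining and that the extension is appended after a dot. The function compose_object_name is hypothetical.

```python
def compose_object_name(emit_key, prefix=None, file_extension="json"):
    """Build a destination object name from emitter-style settings.

    Assumed rules (not confirmed against the plugin source):
    - any trailing "/" on the prefix is dropped, then prefix and key
      are joined with a single "/";
    - a non-empty extension is appended as ".<extension>".
    """
    name = emit_key
    if prefix:
        name = prefix.rstrip("/") + "/" + emit_key
    if file_extension:
        name = name + "." + file_extension
    return name


if __name__ == "__main__":
    # prints "results/incoming/report.pdf.json"
    print(compose_object_name("incoming/report.pdf", prefix="results/"))
```

Under these assumptions, the source object name is preserved inside the emitted name, so results can be traced back to their inputs.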
GCS Iterator (gcs-pipes-iterator)
Lists objects under a bucket/prefix and emits one FetchEmitTuple per object.
{
  "pipes-iterator": {
    "gcs-pipes-iterator": {
      "projectId": "my-gcp-project",
      "bucket": "my-tika-input",
      "prefix": "incoming/",
      "fetcherId": "gcsf",
      "emitterId": "gcse"
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| bucket | required | GCS bucket to enumerate. |
| projectId | | GCP project ID for the listing API call. |
| prefix | | Object-name prefix to scope the listing. |
| fetcherId, emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |
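The iterator's behavior, one tuple per object under the prefix, can be illustrated with a small Python sketch. It is a stand-in, not the plugin's implementation: a plain list of names replaces the bucket listing, and the dictionary keys are a hypothetical simplification of FetchEmitTuple.

```python
def iterate_tuples(object_names, prefix, fetcher_id, emitter_id):
    """Yield one fetch/emit tuple (as a dict) per object under the prefix.

    `object_names` stands in for the bucket listing; a real iterator
    would page through the GCS list API instead.
    """
    for name in sorted(object_names):
        # An empty prefix matches every object in the bucket.
        if prefix and not name.startswith(prefix):
            continue
        yield {
            "id": name,
            "fetcherId": fetcher_id,
            "fetchKey": name,
            "emitterId": emitter_id,
            "emitKey": name,
        }


if __name__ == "__main__":
    listing = ["incoming/a.pdf", "archive/b.pdf", "incoming/c.docx"]
    for tup in iterate_tuples(listing, "incoming/", "gcsf", "gcse"):
        print(tup["fetchKey"], "->", tup["emitterId"])
```

Objects outside the prefix (archive/b.pdf in the sample listing) are skipped, which is why scoping the prefix tightly keeps the listing cheap.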
Complete Pipeline Example
The example below wires the GCS fetcher, emitter, and iterator together for a bucket-to-bucket pipeline.
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "gcsf": {
      "gcs-fetcher": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-input",
        "extractUserMetadata": true
      }
    }
  },
  "emitters": {
    "gcse": {
      "gcs-emitter": {
        "projectId": "my-gcp-project",
        "bucket": "my-tika-output",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  },
  "pipes-iterator": {
    "gcs-pipes-iterator": {
      "projectId": "my-gcp-project",
      "bucket": "my-tika-input",
      "prefix": "incoming/",
      "fetcherId": "gcsf",
      "emitterId": "gcse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
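A common mistake in a config like this is an iterator that names a fetcher or emitter ID that is not defined. The Python sketch below is a hypothetical pre-flight check, not part of Tika: it verifies that fetcherId and emitterId resolve to defined components and that projectId and bucket are non-blank for each fetcher and emitter. The function name check_wiring is an assumption made for illustration.

```python
import json


def check_wiring(config):
    """Return a list of problems found in a pipeline config dict."""
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    iterator = config.get("pipes-iterator", {}).get("gcs-pipes-iterator", {})

    # The iterator must point at IDs that exist in the config.
    if iterator.get("fetcherId") not in fetchers:
        problems.append(f"unknown fetcherId: {iterator.get('fetcherId')!r}")
    if iterator.get("emitterId") not in emitters:
        problems.append(f"unknown emitterId: {iterator.get('emitterId')!r}")

    # projectId and bucket are listed as required for fetcher and emitter.
    for section in (fetchers, emitters):
        for component_id, body in section.items():
            for component_name, settings in body.items():
                for field in ("projectId", "bucket"):
                    if not str(settings.get(field, "")).strip():
                        problems.append(
                            f"{component_id} ({component_name}): missing {field}"
                        )
    return problems


if __name__ == "__main__":
    sample = json.loads("""
    {
      "fetchers": {"gcsf": {"gcs-fetcher": {"projectId": "my-gcp-project", "bucket": "my-tika-input"}}},
      "emitters": {"gcse": {"gcs-emitter": {"projectId": "my-gcp-project", "bucket": "my-tika-output"}}},
      "pipes-iterator": {"gcs-pipes-iterator": {"bucket": "my-tika-input", "fetcherId": "gcsf", "emitterId": "gcse"}}
    }
    """)
    print(check_wiring(sample))  # an empty list means no problems were found
```

Loading the real config file with json.load and passing it to check_wiring before starting a long run would surface ID typos early.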
Notes
- The GCS plugin uses the official google-cloud-storage SDK. Set GOOGLE_APPLICATION_CREDENTIALS (or rely on workload identity or the metadata server) to authenticate.
- Each component creates its own Storage client. Heavy throughput should be balanced against your project’s per-second request quota.
- Unlike S3, there is no path-style toggle; GCS uses a single global endpoint.