Azure Blob Storage Plugin
The Azure Blob Storage plugin (tika-pipes-az-blob) provides fetcher, emitter, and iterator interfaces for blobs in Azure Storage containers.
| Interface | Component name | Class |
|---|---|---|
| Fetcher | az-blob-fetcher | |
| Emitter | az-blob-emitter | |
| Iterator | az-blob-pipes-iterator | |
Credentials
All three components authenticate with a SAS (shared-access-signature) token. There are no other auth modes — managed identity, account keys, and AD-based auth are not currently exposed.
- endpoint: base URL of the storage account, e.g., https://myaccount.blob.core.windows.net.
- sasToken: the URL query-string portion of a generated SAS, without a leading ?. Permissions in the token must match the operations the component will perform (read for fetchers/iterators, read+write for emitters).
The emitter’s validate() enforces that sasToken, endpoint, and container are all non-blank, but does not parse the SAS itself — invalid or expired tokens fail later when the Azure SDK makes a request.
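Because the plugin does not parse the token, a pre-flight check outside Tika can catch the most common mistakes before a pipeline starts. The sketch below is not part of the plugin. The function name check_sas is hypothetical, and it assumes the expiry field (se) uses the same timestamp format as the examples on this page.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def check_sas(sas_token, required_perms, now=None):
    """Return a list of problems found in a SAS query string (empty if none)."""
    problems = []
    if sas_token.startswith("?"):
        problems.append("token must not start with '?'")
        sas_token = sas_token[1:]
    fields = parse_qs(sas_token)

    # 'sp' holds the permission letters, e.g. "rl" or "rwl".
    granted = set(fields.get("sp", [""])[0])
    missing = set(required_perms) - granted
    if missing:
        problems.append("missing permissions: " + "".join(sorted(missing)))

    # 'se' is the expiry time; assumed here to look like 2030-01-01T00:00:00Z.
    expiry_raw = fields.get("se", [None])[0]
    if expiry_raw is None:
        problems.append("no expiry (se) field")
        return problems
    try:
        expiry = datetime.strptime(expiry_raw, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        problems.append("unrecognized expiry format: " + expiry_raw)
        return problems
    now = now or datetime.now(timezone.utc)
    if expiry.replace(tzinfo=timezone.utc) <= now:
        problems.append("token expired at " + expiry_raw)
    return problems
```

A fetcher or iterator token would be checked with required_perms="rl" and an emitter token with "rw". The check is only a heuristic: a token that passes can still be rejected by Azure, for example because of a bad signature.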
Azure Blob Fetcher (az-blob-fetcher)
Reads blobs from an Azure Storage container. The fetch key is the blob name.
{
  "fetchers": {
    "azf": {
      "az-blob-fetcher": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-input",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "extractUserMetadata": true,
        "spoolToTemp": true
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| endpoint | required | Storage account URL. |
| container | required | Container name. |
| sasToken | required | SAS token granting read access to the container. |
| extractUserMetadata | | If |
| spoolToTemp | | If |
Azure Blob Emitter (az-blob-emitter)
Writes parsed results to an Azure Storage container. The emit key (relative to prefix) is derived from the FetchEmitTuple.
{
  "emitters": {
    "aze": {
      "az-blob-emitter": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-output",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rwl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "prefix": "results/",
        "fileExtension": "json",
        "overwriteExisting": false
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| endpoint | required | Storage account URL (validated non-blank). |
| container | required | Destination container name (validated non-blank). |
| sasToken | required | SAS token granting read+write access (validated non-blank). |
| prefix | no default | Optional blob-name prefix. A trailing |
| fileExtension | | Extension appended to each emitted blob name. |
| overwriteExisting | | If |
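To make the roles of prefix and fileExtension concrete, here is a rough sketch of how a destination blob name could be assembled from an emit key. It illustrates the idea only and is not the plugin's code: the helper name emitted_blob_name is hypothetical, and the separator handling (exactly one "/" between prefix and key, a "." before the extension) is an assumption that this page does not confirm.

```python
def emitted_blob_name(emit_key, prefix="", file_extension=""):
    """Compose a destination blob name from an emit key (illustrative only)."""
    name = emit_key
    if prefix:
        # Assumption: one "/" joins prefix and key, with or without a
        # trailing "/" on the configured prefix.
        name = prefix.rstrip("/") + "/" + emit_key.lstrip("/")
    if file_extension:
        # Assumption: the extension is appended after a "." separator.
        name = name + "." + file_extension
    return name
```

Under these assumptions, an emit key of incoming/report.pdf with the example settings above would land at results/incoming/report.pdf.json. Check the actual blob names produced by a test run before relying on this layout.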
Azure Blob Iterator (az-blob-pipes-iterator)
Lists blobs under a container/prefix and emits one FetchEmitTuple per blob.
{
  "pipes-iterator": {
    "az-blob-pipes-iterator": {
      "endpoint": "https://myaccount.blob.core.windows.net",
      "container": "tika-input",
      "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
      "prefix": "incoming/",
      "timeoutMillis": 360000,
      "fetcherId": "azf",
      "emitterId": "aze"
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| endpoint | required | Storage account URL. |
| container | required | Container to enumerate. |
| sasToken | required | SAS token granting list+read access. |
| prefix | | Blob-name prefix to scope the listing. |
| timeoutMillis | 360000 | Per-request timeout, in milliseconds (6 minutes by default). |
| fetcherId, emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |
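The iterator's behavior can be pictured as a prefix filter over blob names that tags each match with the configured fetcher and emitter IDs. The sketch below models that with an in-memory list instead of the Azure SDK. FetchEmitPair is a hypothetical stand-in for FetchEmitTuple, and reusing the blob name as both fetch key and emit key is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchEmitPair:
    """Hypothetical stand-in for a FetchEmitTuple."""
    fetcher_id: str
    fetch_key: str
    emitter_id: str
    emit_key: str


def iterate_blobs(blob_names, prefix, fetcher_id, emitter_id):
    """Yield one pair per blob whose name starts with the given prefix."""
    for name in blob_names:
        if name.startswith(prefix):
            # Assumption: the blob name serves as both fetch key and emit key.
            yield FetchEmitPair(fetcher_id, name, emitter_id, name)
```

With the example configuration, a blob named incoming/a.pdf would produce a pair bound to fetcher azf and emitter aze, while a blob named archive/c.pdf would be skipped.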
Complete Pipeline Example
The example below wires the Azure Blob fetcher, emitter, and iterator together into a container-to-container pipeline.
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "azf": {
      "az-blob-fetcher": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-input",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "extractUserMetadata": true
      }
    }
  },
  "emitters": {
    "aze": {
      "az-blob-emitter": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-output",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rwl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  },
  "pipes-iterator": {
    "az-blob-pipes-iterator": {
      "endpoint": "https://myaccount.blob.core.windows.net",
      "container": "tika-input",
      "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
      "prefix": "incoming/",
      "fetcherId": "azf",
      "emitterId": "aze"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
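In a configuration like this, the iterator's fetcherId and emitterId are expected to name entries under fetchers and emitters (azf and aze here). A small check outside Tika can catch a mistyped ID before a run. The sketch below is one possible way to do that; the function name check_iterator_bindings is hypothetical, and it operates on the configuration after it has been loaded into a dictionary.

```python
def check_iterator_bindings(config):
    """Return messages for iterator IDs that match no defined component."""
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    for plugin_name, settings in config.get("pipes-iterator", {}).items():
        fetcher_id = settings.get("fetcherId")
        emitter_id = settings.get("emitterId")
        if fetcher_id not in fetchers:
            problems.append(f"{plugin_name}: unknown fetcherId {fetcher_id!r}")
        if emitter_id not in emitters:
            problems.append(f"{plugin_name}: unknown emitterId {emitter_id!r}")
    return problems
```

The dictionary can come from json.load on the configuration file. An empty result means both IDs resolve to a defined component.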
Notes
- SAS tokens have an expiration baked in. For long-running pipelines, rotate the SAS or use a token that outlives the pipeline window.
- Avoid checking real SAS tokens into source control; the strings in the examples above are placeholders.
- Each component creates its own BlobServiceClient. The Azure SDK pools HTTP connections per client.
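One way to keep real tokens out of source control is to generate the configuration at deploy time from environment variables. The sketch below shows the idea for the fetcher section only. The variable names AZ_BLOB_ENDPOINT and AZ_BLOB_SAS_TOKEN and the function name fetcher_config are hypothetical, not something the plugin reads itself.

```python
import json
import os


def fetcher_config(env=os.environ):
    """Build the fetcher section of the config from environment values."""
    # Strip a leading "?" in case the token was copied with it from a portal URL.
    sas_token = env["AZ_BLOB_SAS_TOKEN"].lstrip("?")
    return {
        "fetchers": {
            "azf": {
                "az-blob-fetcher": {
                    "endpoint": env["AZ_BLOB_ENDPOINT"],
                    "container": "tika-input",
                    "sasToken": sas_token,
                }
            }
        }
    }


def render(config):
    """Serialize the config dictionary as indented JSON text."""
    return json.dumps(config, indent=2)
```

Writing render(fetcher_config()) to a file readable only by the pipeline user keeps the token out of the repository; the same pattern extends to the emitter and iterator sections.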