Azure Blob Storage Plugin

The Azure Blob Storage plugin (tika-pipes-az-blob) provides fetcher, emitter, and iterator interfaces for blobs in Azure Storage containers.

Interface   Component name           Class
---------   ----------------------   -------------------
Fetcher     az-blob-fetcher          AZBlobFetcher
Emitter     az-blob-emitter          AZBlobEmitter
Iterator    az-blob-pipes-iterator   AZBlobPipesIterator

Credentials

All three components authenticate with a SAS (shared access signature) token. No other authentication modes (managed identity, account keys, or AD-based auth) are currently exposed. Two fields carry the connection details:

  • endpoint: base URL of the storage account, for example https://myaccount.blob.core.windows.net

  • sasToken: the query-string portion of a generated SAS URL, without the leading ?. The permissions in the token must cover the operations the component performs (read for fetchers, list and read for iterators, read and write for emitters).

The emitter's validate() checks that sasToken, endpoint, and container are all non-blank, but it does not parse the SAS itself. An invalid or expired token fails only later, when the Azure SDK makes a request.
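Because a bad token only surfaces at request time, it can help to inspect the SAS query string before starting a pipeline. The following is a minimal sketch, not part of the plugin: the function name check_sas_token is hypothetical, and it assumes the expiry field (se) uses the full UTC timestamp form shown in the examples on this page.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def check_sas_token(sas_token, required_permissions, now=None):
    """Return a list of problems found in a SAS query string (empty if none)."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if sas_token.startswith("?"):
        problems.append("token has a leading '?'; pass only the query string")
        sas_token = sas_token[1:]
    fields = parse_qs(sas_token)

    # 'se' is the signed expiry time; assumed here to look like 2030-01-01T00:00:00Z.
    expiry_values = fields.get("se")
    if not expiry_values:
        problems.append("no 'se' (expiry) field found")
    else:
        try:
            expiry = datetime.strptime(
                expiry_values[0], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if expiry <= now:
                problems.append(f"token expired at {expiry.isoformat()}")
        except ValueError:
            problems.append(f"unrecognized expiry format: {expiry_values[0]}")

    # 'sp' is the signed permissions string, one letter per permission.
    granted = set(fields.get("sp", [""])[0])
    missing = sorted(set(required_permissions) - granted)
    if missing:
        problems.append("missing permissions: " + "".join(missing))
    return problems
```

Calling check_sas_token(token, "rl") for a fetcher or iterator token, or check_sas_token(token, "rw") for an emitter token, returns an empty list when the expiry and permission fields look usable. It does not verify the signature, so a token can pass this check and still be rejected by the service.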

Azure Blob Fetcher (az-blob-fetcher)

Reads blobs from an Azure Storage container. The fetch key is the blob name.

{
  "fetchers": {
    "azf": {
      "az-blob-fetcher": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-input",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "extractUserMetadata": true,
        "spoolToTemp": true
      }
    }
  }
}

Configuration

Field                 Default    Description
-------------------   --------   -----------
endpoint              required   Storage account URL.
container             required   Container name.
sasToken              required   SAS token granting read access to the container.
spoolToTemp           true       If true, the fetched blob is spooled to a temp file before parsing.
extractUserMetadata   true       If true, blob user-metadata is copied into the parsed Metadata.

Azure Blob Emitter (az-blob-emitter)

Writes parsed results to an Azure Storage container. The emit key (relative to prefix) is derived from the FetchEmitTuple.

{
  "emitters": {
    "aze": {
      "az-blob-emitter": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-output",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rwl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "prefix": "results/",
        "fileExtension": "json",
        "overwriteExisting": false
      }
    }
  }
}

Configuration

Field               Default      Description
-----------------   ----------   -----------
endpoint            required     Storage account URL (validated non-blank).
container           required     Destination container name (validated non-blank).
sasToken            required     SAS token granting read+write access (validated non-blank).
prefix              no default   Optional blob-name prefix. A trailing / is stripped automatically.
fileExtension       json         Extension appended to each emitted blob name.
overwriteExisting   false        If true, an existing blob with the same name is overwritten; otherwise the emit fails.
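To make the interaction of prefix and fileExtension concrete, here is a small sketch of how a destination blob name could be composed from the fields above. The exact joining rules are an assumption based on the field descriptions (prefix joined with a single /, extension appended after a dot), and emitted_blob_name is a hypothetical helper, not a plugin method.

```python
def emitted_blob_name(emit_key, prefix=None, file_extension="json"):
    """Compose a destination blob name from prefix, emit key, and extension."""
    name = emit_key
    if prefix:
        # Strip one trailing "/" from the prefix, then join with a single "/".
        clean_prefix = prefix[:-1] if prefix.endswith("/") else prefix
        if clean_prefix:
            name = clean_prefix + "/" + emit_key
    if file_extension:
        # The extension is appended to the full name, after any existing suffix.
        name = name + "." + file_extension
    return name
```

Under these assumptions, an emit key of incoming/report.pdf with prefix results/ and the default extension would land at results/incoming/report.pdf.json in the destination container.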

Azure Blob Iterator (az-blob-pipes-iterator)

Lists blobs under a container/prefix and emits one FetchEmitTuple per blob.

{
  "pipes-iterator": {
    "az-blob-pipes-iterator": {
      "endpoint": "https://myaccount.blob.core.windows.net",
      "container": "tika-input",
      "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
      "prefix": "incoming/",
      "timeoutMillis": 360000,
      "fetcherId": "azf",
      "emitterId": "aze"
    }
  }
}

Configuration

Field                   Default    Description
---------------------   --------   -----------
endpoint                required   Storage account URL.
container               required   Container to enumerate.
sasToken                required   SAS token granting list+read access.
prefix                  ""         Blob-name prefix to scope the listing.
timeoutMillis           360000     Per-request timeout, in milliseconds (6 minutes by default).
fetcherId / emitterId   required   IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract.
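The relationship between the listing and the emitted tuples can be modeled in a few lines. This is an illustrative sketch only: TupleSketch is a simplified stand-in for FetchEmitTuple rather than Tika's class, and reusing the blob name as both fetch key and emit key is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleSketch:
    """Simplified stand-in for a FetchEmitTuple (not Tika's actual class)."""
    id: str
    fetcher_id: str
    fetch_key: str
    emitter_id: str
    emit_key: str


def tuples_for_blobs(blob_names, prefix, fetcher_id, emitter_id):
    """Yield one tuple per blob whose name starts with the prefix."""
    for name in blob_names:
        if name.startswith(prefix):
            # Assumption: the blob name serves as id, fetch key, and emit key.
            yield TupleSketch(name, fetcher_id, name, emitter_id, name)
```

With prefix set to incoming/, a blob named archive/b.pdf is skipped, while incoming/a.pdf yields a tuple bound to the configured fetcherId and emitterId. An empty prefix, the default, matches every blob in the container.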

Complete Pipeline Example

The example below wires the Azure Blob fetcher, emitter, and iterator together into a container-to-container pipeline.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "azf": {
      "az-blob-fetcher": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-input",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "extractUserMetadata": true
      }
    }
  },
  "emitters": {
    "aze": {
      "az-blob-emitter": {
        "endpoint": "https://myaccount.blob.core.windows.net",
        "container": "tika-output",
        "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rwl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
        "prefix": "results/",
        "fileExtension": "json"
      }
    }
  },
  "pipes-iterator": {
    "az-blob-pipes-iterator": {
      "endpoint": "https://myaccount.blob.core.windows.net",
      "container": "tika-input",
      "sasToken": "sv=2024-11-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&st=2024-01-01T00:00:00Z&spr=https&sig=REDACTED",
      "prefix": "incoming/",
      "fetcherId": "azf",
      "emitterId": "aze"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
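In a configuration like the one above, the iterator's fetcherId and emitterId must name entries that exist under fetchers and emitters. A minimal sketch of a pre-flight check over the parsed JSON follows; check_wiring is a hypothetical helper, and the key layout it expects is simply the one used in the examples on this page.

```python
def check_wiring(config):
    """Return a list of iterator references that point at undefined components."""
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    for name, settings in config.get("pipes-iterator", {}).items():
        for field, defined in (("fetcherId", fetchers), ("emitterId", emitters)):
            ref = settings.get(field)
            if ref not in defined:
                problems.append(f"{name}: {field} {ref!r} is not defined")
    return problems
```

Loading the file with json.load and passing the result to check_wiring returns an empty list when both IDs resolve, and one message per dangling reference otherwise.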

Notes

  • Every SAS token carries a fixed expiry time (the se field). For long-running pipelines, rotate the SAS or use a token that outlives the pipeline window.

  • Avoid checking real SAS tokens into source control — the strings in the examples above are placeholders.

  • Each component creates its own BlobServiceClient. The Azure SDK pools HTTP connections per client.
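One way to keep real tokens out of source control is to generate the configuration at deploy time from environment variables. The sketch below assumes hypothetical variable names (TIKA_AZ_READ_SAS and TIKA_AZ_WRITE_SAS) and reuses the placeholder endpoint and container names from the examples above.

```python
import json
import os


def build_config(env=os.environ):
    """Build fetcher and emitter config with SAS tokens read from the environment."""
    endpoint = "https://myaccount.blob.core.windows.net"
    # Hypothetical variable names; a missing variable raises KeyError early.
    # Any leading "?" is dropped because sasToken expects the bare query string.
    read_sas = env["TIKA_AZ_READ_SAS"].lstrip("?")
    write_sas = env["TIKA_AZ_WRITE_SAS"].lstrip("?")
    return {
        "fetchers": {
            "azf": {
                "az-blob-fetcher": {
                    "endpoint": endpoint,
                    "container": "tika-input",
                    "sasToken": read_sas,
                }
            }
        },
        "emitters": {
            "aze": {
                "az-blob-emitter": {
                    "endpoint": endpoint,
                    "container": "tika-output",
                    "sasToken": write_sas,
                }
            }
        },
    }


def render_config(env=os.environ):
    """Serialize the generated config as indented JSON text."""
    return json.dumps(build_config(env), indent=2)
```

The output of render_config() can be written to a file that is excluded from version control, while the committed template keeps only placeholder values.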