Releasing Tika Docker Images

This guide covers releasing the official Apache Tika Docker images (apache/tika and apache/tika-grpc on Docker Hub).

Where the Dockerfiles live

Starting with 4.0.0-alpha-1, the Dockerfiles and the GitHub Actions workflow that publishes them live in this repository:

  • tika-server/docker-build/{minimal,full}/Dockerfile: apache/tika (server) release builds

  • tika-server/docker-build/{minimal,full}/Dockerfile.snapshot: nightly snapshot builds

  • tika-grpc/docker-build/Dockerfile: apache/tika-grpc release builds

  • .github/workflows/docker-release.yml: the release publishing workflow

  • .github/workflows/docker-snapshot.yml: the snapshot publishing workflow (runs automatically on push to main)

The legacy apache/tika-docker repository is still used for 3.x patch releases; see "3.x patch releases (legacy path)" below. New 4.x work happens here.

Image types

minimal

Apache Tika server with base dependencies (Java + the unpacked tika-server-standard-<v>.zip).

full

Adds Tesseract OCR, GDAL, ImageMagick, and Microsoft fonts.

apache/tika-grpc

The gRPC server packaged with parser-package jars and pipes plugin zips.

Prerequisites

  • You have committer permission on apache/tika (the GitHub repo). The Docker release workflow is gated to maintainers via the standard repo permission model — no separate Docker Hub credential is needed at trigger time; Docker Hub auth is held by the workflow as a secret.

  • The Tika release vote has passed and the artifacts have been moved from dist/dev to dist/release (i.e., the bin.zip and parser-package jars are already on dlcdn.apache.org/downloads.apache.org). The workflow downloads those artifacts during the build, so they must be live first.

  • The release tag (e.g. 4.0.0-alpha-1) exists in the repo. It is created during the upstream release: release:prepare tags the release candidate, and the release manager pushes the final tag once the vote passes.

Release process

Step 1: Verify the upstream artifacts are live

curl -sLI https://downloads.apache.org/tika/<TAG>/tika-server-standard-<TAG>.zip \
  | head -1

If you get a 200, you’re ready. If 404, the SVN move from dist/dev to dist/release hasn’t propagated yet — wait a few minutes.
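If the download host ever answers with a redirect, head -1 shows the status of the first response rather than the final one. A minimal sketch that prints only the final status code follows. The function names tika_artifact_url and artifact_status are illustrative and not part of the release tooling; the URL pattern is the one used in the command above.

```shell
# Build the download URL for the server zip of a given tag
# (same pattern as the curl command in Step 1).
tika_artifact_url() {
  printf 'https://downloads.apache.org/tika/%s/tika-server-standard-%s.zip\n' "$1" "$1"
}

# Print the HTTP status code of the last response after following redirects.
# -I sends a HEAD request, -o /dev/null discards the headers,
# -w '%{http_code}' prints only the final status code.
artifact_status() {
  curl -sIL -o /dev/null -w '%{http_code}\n' "$(tika_artifact_url "$1")"
}
```

For example, artifact_status 4.0.0-alpha-1 prints a single status code; proceed once it prints 200.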

Step 2: Trigger the Docker release workflow

The workflow has two trigger sources:

Auto-trigger on GA tag push. When the release manager pushes a digit-only-with-dots tag (e.g. 4.0.0, 10.20.30), the workflow fires automatically. Prerelease tags (4.0.0-rc1, 4.0.0-alpha-1, anything with a hyphen) and branch-style tags (branch_4x, anything with an underscore) are filtered out by tags-ignore: ['*-*', '*_*'] and stay silent.

A second validate-tag gating job enforces strict X.Y.Z shape on push triggers (defense-in-depth against odd tag names like wip that bypass the tags-ignore filter). It fails fast with a clear error before any build starts. It’s skipped for workflow_dispatch triggers, which are intentionally permissive — that path is used for prerelease publishes where the tag name won’t be GA-shaped.
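To predict how a pushed tag will be treated, the two layers described above can be approximated locally. This is a sketch of the rules as stated in this guide, not the workflow's actual implementation; classify_tag and its three labels are hypothetical names.

```shell
# Classify a git tag the way the push trigger is described above:
#   ignored  - contains a hyphen or underscore (dropped by the tags-ignore filter)
#   ga       - strict X.Y.Z, digits and dots only (workflow fires)
#   rejected - passes the filter but fails the validate-tag shape check
classify_tag() {
  case "$1" in
    *-*|*_*) echo "ignored"; return 0 ;;
  esac
  if printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "ga"
  else
    echo "rejected"
  fi
}
```

With this sketch, classify_tag 4.0.0-rc1 reports ignored, classify_tag 4.0.0 reports ga, and classify_tag wip reports rejected.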

The standard ASF release flow looks like:

  1. release:prepare creates X.Y.Z-rcN for the vote → workflow does not fire (hyphenated tag).

  2. Vote passes.

  3. The release manager creates the GA tag, e.g. git tag X.Y.Z X.Y.Z-rcN && git push origin X.Y.Z.

  4. That push triggers the Docker workflow. build_number defaults to 1.

Manual trigger via workflow_dispatch. Use this for any preview release (the auto-trigger ignores prerelease tags), or for any Docker-only rebuild where you need to bump build_number.

The workflow takes three inputs in this mode:

tag

The Tika release tag, e.g. 4.0.0-alpha-1. Must already exist as a git tag.

build_number

The Docker build number for this Tika tag. Use 1 for the initial publish; increment when re-publishing the same Tika version with Docker-only changes (CVE fixes in the base image, refreshed apt packages, etc.). See Re-publishing an existing Tika version (Docker-only rebuild) below for the full rebuild flow.

source_ref

Optional. Git ref to build from. Defaults to the value of tag. Override only for Docker-only rebuilds where the Dockerfile or other build inputs have changed since the original tag was cut — for example, when you’ve made Dockerfile updates on main after the GA release and want build 2 to pick them up.

Via the GitHub UI:

  1. Open https://github.com/apache/tika/actions

  2. Select Docker release - tika-server and tika-grpc in the left sidebar

  3. Click Run workflow (top-right)

  4. Fill in tag (e.g. 4.0.0-alpha-1) and build_number (e.g. 1)

  5. Click Run workflow

Via the gh CLI:

gh workflow run docker-release.yml \
  -f tag=4.0.0-alpha-1 \
  -f build_number=1
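If you trigger releases often, a small wrapper can catch a mistyped build number before anything is dispatched. The sketch below only prints the command so it can be reviewed first; docker_release_cmd is a hypothetical helper name, and the input names are the ones documented above.

```shell
# Print the gh command for a Docker release run without executing it.
# Arguments: <tag> <build_number> [source_ref]
docker_release_cmd() {
  tag=$1
  build=$2
  ref=${3:-}
  # build_number must be a positive integer.
  case "$build" in
    ''|*[!0-9]*|0)
      echo "build_number must be a positive integer" >&2
      return 1
      ;;
  esac
  cmd="gh workflow run docker-release.yml -f tag=$tag -f build_number=$build"
  # source_ref is optional; when omitted the workflow defaults it to the tag.
  if [ -n "$ref" ]; then
    cmd="$cmd -f source_ref=$ref"
  fi
  printf '%s\n' "$cmd"
}
```

Review the printed line, then paste it into the terminal to run it.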

Tag scheme

Each workflow run publishes three tags per image, all pointing at the same manifest digest:

Tag | Meaning | Moves on rebuild?
apache/tika:<tag> | Mutable rolling tag for this Tika version (e.g. apache/tika:4.0.0-alpha-1). | Yes, retagged to the new digest
apache/tika:<tag>-<N> | Immutable build pin (e.g. apache/tika:4.0.0-alpha-1-1 for the first build). Pin by this if you need stability across rebuilds. | No, never reassigned
apache/tika:latest | Mutable rolling tag for the newest stable Tika release. Pushed only for non-prerelease tags (i.e., no -alpha, -BETA, -RC). Stays on 3.x until 4.0.0 GA. | Yes, for stable releases only

The -full variants (<tag>-full, <tag>-<N>-full, latest-full) follow the same scheme. apache/tika-grpc also publishes the three-tag pattern, but its :latest is pushed unconditionally (no 3.x incumbent to protect).
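As a worked example of the scheme, the sketch below lists the tags one run would publish for the minimal apache/tika image. tika_docker_tags is a hypothetical helper, and it assumes that any tag containing a hyphen counts as a prerelease, which may be broader than the workflow's real check.

```shell
# List the Docker Hub tags for the minimal apache/tika image.
# Arguments: <tika_tag> <build_number>
# Assumption: a hyphen in the tag marks a prerelease, so :latest is skipped.
tika_docker_tags() {
  tag=$1
  build=$2
  echo "apache/tika:${tag}"            # mutable rolling tag
  echo "apache/tika:${tag}-${build}"   # immutable build pin
  case "$tag" in
    *-*) ;;                            # prerelease: leave :latest alone
    *) echo "apache/tika:latest" ;;    # stable release: move :latest
  esac
}
```

Running tika_docker_tags 4.0.0 2 lists three tags, while tika_docker_tags 4.0.0-alpha-1 1 lists only the rolling tag and the build pin.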

Re-publishing an existing Tika version (Docker-only rebuild)

When the Tika source hasn’t changed but you need a new Docker image — base image CVE, refreshed apt packages, Dockerfile fix — bump build_number instead of cutting a new Tika version.

The Tika git tag (e.g. 4.0.0) stays put. The -<N> suffix in apache/tika:4.0.0-2 is a Docker Hub tag only, never a git tag pushed by hand. The workflow auto-creates a 4.0.0-2 git tag at the same SHA it built from for provenance.

Case 1: pure base-image refresh (no Dockerfile changes — FROM ubuntu:resolute just picks up newer upstream layers).

gh workflow run docker-release.yml \
  -f tag=4.0.0 \
  -f build_number=2

source_ref defaults to the tag, so the workflow checks out at the original 4.0.0 source state.

Case 2: Dockerfile changes since the original release. Land the Dockerfile changes on main first (or on a branch). Then point the workflow at that ref:

gh workflow run docker-release.yml \
  -f tag=4.0.0 \
  -f build_number=2 \
  -f source_ref=main

In either case, the workflow:

  1. Builds from inputs.source_ref (or the original tag if unset).

  2. Publishes apache/tika:4.0.0-2 (immutable), retags apache/tika:4.0.0 and apache/tika:latest to the new digest, plus the matching -full and tika-grpc tags.

  3. Pushes a git tag 4.0.0-2 pointing at the source SHA used. The tags-ignore: ['*-*'] rule means this provenance tag does not re-trigger the workflow.

Six months later, git show 4.0.0-2 shows the exact source state for that build and docker pull apache/tika:4.0.0-2 returns the image built from it.

The provenance-tag step runs only when build_number != 1. The initial build’s source state is already marked by the original Tika git tag (e.g. 4.0.0); no need to duplicate it as 4.0.0-1.
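The naming rule for provenance tags can be sketched as follows. provenance_tag is a hypothetical helper that mirrors the rule described here, not code taken from the workflow.

```shell
# Print the provenance git tag for a Docker build, or nothing for build 1,
# whose source state is already marked by the Tika tag itself.
# Arguments: <tika_tag> <build_number>
provenance_tag() {
  tag=$1
  build=$2
  if [ "$build" -ne 1 ]; then
    printf '%s-%s\n' "$tag" "$build"
  fi
}
```

For instance, provenance_tag 4.0.0 2 prints 4.0.0-2, and git rev-parse "4.0.0-2^{commit}" then resolves the commit that build was made from.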

Step 3: Watch the run

A successful run takes ~30–45 minutes (multi-arch builds across linux/amd64, linux/arm64, linux/s390x are slow under qemu emulation, especially the full image).

  • GitHub UI: the Actions run page streams logs.

  • CLI: gh run watch follows a run's progress until it finishes; without a run ID it prompts you to choose a run.

The workflow does three things:

  1. Builds and pushes apache/tika:<TAG> (minimal, multi-arch).

  2. Builds and pushes apache/tika:<TAG>-full (full, multi-arch).

  3. Builds and pushes apache/tika-grpc:<TAG> (multi-arch).

Step 4: Verify the published images

# Confirm the manifest landed:
curl -sL "https://hub.docker.com/v2/repositories/apache/tika/tags/<TAG>/" \
  | python3 -c "import sys,json;d=json.load(sys.stdin);print(d.get('tag_last_pushed'), d.get('digest'))"

# Smoke-test the image locally:
docker pull apache/tika:<TAG>
docker run --rm -d --name tika-uat -p 127.0.0.1:9998:9998 apache/tika:<TAG>
sleep 12
curl -s http://localhost:9998/version
docker rm -f tika-uat
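The fixed sleep in the smoke test can be too short on a slow machine. As an alternative, a small polling helper retries a command once per second until it succeeds or a timeout expires. wait_until is a hypothetical name and not part of the repo.

```shell
# Retry a command once per second until it succeeds.
# Arguments: <timeout_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
wait_until() {
  timeout_s=$1
  shift
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

In the smoke test this could replace the sleep, for example: wait_until 60 curl -sf -o /dev/null http://localhost:9998/version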

For a deeper smoke test that exercises the full REST surface, run the REST UAT script (the same one tied into the e2e tests):

release-tools/uat/run-uat.sh http://localhost:9998

Both apache/tika:<TAG> and apache/tika:<TAG>-full should pass.

:latest tag policy

The apache/tika:latest and apache/tika:latest-full tags currently still point at the 3.x stable image (the latest-tagged 3.3.0 image published from the external apache/tika-docker repo).

The release workflow deliberately does not push :latest for 4.x alpha/beta/RC builds — those tags stay on 3.x until 4.0.0 GA. When 4.0.0 GA ships, edit docker-release.yml to re-add apache/tika:latest and apache/tika:latest-full to the tag lists.

apache/tika-grpc:latest is pushed on every 4.x release — the grpc image is new in 4.x and has no 3.x incumbent to protect.

3.x patch releases (legacy path)

Until 4.0.0 GA, any 3.x patch release (e.g. a 3.3.0.1 with a CVE fix) is still published from the legacy apache/tika-docker repository using its docker-tool.sh:

git clone https://github.com/apache/tika-docker
cd tika-docker

# Edit README.md (Available Tags), CHANGES.md, .env (TAG=...)
# Then commit + push

./docker-tool.sh build <DOCKER_VERSION> <TIKA_VERSION>
./docker-tool.sh test <DOCKER_VERSION>
./docker-tool.sh publish <DOCKER_VERSION> <TIKA_VERSION>

git tag -a <DOCKER_VERSION> -m "New release for <DOCKER_VERSION>"
git push --tags

Use the 3.x convention <TIKA_VERSION>.<DOCKER_BUILD_NUMBER> (e.g. 3.3.0.1 for the first Docker rebuild on top of Tika 3.3.0). 4.x releases drop that dotted scheme: they publish the bare <TIKA_VERSION> tag plus the <TIKA_VERSION>-<N> build pin described under Tag scheme.

Post-release

After the workflow completes: