Tika Server

This section covers running Apache Tika as a REST server via tika-server.

Overview

Tika Server provides a RESTful HTTP interface for parsing documents and extracting content. It can be deployed as a standalone service or in a containerized environment.

In Tika 4.x, the main content-extraction endpoints — /tika, /rmeta, and /unpack — parse in forked child processes via the Tika Pipes infrastructure. This provides process isolation (a parser crash or OOM in a child cannot take down the request-handling process) at the cost of requiring a Pipes configuration. A few endpoints (notably /meta) still parse in-process in the request-handling JVM; treat those as best-effort under load. See Migrating Tika Server to 4.x for the full breaking-change list when upgrading from 3.x.

Basic Usage

java -jar tika-server-standard-X.Y.Z.jar

The server starts on localhost:9998 by default.

Command Line Options

Option                                Description
-h <host> or --host <host>            Hostname to bind to. Default localhost. Use * to bind to all interfaces.
-p <port> or --port <port>            Listen port. Default 9998.
-c <file> or --config <file>          Path to tika-config.json. See Configuration below.
-a <file> or --pluginsConfig <file>   Path to the Tika Pipes plugins configuration file.
-i <id> or --id <id>                  Server ID, surfaced in the /status endpoint and in logs.
-? or --help                          Print the usage message.

Other behavior — enableUnsecureFeatures, CORS, TLS, timeouts — is configured in the JSON config file (see Configuration), not via CLI flags.
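As a rough illustration of how the flags above combine, the following Python sketch assembles a launch command. `build_server_command` and `start_server` are hypothetical helpers, not part of Tika, and the jar name, config path, and server ID passed to them are placeholders.

```python
import subprocess


def build_server_command(jar, host=None, port=None, config=None, server_id=None):
    """Assemble the argv list for launching tika-server (hypothetical helper)."""
    cmd = ["java", "-jar", jar]
    # Only emit a flag when the caller overrides the server default.
    if host is not None:
        cmd += ["-h", host]
    if port is not None:
        cmd += ["-p", str(port)]
    if config is not None:
        cmd += ["-c", config]
    if server_id is not None:
        cmd += ["-i", server_id]
    return cmd


def start_server(cmd):
    """Start the server as a child process; the caller must terminate it."""
    return subprocess.Popen(cmd)
```

Passing the arguments as a list rather than a shell string means a host value of `*` reaches the server literally instead of being expanded by the shell.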

Endpoints

For the canonical endpoint inventory, including the PUT vs POST split and the multipart-config pattern introduced in 4.x, see the New /tika Endpoint Structure section of the migration guide. The most-used endpoints are summarized below.

Content Extraction (/tika)

Simple PUT — the entire request body is the document, no metadata:

# Default: raw XHTML
curl -T document.pdf http://localhost:9998/tika

# Explicit handler
curl -T document.pdf http://localhost:9998/tika/text
curl -T document.docx http://localhost:9998/tika/html
curl -T document.docx http://localhost:9998/tika/md
curl -T document.pdf http://localhost:9998/tika/json

POST with multipart for custom per-request configuration:

curl -X POST http://localhost:9998/tika/json \
  -F "file=@document.pdf" \
  -F "config={\"pdf-parser\":{\"ocrStrategy\":\"no_ocr\"}};type=application/json"
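For clients that cannot shell out to curl, the same multipart request can be sketched with the Python standard library. This is a minimal sketch, not an official client: `build_multipart_request` is a hypothetical helper, the part names `file` and `config` simply mirror the curl example, and the `application/octet-stream` type on the file part is an assumption.

```python
import json
import urllib.request
import uuid


def build_multipart_request(base_url, handler, filename, file_bytes, config):
    """Build a POST to /tika/<handler> carrying 'file' and 'config' parts."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    parts = [
        delimiter,
        b'Content-Disposition: form-data; name="file"; filename="'
        + filename.encode("utf-8") + b'"',
        b"Content-Type: application/octet-stream",  # assumption: generic type
        b"",
        file_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="config"',
        b"Content-Type: application/json",
        b"",
        json.dumps(config).encode("utf-8"),
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    return urllib.request.Request(
        f"{base_url}/tika/{handler}",
        data=b"\r\n".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

The returned request can be sent with `urllib.request.urlopen(req)` once a server is listening.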

Valid handler paths under /tika/: text, html, xml, md, json. For the JSON variant, you can also nest a handler — /tika/json/text, /tika/json/html, etc. — to choose the content-field format inside the JSON envelope; that nested handler accepts the full set (text, html, xml, md, markdown, body, ignore).
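The path rules above can be captured in a small client-side validator. The sketch below is illustrative only: `tika_path` is a hypothetical helper, and the two sets restate the handler names listed in the preceding paragraph.

```python
TOP_LEVEL_HANDLERS = {"text", "html", "xml", "md", "json"}
NESTED_JSON_HANDLERS = {"text", "html", "xml", "md", "markdown", "body", "ignore"}


def tika_path(handler=None, nested=None):
    """Return the request path under /tika, validating the handler names."""
    if handler is None:
        if nested is not None:
            raise ValueError("a nested handler requires handler='json'")
        return "/tika"
    if handler not in TOP_LEVEL_HANDLERS:
        raise ValueError(f"unknown handler: {handler!r}")
    if nested is None:
        return f"/tika/{handler}"
    # Only the JSON variant accepts a second path segment.
    if handler != "json":
        raise ValueError("only /tika/json accepts a nested handler")
    if nested not in NESTED_JSON_HANDLERS:
        raise ValueError(f"unknown nested handler: {nested!r}")
    return f"/tika/json/{nested}"
```

Rejecting unknown names on the client turns a typo into an immediate `ValueError` instead of a failed HTTP round trip.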

X-Tika-Handler header

For the root /tika PUT endpoint you can also pick the handler with a header:

curl -T document.pdf -H "X-Tika-Handler: markdown" http://localhost:9998/tika

Accepted values: text, html, xml, markdown (or md), body, ignore.
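A Python equivalent of the header-based call might look like the sketch below. `build_put_with_handler` is a hypothetical helper. The explicit `Content-Type` is an assumption added here because `urllib` otherwise labels request bodies as form-encoded, which is not what the curl example sends.

```python
import urllib.request

HEADER_HANDLERS = {"text", "html", "xml", "markdown", "md", "body", "ignore"}


def build_put_with_handler(base_url, document_bytes, handler):
    """Build a PUT to the root /tika endpoint with an X-Tika-Handler header."""
    if handler not in HEADER_HANDLERS:
        raise ValueError(f"unsupported X-Tika-Handler value: {handler!r}")
    return urllib.request.Request(
        f"{base_url}/tika",
        data=document_bytes,
        method="PUT",
        headers={
            "X-Tika-Handler": handler,
            # Assumption: a neutral type, so urllib does not default to
            # application/x-www-form-urlencoded.
            "Content-Type": "application/octet-stream",
        },
    )
```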

Recursive Metadata (/rmeta)

Returns metadata for the container document and all embedded documents as a JSON array of metadata objects. The handler controls the content field of each entry:

curl -T document.pdf http://localhost:9998/rmeta            # default: text
curl -T document.pdf http://localhost:9998/rmeta/text
curl -T document.pdf http://localhost:9998/rmeta/html
curl -T document.pdf http://localhost:9998/rmeta/xml
curl -T document.docx http://localhost:9998/rmeta/markdown  # or /md
curl -T document.pdf http://localhost:9998/rmeta/ignore     # metadata only
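A client typically wants to separate the container entry from the embedded ones. The sketch below makes two assumptions carried over from earlier Tika releases and not confirmed by this page: that the first array element describes the container document, and that extracted content is stored under the key `X-TIKA:content`. `split_rmeta` and `content_of` are hypothetical helpers.

```python
import json

# Assumption: content key name as used by earlier Tika releases.
CONTENT_KEY = "X-TIKA:content"


def split_rmeta(response_text):
    """Split an /rmeta JSON array into (container, embedded_entries)."""
    entries = json.loads(response_text)
    if not isinstance(entries, list) or not entries:
        raise ValueError("expected a non-empty JSON array of metadata objects")
    # Assumption: the container document comes first.
    return entries[0], entries[1:]


def content_of(entry):
    """Return the extracted content of one entry, or None if absent."""
    return entry.get(CONTENT_KEY)
```

With the /rmeta/ignore handler, `content_of` would be expected to return `None` for every entry, since that handler requests metadata only.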

Metadata only (/meta)

Returns container-document metadata only (no recursive embedded list, no content):

curl -T document.pdf http://localhost:9998/meta
curl -T document.pdf http://localhost:9998/meta/Content-Type   # single field
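When the field name is not known in advance, it is safer to percent-encode it before placing it in the path. The helper below is a hypothetical sketch. It assumes the server decodes the path segment before looking up the field, which this page does not state.

```python
from urllib.parse import quote


def meta_url(base_url, field=None):
    """Return the /meta URL, optionally narrowed to a single metadata field."""
    if field is None:
        return f"{base_url}/meta"
    # safe="" also encodes "/" so the field stays a single path segment.
    return f"{base_url}/meta/{quote(field, safe='')}"
```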

Other endpoints

  • /version — server version

  • /status — health/status (includes server ID)

  • /parsers and /parsers/details — registered parsers

  • /detectors — registered detectors

  • /mime-types — known MIME types

  • /detect/stream — type detection only (no parsing)

  • /language/stream, /language/string — language detection

  • /translate/all/{translator}/{src}/{dest} — translation

  • /pipes, /async — Pipes-based bulk processing
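The lightweight endpoints above, such as /version or /status, are convenient targets for a readiness check before sending documents. The following is a minimal sketch of such a poll loop. The retry count, the delay, and the choice to treat only HTTP 200 as ready are assumptions, and `http_ok` and `wait_until_ready` are hypothetical helpers.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses; so are socket errors.
        return False


def wait_until_ready(url, attempts=30, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll url until the probe succeeds or the attempts are used up."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call such as `wait_until_ready("http://localhost:9998/version")` blocks for at most roughly `attempts * (delay + timeout)` seconds. The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.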

Configuration

Server behavior beyond host/port is controlled by a JSON config file passed via -c/--config. The server section in that file maps to fields on TikaServerConfig; commonly set fields include:

Field                   Default            Description
enableUnsecureFeatures  false              Enable the /config family of endpoints (see Security Configuration).
cors                    "" (off)           * to allow any origin, or an explicit origin string. Empty disables CORS.
returnStackTrace        false              Include parser stack traces in error responses. Useful in dev, dangerous in production (leaks internals).
digest                  "" (off)           Compute a digest of the parsed bytes. Comma-separated algorithm names: md5, sha1, sha256, sha384, sha512.
digestMarkLimit         20971520 (20 MiB)  Max bytes buffered for digest computation.
logLevel                inherited          debug or info to override the runtime log level.
idBase                  random UUID        Override the auto-generated server ID (the -i CLI flag is the same setting).

For the full Pipes-related sections (pipes, fetchers, emitters, parse-context) that tika-server 4.x requires, see Configuration Changes.
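To make the server fields above concrete, the sketch below generates a config file containing only a server section. It deliberately omits the Pipes-related sections that a real 4.x deployment also needs. Encoding `digest` as a single comma-separated string is an assumption based on the field description, and `server_config` and `write_config` are hypothetical helpers.

```python
import json

ALLOWED_DIGESTS = ("md5", "sha1", "sha256", "sha384", "sha512")


def server_config(cors="", return_stack_trace=False, digests=(),
                  digest_mark_limit=20 * 1024 * 1024):
    """Build a dict holding only the 'server' section of the config file."""
    for name in digests:
        if name not in ALLOWED_DIGESTS:
            raise ValueError(f"unsupported digest algorithm: {name!r}")
    return {
        "server": {
            "cors": cors,
            "returnStackTrace": return_stack_trace,
            # Assumption: one comma-separated string; empty means off.
            "digest": ",".join(digests),
            "digestMarkLimit": digest_mark_limit,
        }
    }


def write_config(path, config):
    """Write the config dict as JSON, ready to pass via -c/--config."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
```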

Security Configuration

Config Endpoint Protection

The /config family of endpoints, which exposes server configuration, is disabled by default. These endpoints can reveal sensitive information about your server, including parser settings and system properties (see CVE-2015-3271).

Protected endpoints include:

  • /tika/config and /tika/config/{text,html,xml,md,json} — POST with multipart config

  • /rmeta/config — POST with multipart config

  • /meta/config — POST with multipart config

Enabling Config Endpoints

The setting is JSON-only — there is no CLI flag. Set enableUnsecureFeatures in your config file’s server section:

{
  "server": {
    "enableUnsecureFeatures": true
  }
}

Only enable enableUnsecureFeatures if you have secured access to Tika Server through network controls (firewalls, private subnets), a reverse proxy (nginx, Apache httpd), or 2-way TLS authentication. Exposing config endpoints to untrusted networks can help attackers identify vulnerabilities and craft targeted attacks.

Security Best Practices

  1. Keep config endpoints disabled in production (default behavior).

  2. Use network controls to restrict access (firewall rules, private subnets).

  3. Consider TLS for encrypted communication — see TLS Configuration.

  4. Run with minimal privileges — don’t run Tika Server as root.

  5. Monitor logs for unusual access patterns.
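Some of these practices can be checked mechanically before deployment. The sketch below flags the two server settings this page describes as risky in production. It is a hypothetical pre-deployment check, not a Tika feature, and it inspects only the server section of an already-parsed config.

```python
def audit_server_config(config):
    """Return a list of warnings for risky settings in the 'server' section."""
    server = config.get("server", {})
    findings = []
    if server.get("enableUnsecureFeatures", False):
        findings.append(
            "enableUnsecureFeatures is true: the /config endpoints are enabled"
        )
    if server.get("returnStackTrace", False):
        findings.append(
            "returnStackTrace is true: error responses may leak internals"
        )
    return findings
```

Running it on the result of `json.load` for the file passed via -c/--config and failing the deployment when the list is non-empty is one way to keep these defaults from drifting.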