Tika Server
This section covers running Apache Tika as a REST server via tika-server.
Overview
Tika Server provides a RESTful HTTP interface for parsing documents and extracting content. It can be deployed as a standalone service or in a containerized environment.
In Tika 4.x, the main content-extraction endpoints — /tika, /rmeta, and
/unpack — parse in forked child processes via the Tika Pipes infrastructure.
This provides process isolation (a parser crash or OOM in a child cannot take
down the request-handling process) at the cost of requiring a Pipes
configuration. A few endpoints (notably /meta) still parse in-process in the
request-handling JVM; treat those as best-effort under load. See
Migrating Tika Server to 4.x
for the full breaking-change list when upgrading from 3.x.
Basic Usage
java -jar tika-server-standard-X.Y.Z.jar
The server starts on localhost:9998 by default.
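Scripts that launch the server usually need to wait until it accepts requests. The following is a minimal sketch, assuming curl is installed and that polling the /version endpoint (listed under Other endpoints) is an acceptable readiness check. The helper name wait_for_tika and the retry count are illustrative and not part of Tika.

```shell
#!/bin/sh
# wait_for_tika is a hypothetical helper, not something shipped with Tika.
# It polls <base_url>/version once per second until the server answers
# or the attempt budget runs out.
wait_for_tika() {
  base_url="${1:-http://localhost:9998}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failures; -s: no progress output
    if curl -fs "${base_url}/version" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (not run here; X.Y.Z is the placeholder version from above):
#   java -jar tika-server-standard-X.Y.Z.jar &
#   wait_for_tika http://localhost:9998 60 || echo "server did not start" >&2
```

The function returns non-zero when the server never answers, so a calling script can abort instead of sending documents to a server that is not up.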
Command Line Options
| Option | Description |
|---|---|
| | Hostname to bind to. Default: localhost. |
| | Listen port. Default: 9998. |
| -c/--config | Path to the JSON config file. |
| | Path to the Tika Pipes plugins configuration file. |
| | Server ID, surfaced in the /status response. |
| | Print the usage message. |
Other behavior — enableUnsecureFeatures, CORS, TLS, timeouts — is configured
in the JSON config file (see Configuration), not via CLI flags.
Endpoints
For the canonical endpoint inventory, including the PUT vs POST split and the
multipart-config pattern introduced in 4.x, see the
New /tika Endpoint Structure
section of the migration guide. The most-used endpoints are summarized below.
Content Extraction (/tika)
Simple PUT — the entire request body is the document, no metadata:
# Default: raw XHTML
curl -T document.pdf http://localhost:9998/tika
# Explicit handler
curl -T document.pdf http://localhost:9998/tika/text
curl -T document.docx http://localhost:9998/tika/html
curl -T document.docx http://localhost:9998/tika/md
curl -T document.pdf http://localhost:9998/tika/json
POST with multipart for custom per-request configuration:
curl -X POST http://localhost:9998/tika/json \
-F "file=@document.pdf" \
-F "config={\"pdf-parser\":{\"ocrStrategy\":\"no_ocr\"}};type=application/json"
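Escaping JSON inside a shell string is easy to get wrong. One way around it, sketched here under assumptions, is to keep the per-request config in a file and let curl read it with its `name=<file` form syntax, which sends the file's contents as a plain form field rather than as an upload. The file name tika-request-config.json and the wrapper parse_with_config are hypothetical, and the sketch assumes the server treats this field the same way as the inline version above.

```shell
#!/bin/sh
# Write the per-request parser config to a file
# (same JSON as the inline example above).
cat > tika-request-config.json <<'EOF'
{"pdf-parser": {"ocrStrategy": "no_ocr"}}
EOF

# parse_with_config is a hypothetical wrapper.
#   $1: document to parse
#   $2: optional base URL (defaults to the local server)
parse_with_config() {
  curl -X POST "${2:-http://localhost:9998}/tika/json" \
    -F "file=@$1" \
    -F "config=<tika-request-config.json;type=application/json"
}

# Usage (requires a running server):
#   parse_with_config document.pdf
```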
Valid handler paths under /tika/: text, html, xml, md, json. For
the JSON variant, you can also nest a handler — /tika/json/text,
/tika/json/html, etc. — to choose the content-field format inside the JSON
envelope; that nested handler accepts the full set (text, html, xml,
md, markdown, body, ignore).
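The handler rules above can be encoded in a small helper so that scripts fail early on an invalid path instead of sending a request the server will reject. This is a sketch only: tika_path is a hypothetical function name, and the accepted values are copied from the lists in the paragraph above.

```shell
#!/bin/sh
# tika_path is a hypothetical helper that builds a /tika request path.
#   $1: top-level handler (text, html, xml, md, json), or empty for the default
#   $2: optional nested content handler, only valid under json
# Prints the path on success; returns 1 for any combination not listed above.
tika_path() {
  handler="${1:-}"
  nested="${2:-}"
  case "$handler" in
    "") printf '/tika\n'; return 0 ;;
    text|html|xml|md|json) ;;
    *) return 1 ;;
  esac
  if [ -z "$nested" ]; then
    printf '/tika/%s\n' "$handler"
    return 0
  fi
  # Only the JSON variant accepts a nested content handler.
  [ "$handler" = "json" ] || return 1
  case "$nested" in
    text|html|xml|md|markdown|body|ignore)
      printf '/tika/json/%s\n' "$nested" ;;
    *) return 1 ;;
  esac
}

# Usage (requires a running server):
#   curl -T document.pdf "http://localhost:9998$(tika_path json text)"
```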
Recursive Metadata (/rmeta)
Returns metadata for the container document and all embedded documents as a JSON array of metadata objects. The handler controls the content field of each entry:
curl -T document.pdf http://localhost:9998/rmeta # default: text
curl -T document.pdf http://localhost:9998/rmeta/text
curl -T document.pdf http://localhost:9998/rmeta/html
curl -T document.pdf http://localhost:9998/rmeta/xml
curl -T document.docx http://localhost:9998/rmeta/markdown # or /md
curl -T document.pdf http://localhost:9998/rmeta/ignore # metadata only
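Because /rmeta returns a JSON array with one metadata object per document, the response is easy to post-process. The sketch below assumes jq is installed and that each entry carries a Content-Type field (the same field name used in the /meta example). The helper name rmeta_types is hypothetical.

```shell
#!/bin/sh
# rmeta_types is a hypothetical helper. It reads an /rmeta JSON array on
# stdin and prints one Content-Type per entry, or "unknown" when an entry
# has no such field. Requires jq.
rmeta_types() {
  jq -r '.[] | .["Content-Type"] // "unknown"'
}

# Usage (requires a running server):
#   curl -s -T document.pdf http://localhost:9998/rmeta/ignore | rmeta_types
```

Requesting /rmeta/ignore keeps the response small when only the metadata is needed.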
Metadata Only (/meta)
Returns container-document metadata only (no recursive embedded list, no content):
curl -T document.pdf http://localhost:9998/meta
curl -T document.pdf http://localhost:9998/meta/Content-Type # single field
Other Endpoints
- /version: server version
- /status: health/status (includes server ID)
- /parsers and /parsers/details: registered parsers
- /detectors: registered detectors
- /mime-types: known MIME types
- /detect/stream: type detection only (no parsing)
- /language/stream, /language/string: language detection
- /translate/all/{translator}/{src}/{dest}: translation
- /pipes, /async: Pipes-based bulk processing
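As an illustration of the detection endpoint, the sketch below uploads a file to /detect/stream and prints the detected type. It assumes the endpoint accepts a PUT upload in the same way as the /tika examples, and the helper name tika_detect is hypothetical.

```shell
#!/bin/sh
# tika_detect is a hypothetical helper.
#   $1: file to identify
#   $2: optional base URL (defaults to the local server)
# Returns 2 without contacting the server when the file does not exist.
tika_detect() {
  if [ ! -f "$1" ]; then
    echo "no such file: $1" >&2
    return 2
  fi
  # -T uploads the file as the request body (PUT), as in the /tika examples.
  curl -s -T "$1" "${2:-http://localhost:9998}/detect/stream"
}

# Usage (requires a running server):
#   tika_detect document.pdf
```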
Configuration
Server behavior beyond host/port is controlled by a JSON config file passed via
-c/--config. The server section in that file maps to fields on
TikaServerConfig; commonly-set fields include:
| Field | Default | Description |
|---|---|---|
| enableUnsecureFeatures | false | Enable the /config family of endpoints (see Security Configuration). |
| | | |
| | | Include parser stack traces in error responses. Useful in dev, dangerous in production (leaks internals). |
| | | Compute a digest of the parsed bytes. Comma-separated algorithm names: |
| | | Max bytes buffered for digest computation. |
| | inherited | |
| | random UUID | Override the auto-generated server ID (the |
For the full Pipes-related sections (pipes, fetchers, emitters, parse-context)
that tika-server 4.x requires, see
Configuration Changes.
Topics
- TLS/SSL Configuration: Secure your server with TLS and mutual authentication
- Migrating Tika Server to 4.x: Breaking changes from 3.x
Security Configuration
Config Endpoint Protection
By default, the /config family of endpoints, which expose server configuration,
is disabled. These endpoints can reveal sensitive information about your server,
including parser settings and system properties (see
CVE-2015-3271).
Protected endpoints include:
- /tika/config and /tika/config/{text,html,xml,md,json}: POST with multipart config
- /rmeta/config: POST with multipart config
- /meta/config: POST with multipart config
Enabling Config Endpoints
The setting is JSON-only — there is no CLI flag. Set enableUnsecureFeatures in
your config file’s server section:
{
"server": {
"enableUnsecureFeatures": true
}
}
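A minimal sketch of putting this into practice: write the JSON above to a file and pass it with -c/--config. The file name tika-config.json is an assumption, and a real 4.x config also needs the Pipes-related sections described under Configuration, which are omitted here.

```shell
#!/bin/sh
# Write a config file containing only the server section shown above.
# A working 4.x setup also needs the Pipes-related sections (not shown).
cat > tika-config.json <<'EOF'
{
  "server": {
    "enableUnsecureFeatures": true
  }
}
EOF

# Start the server with that config (not run here; X.Y.Z is a placeholder):
#   java -jar tika-server-standard-X.Y.Z.jar -c tika-config.json
```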
Only enable enableUnsecureFeatures if you have secured access to Tika
Server through network controls (firewalls, private subnets), a reverse proxy
(nginx, Apache httpd), or
2-way TLS authentication. Exposing config endpoints
to untrusted networks can help attackers identify vulnerabilities and craft
targeted attacks.
Security Best Practices
- Keep config endpoints disabled in production (default behavior).
- Use network controls to restrict access (firewall rules, private subnets).
- Consider TLS for encrypted communication; see TLS Configuration.
- Run with minimal privileges: don't run Tika Server as root.
- Monitor logs for unusual access patterns.