Tika Server

This section covers running Apache Tika as a REST server via tika-server.

Overview

Tika Server provides a RESTful HTTP interface for parsing documents and extracting content. It can be deployed as a standalone service or in a containerized environment.

Basic Usage

java -jar tika-server-standard.jar

The server starts on port 9998 by default.

Endpoints

Content Extraction (/tika)

The /tika endpoint extracts content from a document as plain text.

curl -T document.pdf http://localhost:9998/tika
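
For callers that prefer a script over curl, the same upload can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names and the localhost base URL are illustrative assumptions, and the explicit Content-Type is set only because urllib would otherwise label the body as form-encoded.

```python
import urllib.request

# Assumption: a Tika Server running locally on the default port 9998.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build the PUT request that `curl -T <file> <base_url>/tika` would send."""
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        # Without this header urllib defaults to
        # application/x-www-form-urlencoded, which misdescribes the body.
        headers={"Content-Type": "application/octet-stream"},
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Upload a file to /tika and return the response body as text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling extract_text("document.pdf") against a running server should return the same body as the curl command above.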

Markdown Output (/tika/md)

The /tika/md endpoint extracts content as Markdown, preserving document structure such as headings, lists, tables, and emphasis:

curl -T document.docx http://localhost:9998/tika/md

Custom Handler Type

Use the X-Tika-Handler header to control the output format. Valid values: text (default), html, xml, markdown, ignore.

curl -T document.pdf -H "X-Tika-Handler: markdown" http://localhost:9998/tika
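
The header-based variant could be wrapped in a small helper that rejects handler names outside the list above before any request is sent. This is a sketch under assumptions: the helper name and base URL are hypothetical, and only the header name and its valid values come from this page.

```python
import urllib.request

# Handler names listed in this section; "text" is the documented default.
VALID_HANDLERS = ("text", "html", "xml", "markdown", "ignore")


def build_handler_request(
    data: bytes,
    handler: str = "text",
    base_url: str = "http://localhost:9998",  # assumption: local default port
) -> urllib.request.Request:
    """Build a PUT request to /tika that selects the output format by header."""
    if handler not in VALID_HANDLERS:
        raise ValueError(
            f"unknown handler {handler!r}; expected one of {VALID_HANDLERS}"
        )
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={
            "X-Tika-Handler": handler,
            # Avoid urllib's form-encoded default for request bodies.
            "Content-Type": "application/octet-stream",
        },
    )
```

Validating on the client side keeps a misspelled handler name from reaching the server, where the resulting behavior is not described in this section.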

Recursive Metadata (/rmeta)

The /rmeta endpoint returns metadata for the container document and all embedded documents as a JSON array of metadata objects.

curl -T document.pdf http://localhost:9998/rmeta

The content handler can be specified in the URL path:

  • /rmeta/text - plain text content (default)

  • /rmeta/html - HTML content

  • /rmeta/xml - XHTML content

  • /rmeta/markdown - Markdown content

  • /rmeta/ignore - metadata only, no content

curl -T document.docx http://localhost:9998/rmeta/markdown
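
Because /rmeta returns a JSON array with one metadata object per document, the response is easy to post-process. The following sketch builds the handler-specific URL and summarizes each object. The function names are hypothetical, and the "X-TIKA:content" and "Content-Type" keys are assumptions based on commonly seen Tika output; verify them against an actual response from your server.

```python
import json
import urllib.request

RMETA_HANDLERS = ("text", "html", "xml", "markdown", "ignore")
# Assumption: extracted content is stored under this metadata key.
CONTENT_KEY = "X-TIKA:content"


def rmeta_url(handler: str = "text", base_url: str = "http://localhost:9998") -> str:
    """Return the /rmeta URL for one of the handlers listed above."""
    if handler not in RMETA_HANDLERS:
        raise ValueError(f"unknown handler {handler!r}")
    return f"{base_url}/rmeta/{handler}"


def summarize(rmeta_json: str) -> list:
    """Summarize each metadata object: position, type, and content length."""
    documents = json.loads(rmeta_json)
    return [
        {
            "index": index,  # 0 is the container; later entries are embedded
            "content_type": doc.get("Content-Type"),
            "content_length": len(doc.get(CONTENT_KEY) or ""),
        }
        for index, doc in enumerate(documents)
    ]


def fetch_rmeta(path: str, handler: str = "text") -> list:
    """Upload a file to /rmeta/<handler> and summarize the JSON response."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            rmeta_url(handler),
            data=fh.read(),
            method="PUT",
            headers={"Content-Type": "application/octet-stream"},
        )
    with urllib.request.urlopen(request) as response:
        return summarize(response.read().decode("utf-8"))
```

With the ignore handler, content_length would be 0 for every entry, since only metadata is returned.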

Security Configuration

Config Endpoint Protection

By default, the /config endpoints that expose server configuration are disabled for security reasons. These endpoints can reveal sensitive information about your server configuration, including parser settings and system properties (see CVE-2015-3271).

The protected endpoints include:

  • /config - Returns the server’s full configuration

  • /config/parsers - Returns configured parsers

  • /config/detectors - Returns configured detectors

  • /config/mimeTypes - Returns MIME type mappings
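
One way to confirm that a deployment does not expose these endpoints is to probe each path and report any that answer successfully. This is a rough sketch: the function names are hypothetical, and because this page does not state which status a disabled endpoint returns, anything other than HTTP 200 is treated as "not exposed".

```python
import urllib.error
import urllib.request

# The protected paths listed above.
CONFIG_PATHS = ("/config", "/config/parsers", "/config/detectors", "/config/mimeTypes")


def probe(url, opener=urllib.request.urlopen, timeout=5.0):
    """Return the HTTP status of a GET request, or None if unreachable."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError, its parent class.
        return err.code
    except urllib.error.URLError:
        return None


def exposed_config_endpoints(base_url="http://localhost:9998", opener=urllib.request.urlopen):
    """List the config paths that respond with HTTP 200."""
    return [path for path in CONFIG_PATHS if probe(base_url + path, opener) == 200]


if __name__ == "__main__":
    exposed = exposed_config_endpoints()
    if exposed:
        print("Config endpoints are reachable:", ", ".join(exposed))
    else:
        print("No config endpoints responded with HTTP 200.")
```

The opener parameter exists only so the check can be exercised without a live server; in normal use the default urlopen is sufficient.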

Enabling Config Endpoints

To enable these endpoints, set enableUnsecureFeatures to true in the server section of the configuration:

{
  "server": {
    "enableUnsecureFeatures": true
  }
}

Enable enableUnsecureFeatures only if access to Tika Server is already secured through network controls (firewalls, private subnets), a reverse proxy (nginx, Apache httpd), or two-way TLS authentication. Exposing the config endpoints to untrusted networks can help attackers identify vulnerabilities and craft targeted attacks.

Command Line Usage

You can also enable unsecure features from the command line:

java -jar tika-server-standard.jar --enableUnsecureFeatures

Security Best Practices

  1. Keep config endpoints disabled in production (default behavior)

  2. Use network controls to restrict access to the Tika Server (firewall rules, private subnets)

  3. Consider TLS for encrypted communication - see TLS Configuration

  4. Run with minimal privileges - don’t run Tika Server as root

  5. Monitor logs for unusual access patterns