Configuration
This section covers configuring Apache Tika.
Overview
Tika 4.x uses JSON configuration files. Configuration controls parsers, detectors, content handlers, server behavior, and the Tika Pipes pipeline.
Tika 3.x and earlier used XML configuration (tika-config.xml). See the
Migration Guide for details on converting to JSON.
Top-level JSON structure
A tika-config.json is a single JSON object whose keys are the top-level sections
listed below. Every section is optional — omit what you don’t need. Defaults are
used wherever a section is missing.
{
  "parsers": [ /* parser declarations */ ],
  "detectors": [ /* detector declarations */ ],
  "encoding-detectors": [ /* encoding detector declarations */ ],
  "content-handler-factory": { /* handler type for emitted content */ },
  "parse-context": {
    "timeout-limits": { /* progress + total task timeouts */ },
    "unpack-config": { /* embedded-byte extraction */ }
    /* other SelfConfiguring components by component name */
  },
  "server": { /* tika-server options: enableUnsecureFeatures, cors, ... */ },
  "pipes": { /* Pipes process management: numClients, parseMode, ... */ },
  "fetchers": { /* named fetcher instances */ },
  "emitters": { /* named emitter instances */ },
  "pipes-iterator": { /* iterator (one per pipeline) */ },
  "pipes-reporters": { /* per-document status reporters */ },
  "plugin-roots": "/path/to/plugins"
}
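As a quick sanity check, the top-level keys of a config file can be compared against the section names listed above. The following is a minimal sketch, not part of Tika: the helper name check_top_level and the sample config are hypothetical, and how Tika's own loader treats unrecognized keys is not covered here. It uses Python's standard json module, which rejects the /* ... */ placeholder comments shown in the skeleton, so it assumes a file without them.

```python
import json

# Top-level section names copied from the structure listed above.
KNOWN_SECTIONS = {
    "parsers", "detectors", "encoding-detectors", "content-handler-factory",
    "parse-context", "server", "pipes", "fetchers", "emitters",
    "pipes-iterator", "pipes-reporters", "plugin-roots",
}


def check_top_level(config_text):
    """Return (present, missing, unknown) top-level section names, each sorted.

    Hypothetical helper for illustration; it only inspects key names and
    does not validate the contents of any section.
    """
    config = json.loads(config_text)
    if not isinstance(config, dict):
        raise ValueError("tika-config.json must be a single JSON object")
    keys = set(config)
    return (
        sorted(keys & KNOWN_SECTIONS),   # sections the file sets explicitly
        sorted(KNOWN_SECTIONS - keys),   # omitted sections (defaults apply)
        sorted(keys - KNOWN_SECTIONS),   # likely typos or unsupported keys
    )


# Hypothetical sample: one valid section and one misspelled key ("parser").
sample = '{"plugin-roots": "/path/to/plugins", "parser": []}'
present, missing, unknown = check_top_level(sample)
print("present:", present)
print("unknown:", unknown)
```

Because every section is optional, the names in the missing list are not errors; they are simply the sections that fall back to defaults. The unknown list is the useful part, since a misspelled section name would otherwise be easy to overlook.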
Per-section documentation:
-
parsers, detectors, encoding-detectors, content-handler-factory, parse-context: covered below under Topics.
-
server: see Tika Server.
-
pipes, fetchers, emitters, pipes-iterator, pipes-reporters, plugin-roots: see Pipes Configuration and Tika Pipes.
Topics
Parser Configuration
-
PDFParser — PDF parsing options
-
TesseractOCRParser — OCR options for image-based text extraction
-
Tess4J OCR Parser — in-process OCR via tess4j JNI bindings
-
VLM Parsers — Claude, Gemini, OpenAI, Ollama, vLLM
-
External Parser — wrap external tools (ffmpeg, exiftool, etc.)
Other Configuration
-
Digesters — Computing cryptographic hashes of documents
-
Encoding Detectors — Configuring charset/encoding detection