Advanced Topics
This section covers advanced usage and internals of Apache Tika.
Most pages here are written from a Java-API perspective. Where a topic
has a JSON-config or CLI equivalent, look first under
Configuration (per-parser options),
Tika Pipes (pipeline and Pipes-mode tuning),
Tika Server (REST and server config), or
Tika CLI (tika-app flags). The
Setting Limits page is the model: it
covers Java, JSON, and CLI side by side. If an advanced page does not yet
document its JSON or CLI equivalent, please file an issue against that page;
this helps us prioritize the gaps.
Topics
- Language Detection - Built-in bigram language detector, training pipeline, and comparison with OpenNLP
- Building the Language Detector - Full training log, decisions, and benchmark results vs OpenNLP
- Robustness - Process isolation and fault tolerance when parsing untrusted content
- Setting Limits - Limiting metadata size and resource consumption when processing untrusted content
- TikaInputStream and Spooling - Understanding how TikaInputStream handles buffering, caching, and spooling to disk
- Embedded Document Metadata - Understanding how Tika tracks embedded documents and their paths
- ZIP Detection and Salvaging - How Tika detects and recovers truncated ZIP-based files
- Running a Local VLM Server - Run an open-source VLM locally as an OpenAI-compatible endpoint for air-gapped OCR
Integration Testing
- Testing with Tika App - Integration testing strategies for Tika App
- Testing with Tika Server - Integration testing strategies for Tika Server