Advanced Topics
This section covers advanced usage and internals of Apache Tika.
Topics
- Language Detection - Built-in bigram language detector, training pipeline, and comparison with OpenNLP
- Building the Language Detector - Full training log, decisions, and benchmark results vs OpenNLP
- Robustness - Process isolation and fault tolerance when parsing untrusted content
- Setting Limits - Limiting metadata size and resource consumption when processing untrusted content
- TikaInputStream and Spooling - Understanding how TikaInputStream handles buffering, caching, and spooling to disk
- Embedded Document Metadata - Understanding how Tika tracks embedded documents and their paths
- ZIP Detection and Salvaging - How Tika detects and recovers truncated ZIP-based files
- Running a Local VLM Server - Run an open-source VLM locally as an OpenAI-compatible endpoint for air-gapped OCR
Integration Testing
- Testing with Tika App - Integration testing strategies for Tika App
- Testing with Tika Server - Integration testing strategies for Tika Server