Tika Pipes
This section covers Tika Pipes for scalable, fault-tolerant document processing.
Overview
Tika Pipes provides a framework for fault-tolerant, scalable document processing. Each document is parsed in a forked JVM with configurable timeouts and memory limits, so a single malformed file cannot crash or hang your application.
While Tika Pipes has a programmatic Java API, it is best used through:
- tika-app: batch processing from the command line
- tika-server: REST API with pipes-based robustness built in
- tika-grpc: gRPC API with pipes-based robustness built in
See Robustness for details on how Tika Pipes protects against problematic files.
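The isolation idea described above (run risky work in a separate process, bound it with a timeout and a heap cap, and report the outcome) can be sketched in plain Java. This is a minimal illustration of the general technique only, not Tika Pipes' implementation or API: the class `IsolatedRunSketch`, the `Outcome` enum, and the `runIsolated` and `javaBinary` helpers are all hypothetical names, and the child here is just the `java` launcher rather than a real parser. It assumes Java 9 or later and that the launcher lives at `bin/java` under `java.home`.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of process isolation; not part of Tika.
public class IsolatedRunSketch {

    /** Possible outcomes of running one unit of work in a child process. */
    enum Outcome { OK, CRASHED, TIMED_OUT }

    /**
     * Runs the command in a separate OS process and waits up to the timeout.
     * A non-zero exit or a hang is reported to the caller instead of
     * affecting the calling JVM.
     */
    static Outcome runIsolated(List<String> command, Duration timeout)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard child output so a full pipe buffer can never block the child.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process child = builder.start();

        if (!child.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
            // The child is still running after the timeout: kill it.
            child.destroyForcibly();
            child.waitFor();
            return Outcome.TIMED_OUT;
        }
        return child.exitValue() == 0 ? Outcome.OK : Outcome.CRASHED;
    }

    /** Assumed location of the java launcher for the running JVM. */
    static String javaBinary() {
        return System.getProperty("java.home") + "/bin/java";
    }

    public static void main(String[] args) throws Exception {
        // -Xmx caps the child's heap, mirroring the "memory limits" idea.
        List<String> healthy = List.of(javaBinary(), "-Xmx64m", "-version");
        System.out.println(runIsolated(healthy, Duration.ofSeconds(30)));

        // An unrecognized JVM option makes the child exit with a non-zero code.
        List<String> broken = List.of(javaBinary(), "--no-such-option-for-sketch");
        System.out.println(runIsolated(broken, Duration.ofSeconds(30)));
    }
}
```

The point of the design is that the parent only ever observes an exit code or a timeout, so an out-of-memory error or an infinite loop in the child stays contained in the child.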
Topics
- Getting Started: complete working example with tika-app
- Fetchers: all available document sources (filesystem, S3, HTTP, GCS, Azure, etc.)
- Emitters: all available output destinations (filesystem, ES, OpenSearch, Solr, S3, Kafka, etc.)
- Iterators: document enumeration (directory walk, S3 listing, CSV, JDBC, Kafka, etc.)
- Reporters: track per-document processing status
- Pipeline Configuration: numClients, timeouts, JVM args, parse modes, emit batching
-
Parse Modes: control how documents are parsed and emitted (RMETA, CONCATENATE, CONTENT_ONLY, NO_PARSE, UNPACK)
- Extracting Embedded Bytes: extract raw bytes from embedded documents
- Timeouts: two-tier timeout system for handling long-running and hung parsers
Advanced Topics
- Shared Server Mode: experimental mode for reduced memory usage