Tika Pipes

This section covers Tika Pipes, a framework for scalable, fault-tolerant document processing.

Overview

Tika Pipes provides a framework for fault-tolerant, scalable document processing. Each document is parsed in a forked JVM with configurable timeouts and memory limits, so a single malformed file cannot crash or hang your application.

While Tika Pipes has a programmatic Java API, it is best used through:

tika-app — batch processing from the command line
tika-server — REST API with pipes-based robustness built in
tika-grpc — gRPC API with pipes-based robustness built in. It is more exposed by default than tika-server, so run it only on a trusted network (see Security).

See Robustness for details on how Tika Pipes protects against problematic files.
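
The isolation idea described in this overview (do the risky work in a separate JVM, bounded by a timeout and a heap limit) can be illustrated with plain JDK calls. The sketch below is not Tika Pipes code and does not use its API: the class name ForkedJvmSketch, the Outcome enum, and both helper methods are hypothetical, and the child JVM only runs "-version" as a stand-in for real parsing work.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; Tika Pipes manages its forked JVMs internally.
class ForkedJvmSketch {

    enum Outcome { OK, CRASHED, TIMED_OUT }

    // Builds a command line for a child JVM whose heap is capped with -Xmx.
    static List<String> childJvmCommand(String maxHeap, String... jvmArgs) {
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.add("-Xmx" + maxHeap);
        command.addAll(Arrays.asList(jvmArgs));
        return command;
    }

    // Runs the command in a separate process and waits at most timeoutMillis for it.
    static Outcome runIsolated(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // never block on a full pipe
                .start();
        if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            child.destroyForcibly();  // a hung child is killed; the parent keeps running
            child.waitFor();
            return Outcome.TIMED_OUT;
        }
        // A non-zero exit code (for example after an OutOfMemoryError) only affects the child.
        return child.exitValue() == 0 ? Outcome.OK : Outcome.CRASHED;
    }

    public static void main(String[] args) throws Exception {
        // "-version" stands in for real parsing work in the child JVM.
        System.out.println(runIsolated(childJvmCommand("64m", "-version"), 30_000));
        // An option the JVM rejects stands in for a child that dies on a bad file.
        System.out.println(runIsolated(childJvmCommand("64m", "--no-such-option"), 30_000));
    }
}
```

Because the parent only observes an exit code or a timeout, a crash or hang in the child is reduced to a status value that the caller can log, retry, or skip.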

Key Components

Fetchers — retrieve documents from various sources (filesystem, S3, HTTP, etc.)
Emitters — send parsed results to various destinations (filesystem, OpenSearch, ES-compatible, Solr, etc.)
Pipelines — configure processing workflows
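
How these components fit together can be sketched as a fetch, parse, emit loop. The interfaces below are deliberately simplified, hypothetical stand-ins and are not Tika's actual Fetcher or Emitter API; the parse method is a placeholder that only decodes bytes as UTF-8, and the class name PipesFlowSketch is invented for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the component roles; not Tika's real interfaces.
class PipesFlowSketch {

    // Retrieves the raw bytes of a document identified by a key.
    interface Fetcher {
        byte[] fetch(String fetchKey) throws Exception;
    }

    // Sends the parsed result for a document to some destination.
    interface Emitter {
        void emit(String emitKey, Map<String, String> result) throws Exception;
    }

    // Placeholder "parser": records the byte length and decodes the bytes as UTF-8.
    static Map<String, String> parse(byte[] bytes) {
        Map<String, String> result = new LinkedHashMap<>();
        result.put("length", String.valueOf(bytes.length));
        result.put("content", new String(bytes, StandardCharsets.UTF_8));
        return result;
    }

    // The pipeline: for each key, fetch the bytes, parse them, emit the result.
    static void process(List<String> keys, Fetcher fetcher, Emitter emitter) throws Exception {
        for (String key : keys) {
            emitter.emit(key, parse(fetcher.fetch(key)));
        }
    }

    public static void main(String[] args) throws Exception {
        // An in-memory map plays the role of a document source.
        Map<String, byte[]> source = Map.of("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        // A list plays the role of an output destination.
        List<String> emitted = new ArrayList<>();
        process(List.of("a.txt"), source::get, (key, result) -> emitted.add(key + " -> " + result));
        System.out.println(emitted);
    }
}
```

Keeping the source and the destination behind separate interfaces is what lets a pipeline pair any fetcher with any emitter, for example reading from S3 and writing to Solr.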

Topics

Getting Started — complete working example with tika-app
Fetchers — all available document sources (filesystem, S3, HTTP, GCS, Azure, etc.)
Emitters — all available output destinations (filesystem, ES, OpenSearch, Solr, S3, Kafka, etc.)
Iterators — document enumeration (directory walk, S3 listing, CSV, JDBC, Kafka, etc.)
Reporters — track per-document processing status
Pipeline Configuration — numClients, timeouts, JVM args, parse modes, emit batching
Parse Modes — control how documents are parsed and emitted (RMETA, CONCATENATE, CONTENT_ONLY, NO_PARSE, UNPACK)
Extracting Embedded Bytes — extract raw bytes from embedded documents
Timeouts — two-tier timeout system for handling long-running and hung parsers

Advanced Topics

Shared Server Mode - experimental mode for reduced memory usage