Tika Pipes

This section covers Tika Pipes for scalable, fault-tolerant document processing.

Overview

Tika Pipes is a framework that isolates parsing from the calling application. Each document is parsed in a forked JVM with configurable timeouts and memory limits, so a single malformed file cannot crash or hang your application.

Although Tika Pipes has a programmatic Java API, it is best used through one of the following front ends:

  • tika-app — batch processing from the command line

  • tika-server — REST API with pipes-based robustness built in

  • tika-grpc — gRPC API with pipes-based robustness built in

See Robustness for details on how Tika Pipes protects against problematic files.
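
The isolation idea described in this overview (run risky work in a separate JVM, cap its heap, and kill it when it exceeds a time limit) can be illustrated with plain JDK classes. This is a minimal sketch of the general technique only, not Tika Pipes' implementation or API. The class name IsolatedRun, the Outcome enum, and the run method are hypothetical names chosen for illustration, and the sketch assumes Java 9 or later.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration: run a command in a child process with a hard time limit. */
public class IsolatedRun {

    /** Result of one isolated attempt. */
    public enum Outcome { OK, FAILED, TIMED_OUT }

    /**
     * Starts the command as a child process and waits up to timeoutMillis.
     * A crash or hang in the child cannot take down the calling JVM.
     */
    public static Outcome run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // drop output so the child never blocks on a full pipe
                .start();

        if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            child.destroyForcibly();  // the child exceeded its time budget: kill it
            child.waitFor();          // wait until the kill has taken effect
            return Outcome.TIMED_OUT;
        }
        return child.exitValue() == 0 ? Outcome.OK : Outcome.FAILED;
    }

    public static void main(String[] args) throws Exception {
        // Reuse the java binary that is running this program; fall back to "java" on the PATH.
        String java = ProcessHandle.current().info().command().orElse("java");
        // -Xmx caps the child's heap, so a memory-hungry task fails in the child only.
        List<String> command = List.of(java, "-Xmx256m", "-version");
        System.out.println(run(command, 10_000));
    }
}
```

In this sketch the parent only ever sees one of three outcomes, which is what allows it to record a failure and move on to the next document rather than failing itself.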

Key Components

  • Fetchers — retrieve documents from various sources (filesystem, S3, HTTP, etc.)

  • Emitters — send parsed results to various destinations (filesystem, OpenSearch, ES-compatible, Solr, etc.)

  • Pipelines — configure processing workflows
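
How the three components relate can be sketched as a fetch, parse, emit loop. The interfaces below are hypothetical stand-ins written for this illustration. They are not Tika's actual Fetcher or Emitter types, and the "parse" step is a placeholder that only records the byte count and the text, assuming UTF-8 input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of the fetch, parse, emit flow; not Tika's API. */
public class PipelineSketch {

    /** Stand-in for a fetcher: resolves a key to the document's bytes. */
    public interface Fetcher {
        InputStream fetch(String key) throws IOException;
    }

    /** Stand-in for an emitter: delivers the parsed result for a key to a destination. */
    public interface Emitter {
        void emit(String key, Map<String, String> metadata) throws IOException;
    }

    /** Fetches one document, "parses" it (placeholder logic), and emits the result. */
    public static void process(String key, Fetcher fetcher, Emitter emitter) throws IOException {
        try (InputStream in = fetcher.fetch(key)) {
            byte[] bytes = in.readAllBytes();
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("length", Integer.toString(bytes.length));
            metadata.put("content", new String(bytes, StandardCharsets.UTF_8));
            emitter.emit(key, metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory source and destination keep the sketch self-contained.
        Map<String, byte[]> source = new HashMap<>();
        source.put("doc-1", "hello pipes".getBytes(StandardCharsets.UTF_8));
        Map<String, Map<String, String>> sink = new HashMap<>();

        Fetcher fetcher = key -> new ByteArrayInputStream(source.get(key));
        Emitter emitter = sink::put;

        process("doc-1", fetcher, emitter);
        System.out.println(sink);
    }
}
```

Because the source and destination sit behind small interfaces, either one can be swapped (for example, a filesystem source with a search-index destination) without touching the processing step, which is the design idea behind separating fetchers from emitters.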

Topics

  • Getting Started — complete working example with tika-app

  • Fetchers — all available document sources (filesystem, S3, HTTP, GCS, Azure, etc.)

  • Emitters — all available output destinations (filesystem, ES, OpenSearch, Solr, S3, Kafka, etc.)

  • Iterators — document enumeration (directory walk, S3 listing, CSV, JDBC, Kafka, etc.)

  • Reporters — track per-document processing status

  • Pipeline Configuration — numClients, timeouts, JVM args, parse modes, emit batching

  • Parse Modes — control how documents are parsed and emitted (RMETA, CONCATENATE, CONTENT_ONLY, NO_PARSE, UNPACK)

  • Extracting Embedded Bytes — extract raw bytes from embedded documents

  • Timeouts — two-tier timeout system for handling long-running and hung parsers

Advanced Topics