Getting Started with Tika Pipes

This guide walks through a complete working example: reading files from a directory, parsing them, and writing JSON metadata to an output directory.

Quick Start with tika-app

The simplest way to use Tika Pipes is through tika-app:

java -jar tika-app.jar -i /data/input -o /data/output

This recursively processes all files in /data/input and writes one .json file per document to /data/output. Each JSON file contains the extracted metadata and text content.
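
Since the run writes one .json file per document, a quick sanity check is to compare file counts between the two directories. The sketch below assumes a POSIX shell with find and wc. The helper names count_files, count_json, and report_counts are illustrative and not part of Tika.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tika): compare input and output file counts.

# Count regular files under a directory, recursively.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Count .json files under a directory, recursively.
count_json() {
  find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}

# Print both counts for an input/output directory pair.
report_counts() {
  echo "input files: $(count_files "$1")"
  echo "output json: $(count_json "$2")"
}

# Example: report_counts /data/input /data/output
```

A mismatch between the two numbers is a prompt to look at the run more closely rather than proof that something failed, since skipped or failed documents may legitimately change the output count.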

Common Options

# Use 4 parallel forked processes
java -jar tika-app.jar -i /data/input -o /data/output -n 4

# Set memory limit per forked process
java -jar tika-app.jar -i /data/input -o /data/output -n 4 -X 512m

# Set parse timeout (milliseconds)
java -jar tika-app.jar -i /data/input -o /data/output -T 120000

# Extract plain text only (no HTML)
java -jar tika-app.jar -i /data/input -o /data/output --handler t

# Recursively unpack all embedded documents
java -jar tika-app.jar -i /data/input -o /data/output -Z

Handler types: t (text), h (html), x (xml), m (markdown), b (body), i (ignore/metadata only).
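
The options above can be collected in a small wrapper so that the settings live in one place. The following is a minimal sketch, not an official script: the function name tika_cmd and the TIKA_JAR variable are hypothetical, and it assumes that -n, -X, and -T can be combined in a single invocation (the examples above show only -n together with -X). It prints the command instead of running it, so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical wrapper: builds a tika-app command from the flags shown in this guide.
# Assumption: -n, -X and -T may be combined in one invocation.
TIKA_JAR=${TIKA_JAR:-tika-app.jar}

# Print the tika-app command for the given settings without running it.
# Note: paths containing spaces are not quoted in the printed command.
tika_cmd() {
  in_dir=$1
  out_dir=$2
  procs=${3:-4}            # -n: number of parallel forked processes
  heap=${4:-512m}          # -X: memory limit per forked process
  timeout_ms=${5:-120000}  # -T: parse timeout in milliseconds
  echo "java -jar $TIKA_JAR -i $in_dir -o $out_dir -n $procs -X $heap -T $timeout_ms"
}

# Example: 8 processes, 512m each, 60-second timeout.
tika_cmd /data/input /data/output 8 512m 60000
```

Once the printed command looks right, it can be copied into a terminal or run with eval.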

JSON Configuration

For more control, create a JSON config file. Here is a complete filesystem-to-filesystem pipeline:

{
  "fetchers": [
    {
      "file-system-fetcher": {
        "id": "input-fetcher",
        "basePath": "/data/input",
        "extractFileSystemMetadata": true
      }
    }
  ],
  "emitters": [
    {
      "file-system-emitter": {
        "id": "output-emitter",
        "basePath": "/data/output",
        "fileExtension": "json",
        "onExists": "SKIP",
        "prettyPrint": false
      }
    }
  ],
  "parsers": [
    {
      "default-parser": {}
    }
  ]
}

Run it with:

java -jar tika-app.jar --config tika-config.json -i /data/input -o /data/output

The -i and -o flags override the basePath values in the config when used with tika-app. The config file is useful for setting other options like extractFileSystemMetadata, onExists, and prettyPrint.

How It Works

A Tika Pipes pipeline has four components:

  1. Pipes Iterator: enumerates the documents to process (e.g., walk a directory, list an S3 bucket, query a database)

  2. Fetcher: retrieves each document’s bytes (e.g., read from filesystem, download from S3)

  3. Parsers: extract text and metadata (run in a forked JVM for robustness)

  4. Emitter: writes the results (e.g., JSON to filesystem, index to Elasticsearch)

Iterator --> Fetcher --> [forked JVM: Parse] --> Emitter

Each parse runs in an isolated forked process with configurable timeouts and memory limits. If a parse hangs or crashes, only that forked process is affected — the pipeline continues with the remaining documents.
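
The isolation idea can be illustrated with a toy shell analogue. This is not how Tika is implemented: wc -c stands in for a parser, the function name run_pipeline is made up for this sketch, and the timeout command from GNU coreutils is assumed to be available. Each "parse" runs in its own child process with a time limit, and a failure or timeout only affects that one document.

```shell
#!/bin/sh
# Toy analogue of Iterator -> Fetcher -> Parse -> Emitter. Not Tika's implementation.
# Assumes the `timeout` command (GNU coreutils) is installed.
run_pipeline() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  # Iterator: enumerate the documents to process.
  find "$in_dir" -type f | while IFS= read -r doc; do
    name=$(basename "$doc")
    # Fetch + parse in a separate child process with a 5-second limit.
    # `wc -c` is a stand-in for a real parser.
    if size=$(timeout 5 sh -c 'wc -c < "$1"' parse "$doc"); then
      bytes=$(echo $size)  # unquoted on purpose: trims padding some wc versions add
      # Emitter: write one result file per document (no JSON escaping of the name).
      printf '{"name":"%s","bytes":%s}\n' "$name" "$bytes" > "$out_dir/$name.json"
    else
      # A failed or timed-out parse is reported; the loop moves on to the next document.
      echo "parse failed or timed out: $doc" >&2
    fi
  done
}

# Example: run_pipeline ./input ./output
```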

Pipeline Configuration Options

The pipes section controls the pipeline behavior:

Field                        Default  Description
numClients                   4        Number of parallel forked parse processes
parseMode                    RMETA    Output mode: RMETA (full recursive metadata), CONCATENATE, CONTENT_ONLY, UNPACK
socketTimeoutMs              60000    Maximum time (ms) for a single parse operation
maxFilesProcessedPerProcess  10000    Restart forked processes after this many files (mitigates memory leaks)
onParseException             EMIT     What to do on parse failure: EMIT (emit error metadata), SKIP

See Parse Modes and Timeouts for details.
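
As a sketch of how these fields might look in a config file, the snippet below writes a pipes section with every field set to its documented default. The placement is an assumption: it treats pipes as a top-level key alongside fetchers, emitters, and parsers, and it writes numeric fields as JSON numbers. The file name pipes-config.json is only an example. Check the reference documentation for the exact structure before relying on it.

```shell
#!/bin/sh
# Writes an example config file. Assumptions: "pipes" is a top-level section,
# and numeric fields are JSON numbers. Values are the documented defaults.
cat > pipes-config.json <<'EOF'
{
  "pipes": {
    "numClients": 4,
    "parseMode": "RMETA",
    "socketTimeoutMs": 60000,
    "maxFilesProcessedPerProcess": 10000,
    "onParseException": "EMIT"
  }
}
EOF
```

In a real setup these settings would presumably live in the same file as the fetcher and emitter definitions and be passed to tika-app with --config.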

Next Steps

  • Fetchers — all available document sources

  • Emitters — all available output destinations

  • Iterators — all available document enumeration methods

  • Reporters — track processing status