Getting Started with Tika Pipes

Table of Contents

Quick Start with tika-app
- Common Options
JSON Configuration
How It Works
Pipeline Configuration Options
Next Steps

This guide walks through a complete working example: reading files from a directory, parsing them, and writing JSON metadata to an output directory.

Quick Start with tika-app

The simplest way to use Tika Pipes is through tika-app:

java -jar tika-app.jar -i /data/input -o /data/output

This recursively processes all files in /data/input and writes one .json file per document to /data/output. Each JSON file contains the extracted metadata and text content.
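To check a run quickly, you can walk the output directory and summarize each JSON file. The sketch below is a hypothetical helper, not part of Tika. It assumes the extracted text is stored under the key `X-TIKA:content` and that a file may hold either one metadata object or a list of them, so verify both against your own output first.

```python
import json
from pathlib import Path

# Assumption: extracted text lives under this key; confirm against your output.
CONTENT_KEY = "X-TIKA:content"


def summarize_output(output_dir):
    """Return one summary dict per .json file found under output_dir."""
    summaries = []
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Accept either a single metadata object or a list of them.
        records = data if isinstance(data, list) else [data]
        text_chars = 0
        for record in records:
            value = record.get(CONTENT_KEY) if isinstance(record, dict) else None
            if isinstance(value, str):
                text_chars += len(value)
        summaries.append(
            {"file": path.name, "records": len(records), "text_chars": text_chars}
        )
    return summaries
```

Calling `summarize_output("/data/output")` after a run returns a list you can print or filter, for example to spot documents that produced no text.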

Common Options

# Use 4 parallel forked processes
java -jar tika-app.jar -i /data/input -o /data/output -n 4

# Set memory limit per forked process
java -jar tika-app.jar -i /data/input -o /data/output -n 4 -X 512m

# Set parse timeout (milliseconds)
java -jar tika-app.jar -i /data/input -o /data/output -T 120000

# Extract plain text only (no HTML)
java -jar tika-app.jar -i /data/input -o /data/output --handler t

# Recursively unpack all embedded documents
java -jar tika-app.jar -i /data/input -o /data/output -Z

Handler types: t (text), h (html), x (xml), m (markdown), b (body), i (ignore/metadata only). The default is m (markdown).
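If you launch tika-app from a script, it can help to assemble the command from named options rather than concatenating strings. The following is a minimal sketch that uses only the flags shown above (`-i`, `-o`, `-n`, `-X`, `-T`, `--handler`, `-Z`). The function name and its parameters are hypothetical, and the default jar path is an assumption.

```python
# Handler letters as listed in this guide.
HANDLER_TYPES = {
    "t": "text",
    "h": "html",
    "x": "xml",
    "m": "markdown",
    "b": "body",
    "i": "ignore (metadata only)",
}


def build_tika_command(input_dir, output_dir, jar="tika-app.jar",
                       num_clients=None, heap=None, timeout_ms=None,
                       handler=None, unpack_embedded=False):
    """Build an argument list for tika-app from the options in this guide."""
    cmd = ["java", "-jar", jar, "-i", str(input_dir), "-o", str(output_dir)]
    if num_clients is not None:
        cmd += ["-n", str(num_clients)]      # parallel forked processes
    if heap is not None:
        cmd += ["-X", heap]                  # memory limit per forked process
    if timeout_ms is not None:
        cmd += ["-T", str(timeout_ms)]       # parse timeout in milliseconds
    if handler is not None:
        if handler not in HANDLER_TYPES:
            raise ValueError(f"unknown handler type: {handler!r}")
        cmd += ["--handler", handler]
    if unpack_embedded:
        cmd.append("-Z")                     # recursively unpack embedded documents
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Using a list avoids shell quoting problems when paths contain spaces.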

JSON Configuration

For more control, create a JSON config file. Here is a complete filesystem-to-filesystem pipeline:

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}

The uppercase values FETCHER_BASE_PATH, EMITTER_BASE_PATH, PLUGINS_PATHS, and EMIT_INTERMEDIATE_RESULTS are placeholders that the integration tests substitute at runtime. In your own config, replace them with real paths or, for EMIT_INTERMEDIATE_RESULTS, with the boolean true or false.
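One way to do the substitution yourself is a small script that swaps each quoted placeholder for a JSON-encoded value and then parses the result to confirm it is still valid JSON. This is a sketch under assumptions: the function name is hypothetical, and it treats `plugin-roots` as a single string because that is how the template above shows it.

```python
import json


def fill_placeholders(template_text, fetcher_base, emitter_base,
                      plugin_roots, emit_intermediate):
    """Replace the test placeholders in a config template and parse the result."""
    # Each key includes its surrounding quotes so that the boolean
    # replaces the whole quoted token rather than ending up inside a string.
    replacements = {
        '"FETCHER_BASE_PATH"': json.dumps(str(fetcher_base)),
        '"EMITTER_BASE_PATH"': json.dumps(str(emitter_base)),
        '"PLUGINS_PATHS"': json.dumps(str(plugin_roots)),
        '"EMIT_INTERMEDIATE_RESULTS"': json.dumps(bool(emit_intermediate)),
    }
    text = template_text
    for token, value in replacements.items():
        text = text.replace(token, value)
    # json.loads raises an error if the substituted text is not valid JSON.
    return json.loads(text)
```

To write the result, call `json.dump(config, handle, indent=2)` on a file opened for writing. Encoding the paths with `json.dumps` keeps backslashes in Windows paths correctly escaped.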

Run it with:

java -jar tika-app.jar --config=tika-config.json -i /data/input -o /data/output

The -i and -o flags override the basePath values in the config when used with tika-app. The config file is useful for setting other options like extractFileSystemMetadata, onExists, and prettyPrint.

How It Works

A Tika Pipes pipeline has four components:

Pipes Iterator — enumerates the documents to process (e.g., walk a directory, list an S3 bucket, query a database)
Fetcher — retrieves each document’s bytes (e.g., read from filesystem, download from S3)
Parsers — extract text and metadata (runs in a forked JVM for robustness)
Emitter — writes the results (e.g., JSON to filesystem, index to Elasticsearch)

Iterator --> Fetcher --> [forked JVM: Parse] --> Emitter

Each parse runs in an isolated forked process with configurable timeouts and memory limits. If a parse hangs or crashes, only that forked process is affected — the pipeline continues with the remaining documents.
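The isolation idea can be illustrated with a toy model. The sketch below is not Tika code: it uses child Python processes as a stand-in for the forked JVM, and the document names, statuses, and `CHILD_CODE` parser are all hypothetical. It only shows why a hang or crash in one child leaves the rest of the run unaffected.

```python
import subprocess
import sys

# Hypothetical stand-in for a parser; it runs in a separate child process.
CHILD_CODE = """
import sys, time
name = sys.argv[1]
if name.startswith("hang"):
    time.sleep(60)   # simulate a parse that never finishes
if name.startswith("crash"):
    sys.exit(1)      # simulate a parser crash
print("parsed " + name)
"""


def parse_isolated(name, timeout_s=5.0):
    """Parse one document in a child process and report what happened."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, name],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return {"doc": name, "status": "TIMEOUT"}
    if proc.returncode != 0:
        return {"doc": name, "status": "CRASH"}
    return {"doc": name, "status": "OK", "content": proc.stdout.strip()}


def run_pipeline(docs, timeout_s=5.0):
    """Process every document; a failure in one does not stop the others."""
    return [parse_isolated(doc, timeout_s) for doc in docs]
```

Running `run_pipeline(["a.txt", "hang.bin", "crash.pdf", "b.txt"])` yields a result for all four documents, with the failing ones marked rather than aborting the run. This toy version is sequential, whereas the pipeline described above runs several forked processes in parallel.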

Pipeline Configuration Options

The pipes section controls the pipeline behavior:

Field	Default	Description
`numClients`	`4`	Number of parallel forked parse processes
`parseMode`	`RMETA`	Output mode: `RMETA` (full recursive metadata), `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`
`socketTimeoutMs`	`60000`	Maximum time (ms) for a single parse operation
`maxFilesProcessedPerProcess`	`10000`	Restart each forked process after this many files (limits the impact of memory leaks)
`onParseException`	`EMIT`	What to do on parse failure: `EMIT` (emit error metadata), `SKIP`

See Parse Modes and Timeouts for details.
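When generating configs programmatically, it can be convenient to start from the defaults in the table and apply only the overrides you need. The sketch below mirrors the table and nothing more: the function name is hypothetical, the checks are illustrative rather than Tika's own validation, and any field not listed in the table is passed through unchanged.

```python
# Defaults and allowed values copied from the table in this guide.
PIPES_DEFAULTS = {
    "numClients": 4,
    "parseMode": "RMETA",
    "socketTimeoutMs": 60000,
    "maxFilesProcessedPerProcess": 10000,
    "onParseException": "EMIT",
}
ALLOWED_VALUES = {
    "parseMode": {"RMETA", "CONCATENATE", "CONTENT_ONLY", "UNPACK"},
    "onParseException": {"EMIT", "SKIP"},
}
POSITIVE_INT_FIELDS = ("numClients", "socketTimeoutMs", "maxFilesProcessedPerProcess")


def resolve_pipes(overrides):
    """Merge overrides onto the documented defaults and sanity-check them."""
    pipes = {**PIPES_DEFAULTS, **overrides}
    for field, allowed in ALLOWED_VALUES.items():
        if pipes[field] not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}")
    for field in POSITIVE_INT_FIELDS:
        value = pipes[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{field} must be a positive integer")
    return pipes
```

The returned dictionary can be assigned to the `pipes` key of a config before serializing it with `json.dump`.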

Next Steps

Fetchers — all available document sources
Emitters — all available output destinations
Iterators — all available document enumeration methods
Reporters — track processing status