Tika Command Line Interface

The tika-app command line interface is still in flux for 4.x. Options and behavior may change before the final release.

This section covers using Apache Tika from the command line via tika-app.

Overview

The Tika application (tika-app) is a command line utility for extracting text content and metadata from a wide range of file formats.

Installation

As of 4.x, tika-app is distributed as a zip archive rather than a single self-contained jar. The bare tika-app-<version>.jar is only a thin launcher: the parsers and supporting modules (including the Tika Pipes processor) live in the adjacent lib/ directory, so running the jar on its own fails with NoClassDefFoundError.

Download tika-app-<version>.zip, unzip it, and run tika-app-<version>.jar from inside the unzipped directory so that lib/ and plugins/ sit alongside the jar:

unzip tika-app-<version>.zip
cd tika-app-<version>
java -jar tika-app-<version>.jar [option...] [file|port...]

The examples below use tika-app.jar as shorthand for the versioned jar in the unzipped distribution.
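Because the launcher jar depends on the directories next to it, a wrapper script can check the layout before starting Java. The following is a minimal sketch under that assumption; the function name check_tika_layout is hypothetical and not part of tika-app, and treating both lib/ and plugins/ as required is an assumption drawn from the description above.

```shell
# Hypothetical helper (not part of tika-app): verify that lib/ and plugins/
# sit next to the launcher jar before trying to run it.
check_tika_layout() {
  jar_path="$1"
  jar_dir=$(dirname "$jar_path")
  if [ ! -f "$jar_path" ]; then
    echo "missing jar: $jar_path" >&2
    return 1
  fi
  for sub in lib plugins; do
    if [ ! -d "$jar_dir/$sub" ]; then
      echo "missing directory: $jar_dir/$sub" >&2
      return 1
    fi
  done
  return 0
}
```

A wrapper could call check_tika_layout ./tika-app.jar and only run java -jar when it returns success, which turns a NoClassDefFoundError into a clearer message about the missing directory.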

Basic Usage

java -jar tika-app.jar [option...] [file|port...]

Command Line Options

Help and Information

Option              Description
-? or --help        Display usage instructions
-v or --verbose     Enable debug-level output
-V or --version     Show version details

Operation Modes

Option              Description
-g or --gui         Launch the graphical interface
-s or --server      Start the web server
-f or --fork        Enable fork mode for isolated extraction

Output Formatting

Option              Description
-x or --xml         Output XHTML (default)
-h or --html        Output HTML
-t or --text        Output plain text
--md                Output Markdown
-m or --metadata    Output metadata only
-j or --json        Output JSON metadata

Examples

Extract text from a file

java -jar tika-app.jar --text document.pdf

Extract metadata as JSON

java -jar tika-app.jar --json document.docx

Pipeline processing

Extract text from a remote document and check whether it contains a keyword:

curl http://example.com/document.doc | java -jar tika-app.jar --text | grep -q keyword
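Since grep -q prints nothing, the result of the pipeline above is carried only by its exit status (0 when the keyword is found). A small sketch of how a script might act on that status follows; the helper name contains_keyword is hypothetical, and the tika-app invocation is left commented out so the block runs without Tika installed.

```shell
# Hypothetical helper: succeeds (exit status 0) if the text on stdin
# contains the given keyword, and prints nothing either way.
contains_keyword() {
  grep -q -- "$1"
}

# Usage with tika-app (commented out; requires network access and the jar):
# if curl -s http://example.com/document.doc \
#      | java -jar tika-app.jar --text \
#      | contains_keyword keyword; then
#   echo "keyword found"
# else
#   echo "keyword not found"
# fi
```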

Tika Pipes processing

Process many documents by specifying input and output paths. Inputs can be a local directory, S3, GCS, Azure, JDBC, and others via Tika Pipes fetchers:

java -jar tika-app.jar -i /path/to/input -o /path/to/output

Extract Markdown from a file

java -jar tika-app.jar --md document.docx

Custom configuration

Use a custom configuration file:

java -jar tika-app.jar --config=tika-config.json document.pdf

Tika Pipes Processing

To process many documents from a local directory, S3, GCS, Azure, JDBC, or any other Tika Pipes source, run tika-app with input and output paths. Under the hood this uses Tika Pipes, which dispatches parses asynchronously to forked JVM processes for fault tolerance. Tika prints a one-line banner to stderr when it switches into Pipes mode, so you can confirm which mode is active.

How Pipes mode is activated

tika-app enters Pipes mode automatically when any of the following are true:

  • Two positional arguments are given and the first is an existing directory (tika-app.jar /in /out).

  • Any of these options are present: -i, -o, --input, --output, --fileList, -z/-Z/--extract/--extract-dir, or -a/--async.

  • A single .json argument is given — it is treated as a Tika Pipes config file.

Anything else (single file, URL, stdin, --gui, --server) stays in standard single-document mode.
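The activation rules above can be sketched as a small shell function. This is an illustration of the documented rules, not tika-app's actual argument parser; the name tika_mode is hypothetical, and the sketch simplifies "positional argument" to mean "the only arguments on the command line, none starting with a dash".

```shell
# Sketch of the documented activation rules (not tika-app's real parser).
# Prints "pipes" or "standard" for the given argument list.
tika_mode() {
  # Rule: any Pipes-specific option is present.
  for arg in "$@"; do
    case "$arg" in
      -i|-o|--input|--output|--fileList|-z|-Z|--extract|--extract-dir|-a|--async)
        echo pipes
        return 0
        ;;
    esac
  done
  # Rule: a single .json argument is treated as a Pipes config file.
  if [ "$#" -eq 1 ]; then
    case "$1" in
      *.json)
        echo pipes
        return 0
        ;;
    esac
  fi
  # Rule: two positional arguments, the first an existing directory.
  if [ "$#" -eq 2 ] && [ -d "$1" ]; then
    case "$2" in
      -*) ;;
      *)
        echo pipes
        return 0
        ;;
    esac
  fi
  # Anything else stays in single-document mode.
  echo standard
}
```

For example, tika_mode -i /in -o /out and tika_mode pipes-config.json both print pipes, while tika_mode --text document.pdf prints standard.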

Basic Pipes Usage

java -jar tika-app.jar -i /path/to/input -o /path/to/output

This processes all files in the input directory and writes JSON metadata (RMETA format) to the output directory.

Tika Pipes Options

Option                Description
-i                    Input directory
-o                    Output directory
--handler             Content handler type: t=text, h=html, x=xml, m=markdown, b=body, i=ignore (default: t)
--concatenate         Concatenate content from all embedded documents into a single content field
--content-only        Output only extracted content (no metadata, no JSON wrapper); implies --concatenate
--on-exists           Behavior when an output file already exists: exception (default), replace, or skip
-T or --timeoutMs     Timeout for each parse in milliseconds
-n or --numClients    Number of parallel forked processes
-p or --pluginsDir    Plugins directory
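Several of these options are typically combined in one run. The sketch below prints such a command line instead of executing it, so it can be inspected first. The function name tika_pipes_cmd is hypothetical, the timeout and client count are illustrative values rather than recommendations, and passing each value as a separate space-delimited argument is an assumption based on the --handler m form used elsewhere in this section.

```shell
# Hypothetical wrapper: prints a Pipes command line rather than running it.
# 60000 ms and 4 clients are illustrative values only. Paths are printed
# unquoted, so the output is meant for reading, not for eval.
tika_pipes_cmd() {
  in_dir="$1"
  out_dir="$2"
  echo java -jar tika-app.jar -i "$in_dir" -o "$out_dir" \
    --handler t --on-exists skip --timeoutMs 60000 --numClients 4
}
```

With --on-exists skip, a rerun over the same output directory would leave existing output files alone rather than fail on them, which suits resuming an interrupted batch.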

Tika Pipes Examples

Extract Markdown content only (no metadata) from all files:

java -jar tika-app.jar -i /path/to/input -o /path/to/output --handler m --content-only

This produces .md files in the output directory containing just the extracted Markdown content, with no JSON wrappers and no metadata fields.

Extract text with all metadata in concatenated mode:

java -jar tika-app.jar -i /path/to/input -o /path/to/output --concatenate

Use a Tika config file alongside the Pipes options. Both --config=tika-config.json (the standard-mode long form) and -c tika-config.json work:

java -jar tika-app.jar -i /path/to/input -o /path/to/output --config=tika-config.json