Tika Command Line Interface
The tika-app command line interface is still in flux for 4.x. Options and behavior may change before the final release.
This section covers using Apache Tika from the command line via tika-app.
Overview
The Tika application (tika-app) is a command line utility for extracting
text content and metadata from a wide range of file formats.
Installation
As of 4.x, tika-app is distributed as a zip archive rather than a single
self-contained jar. The bare tika-app-<version>.jar is only a thin launcher and
will fail with NoClassDefFoundError if run on its own — the parsers and supporting
modules (including the Tika Pipes processor) live in the adjacent lib/ directory.
Download tika-app-<version>.zip, unzip it, and run tika-app-<version>.jar from
inside the unzipped directory so that lib/ and plugins/ sit alongside the jar:
unzip tika-app-<version>.zip
cd tika-app-<version>
java -jar tika-app-<version>.jar [option...] [file|port...]
The examples below use tika-app.jar as shorthand for the versioned jar in the
unzipped distribution.
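Because the launcher jar only works with lib/ next to it, a wrapper script can check the layout before starting Java. The following is a minimal sketch of such a check. The function name check_tika_layout is a hypothetical helper, not something shipped with tika-app, and it tests only for the jar and the lib/ directory described above.

```shell
# Hypothetical pre-flight check (not part of tika-app): confirm that a
# tika-app-<version>.jar and a lib/ directory sit side by side in the
# given directory. Prints the jar path on success.
check_tika_layout() {
    dir="$1"
    jar=""
    # Pick the first tika-app-*.jar found directly inside the directory.
    for candidate in "$dir"/tika-app-*.jar; do
        if [ -f "$candidate" ]; then
            jar="$candidate"
            break
        fi
    done
    if [ -z "$jar" ]; then
        echo "no tika-app-<version>.jar found in $dir" >&2
        return 1
    fi
    if [ ! -d "$dir/lib" ]; then
        echo "lib/ is missing next to $jar; unzip the full distribution" >&2
        return 1
    fi
    echo "$jar"
}

# Possible usage (paths are placeholders):
#   jar=$(check_tika_layout /path/to/unzipped/tika-app) &&
#       java -jar "$jar" -i /path/to/input -o /path/to/output
```

A failed check returns a non-zero status and explains the problem on stderr, which is usually easier to read than a NoClassDefFoundError stack trace.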
Command Line Options
Help and Information
| Option | Description |
|---|---|
| | Display usage instructions |
| | Enable debug-level output |
| | Show version details |
Examples
Pipeline processing
Extract text from a remote document and search for keywords:
curl http://example.com/document.doc | java -jar tika-app.jar --text | grep -q keyword
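Note that grep -q prints nothing and reports the result only through its exit status, so the pipeline above is mainly useful inside a script. As a rough sketch, a small helper can turn that status into a visible message. The name report_keyword is an illustrative assumption and is not part of tika-app.

```shell
# Hypothetical helper: reads extracted text on stdin and reports whether
# the given keyword occurs in it, based on the exit status of grep -q.
report_keyword() {
    keyword="$1"
    if grep -q -- "$keyword"; then
        echo "found: $keyword"
    else
        echo "missing: $keyword"
    fi
}

# Possible usage with the pipeline shown above:
#   curl http://example.com/document.doc | java -jar tika-app.jar --text | report_keyword keyword
```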
Tika Pipes Processing
For processing many documents — from a local directory, S3, GCS, Azure, JDBC,
or any other Tika Pipes source — run tika-app with input/output paths.
Under the hood this is Tika Pipes, dispatched asynchronously into forked JVM
processes for fault tolerance. Tika prints a one-line banner to stderr when
it switches into Pipes mode so you can confirm which path is running.
How Pipes mode is activated
tika-app enters Pipes mode automatically when any of the following are true:
- Two positional arguments are given and the first is an existing directory (tika-app.jar /in /out).
- Any of these options are present: -i, -o, --input, --output, --fileList, -z/-Z/--extract/--extract-dir, or -a/--async.
- A single .json argument is given; it is treated as a Tika Pipes config file.
Anything else (single file, URL, stdin, --gui, --server) stays in standard
single-document mode.
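The activation rules can be restated as a small shell function, which may help when checking ahead of time which mode a given command line would select. This is only a sketch of the rules as documented here, not tika-app's actual argument parser. The function name would_use_pipes_mode is hypothetical, and stripping a trailing =value before matching an option is an assumption based on the --config=foo.json spelling.

```shell
# Illustrative restatement of the documented rules; NOT tika-app's parser.
# Returns 0 if the arguments would select Pipes mode, 1 otherwise.
would_use_pipes_mode() {
    # Rule 1: two positional arguments, the first an existing directory.
    if [ "$#" -eq 2 ] && [ -d "$1" ]; then
        case "$2" in
            -*) ;;            # second argument is an option, not positional
            *) return 0 ;;
        esac
    fi
    # Rule 2: any of the Pipes-only options is present.
    for arg in "$@"; do
        opt=${arg%%=*}        # assumption: treat --opt=value like --opt
        case "$opt" in
            -i|-o|--input|--output|--fileList|-z|-Z|--extract|--extract-dir|-a|--async)
                return 0 ;;
        esac
    done
    # Rule 3: a single argument ending in .json (a Tika Pipes config file).
    if [ "$#" -eq 1 ]; then
        case "$1" in
            *.json) return 0 ;;
        esac
    fi
    # Anything else stays in standard single-document mode.
    return 1
}

# Possible usage:
#   if would_use_pipes_mode "$@"; then echo "expect the Pipes banner on stderr"; fi
```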
Basic Pipes Usage
java -jar tika-app.jar -i /path/to/input -o /path/to/output
This processes all files in the input directory and writes JSON metadata (RMETA format) to the output directory.
Tika Pipes Options
| Option | Description |
|---|---|
| -i | Input directory |
| -o | Output directory |
| --handler | Content handler type: |
| --concatenate | Concatenate content from all embedded documents into a single content field |
| --content-only | Output only extracted content (no metadata, no JSON wrapper); implies |
| | Behavior when an output file already exists: |
| | Timeout for each parse in milliseconds |
| | Number of parallel forked processes |
| | Plugins directory |
Tika Pipes Examples
Extract markdown content only (no metadata) from all files:
java -jar tika-app.jar -i /path/to/input -o /path/to/output --handler m --content-only
This produces .md files in the output directory containing just the extracted markdown
content — no JSON wrappers, no metadata fields.
Extract text with all metadata in concatenated mode:
java -jar tika-app.jar -i /path/to/input -o /path/to/output --concatenate
Use a Tika config file alongside the Pipes options. Both --config=foo.json
(the standard-mode long form) and -c foo.json work:
java -jar tika-app.jar -i /path/to/input -o /path/to/output --config=tika-config.json