Tika Command Line Interface
| The tika-app command line interface is still in flux for 4.x. Options and behavior may change before the final release. |
This section covers using Apache Tika from the command line via tika-app. The
authoritative option list is java -jar tika-app.jar --help — this page mirrors
that output and adds usage context. If the two disagree, --help wins; please
file a ticket.
Overview
The Tika application (tika-app) is a command line utility for extracting
text content and metadata from all sorts of files. It operates in three modes:
- Standard mode: parse a single file, URL, or stdin and write the result to stdout.
- GUI mode: --gui launches a desktop window for drag-and-drop parsing.
- Tika Pipes mode: process many documents from a directory (or S3, GCS, Azure, JDBC, etc.) via the asynchronous Pipes pipeline. Activated by any of the conditions listed under "How Pipes mode is activated" below.
Installation
As of 4.x, tika-app is distributed as a zip archive rather than a single
self-contained jar. The bare tika-app-<version>.jar is only a thin launcher and
will fail with NoClassDefFoundError if run on its own — the parsers and supporting
modules (including the Tika Pipes processor) live in the adjacent lib/ directory.
Download tika-app-<version>.zip, unzip it, and run tika-app-<version>.jar from
inside the unzipped directory so that lib/ and plugins/ sit alongside the jar:
unzip tika-app-<version>.zip
cd tika-app-<version>
java -jar tika-app-<version>.jar [option...] [file|port...]
The examples below use tika-app.jar as shorthand for the versioned jar in the
unzipped distribution.
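Because the bare jar fails without its neighbouring lib/ directory, a wrapper script may want to check the layout before launching. The following is a minimal sketch; check_tika_layout is a hypothetical helper name, not something shipped with Tika, and it only tests for the jar and lib/ as described above.

```shell
# Hypothetical helper (not part of Tika): confirm that a tika-app jar
# sits next to the lib/ directory it depends on.
check_tika_layout() {
  jar=$1
  dir=$(dirname -- "$jar")
  if [ ! -f "$jar" ]; then
    echo "missing jar: $jar" >&2
    return 1
  fi
  if [ ! -d "$dir/lib" ]; then
    echo "no lib/ directory next to $jar; unzip the full distribution" >&2
    return 1
  fi
  echo "layout ok: $dir"
}
```

A launcher script could call check_tika_layout with the path to the versioned jar and only proceed to java -jar when it returns success.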
Basic Usage
java -jar tika-app.jar [option...] [file|port...]
If no file or URL is given (or - is given), tika-app parses standard input.
If no arguments are given at all and no stdin is piped in, the GUI launches.
Standard-mode Options
These options apply to single-document parsing (the default mode). For Pipes-mode options see Tika Pipes Processing below.
Help and Information
| Option | Description |
|---|---|
| --help | Print the usage message |
| | Print debug-level messages |
| | Print the Apache Tika version |
Configuration
| Option | Description |
|---|---|
| --config=<file> | TikaConfig file (JSON as of Tika 4.x). Must appear before |
| | Convert a legacy 3.x XML config to 4.x JSON format (parsers section only) and write to stdout. Redirect to save, e.g. |
Output Formatting
| Option | Description |
|---|---|
| | Output XHTML content (default) |
| | Output HTML content |
| --text | Output plain text content (body) |
| | Output Markdown content (body) |
| | Output plain text (main content only, via the boilerpipe handler) |
| | Output all text content |
| | Output metadata only |
| | Output metadata in JSON |
| | Output metadata in XMP |
| | Output metadata and content from all embedded files. Combine with |
| | For JSON, XML, and XHTML output, add newlines and whitespace for readability. |
| | Use output encoding |
Detection and Language
| Option | Description |
|---|---|
| | Detect the document type and print the media type. |
| | Detect and print only the language. |
Content Options
| Option | Description |
|---|---|
| | Use document password |
| | Include a digest of the parsed bytes. Supported via |
Attachment Extraction (single-document)
| Option | Description |
|---|---|
| -z, --extract | Extract all attachments into the current directory. WARNING: As of 4.x |
| --extract-dir | Target directory for |
| | Behavior when an output file already exists: |
| | Maximum depth for embedded document extraction. |
| | Maximum number of embedded documents to extract. |
Async Mode
| Option | Description |
|---|---|
| -a, --async | Run Tika in async mode. Requires a |
Listing and Inspection
| Option | Description |
|---|---|
| | List the available document parsers. |
| | List the available parsers and their supported mime types. |
| | Same as |
| | List the available document detectors. |
| | List the available metadata models and their supported keys. |
| | List all known media types and related information. |
| | Compare Tika’s known media types to the |
Fork Mode (process isolation)
Fork mode parses the document in a separate JVM, protecting the main process from parser crashes, OOM, and timeouts.
| Option | Description |
|---|---|
| | Run parsing in a forked JVM process. |
| | Parse timeout in milliseconds (default: 60000). |
| | JVM args for the forked process, comma-separated. Example: |
| | Directory containing plugin zips for the forked process. |
Examples
Reading from stdin
Extract text from a remote document and search for keywords:
curl http://example.com/document.doc | java -jar tika-app.jar --text | grep -q keyword
tika-app reads from standard input when no file argument is given (or when - is given). For batch processing of many documents, see Tika Pipes Processing below.
Tika Pipes Processing
For processing many documents — from a local directory, S3, GCS, Azure, JDBC,
or any other Tika Pipes source — run tika-app with input/output paths.
Under the hood this is Tika Pipes, dispatched asynchronously into forked JVM
processes for fault tolerance. Tika prints a one-line banner to stderr when
it switches into Pipes mode so you can confirm which path is running.
How Pipes mode is activated
tika-app enters Pipes mode automatically when any of the following are true:
- Two positional arguments are given and the first is an existing directory (tika-app.jar /in /out).
- A single .json argument is given; it is treated as a Tika Pipes config file.
- Any of these options are present: -i, --input, -o, --output, --fileList, -z, --extract, --extract-dir, -Z, or -a/--async.
Anything else (single file, URL, stdin, --gui) stays in standard
single-document mode.
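The activation rules can be sketched as a small shell function, which is handy for checking a command line before running it. This is an illustration of the rules as stated on this page only; predict_tika_mode is a hypothetical name, and the function does not reproduce tika-app's actual argument parser (for example, it does not separate positional arguments from options mixed in among them).

```shell
# Hypothetical helper: predict "pipes" or "standard" from the arguments,
# following only the three activation rules listed on this page.
predict_tika_mode() {
  # Rule: any Pipes-activating option is present.
  for arg in "$@"; do
    case $arg in
      -i|--input|-o|--output|--fileList|-z|--extract|--extract-dir|-Z|-a|--async)
        echo pipes
        return 0 ;;
    esac
  done
  # Rule: two positional arguments, the first an existing directory.
  if [ "$#" -eq 2 ] && [ -d "$1" ]; then
    case $2 in
      -*) ;;                      # second argument looks like an option
      *) echo pipes; return 0 ;;
    esac
  fi
  # Rule: a single argument ending in .json.
  if [ "$#" -eq 1 ]; then
    case $1 in
      *.json) echo pipes; return 0 ;;
    esac
  fi
  echo standard
}
```

For instance, predict_tika_mode -z report.pdf reports pipes, which matches the single-file pitfall with -z described in this section.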
The activation list includes flags that are also documented as standard-mode
options (-z, --extract, --extract-dir). Passing one of those with a single
file routes into Pipes mode and then fails, because the async dispatcher
expects an input directory. If you want unpack-while-pipes behaviour, use the
Pipes-specific -Z instead.
|
Use the GNU-style double-dash form for long flags. --input /path
works; -input /path (single dash plus the long name) does not — tika-app
rejects single-dash long names with an IllegalArgumentException pointing
you at the right form. Single-letter short flags use one dash
(e.g., -i, -eUTF-8, -X512m).
|
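A script that assembles tika-app command lines can catch the single-dash mistake before the JVM starts. The sketch below is a hypothetical pre-flight check, not part of Tika. Its list of long names is taken from flags mentioned on this page and is not exhaustive, and short flags with attached values (such as -eUTF-8 or -X512m) are deliberately left alone.

```shell
# Hypothetical pre-flight check: report long option names written with a
# single dash. Returns 1 if any are found, 0 otherwise.
find_single_dash_long_flags() {
  status=0
  for arg in "$@"; do
    case $arg in
      -input|-output|-fileList|-extract|-extract-dir|-async|-handler|\
      -concatenate|-content-only|-config|-config=*|-help|-gui|-text)
        # Prepending one more dash yields the double-dash form.
        printf 'use -%s instead of %s\n' "$arg" "$arg"
        status=1 ;;
    esac
  done
  return "$status"
}
```

Calling it with the same arguments intended for tika-app prints one suggestion per offending flag and leaves correct command lines silent.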
Basic Pipes Usage
java -jar tika-app.jar -i /path/to/input -o /path/to/output
This processes all files in the input directory and writes JSON metadata (RMETA format) to the output directory.
Tika Pipes Options
Input and output
| Option | Description |
|---|---|
| -i, --input | Input directory. |
| -o, --output | Output directory. |
| --fileList | File list (one path per line, relative to |
| | Behavior when an output file already exists: |
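A file list with one path per line can be generated with standard tools. The sketch below assumes the paths should be relative to the input directory; confirm that against the --help output before relying on it. The function name make_file_list and the *.pdf filter are illustrative choices, not part of Tika.

```shell
# Hypothetical helper: write one relative path per line for every PDF
# found under an input directory.
make_file_list() {
  input_dir=$1
  list_file=$2
  # Run find from inside the input directory so the paths come out
  # relative, strip the leading "./", and sort for a stable order.
  (cd "$input_dir" && find . -type f -name '*.pdf' | sed 's|^\./||' | LC_ALL=C sort) > "$list_file"
}
```

The resulting file could then be passed with --fileList alongside -i and -o.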
Output formatting
| Option | Description |
|---|---|
| --handler | Content handler type: |
| --concatenate | Concatenate content from all embedded documents into a single content field. |
| --content-only | Output only the extracted content (no metadata, no JSON wrapper). Implies |
Execution
| Option | Description |
|---|---|
| | Number of parallel forked processes. |
| | |
| | Timeout for each parse in milliseconds. |
Configuration
| Option | Description |
|---|---|
| -c | Tika config file. |
| | Plugins directory. |
Unpack (recursive attachment extraction)
| Option | Description |
|---|---|
| -Z | Recursively unpack all attachments. This is the Pipes-mode counterpart to standard-mode |
| | Output format: |
| | Output mode: |
| | Include |
Tika Pipes Examples
Extract markdown content only (no metadata) from all files:
java -jar tika-app.jar -i /path/to/input -o /path/to/output --handler m --content-only
This produces .md files in the output directory containing just the extracted markdown
content — no JSON wrappers, no metadata fields.
Extract text with all metadata in concatenated mode:
java -jar tika-app.jar -i /path/to/input -o /path/to/output --concatenate
Use a Tika config file alongside the Pipes options. Both --config=foo.json
(the standard-mode long form) and -c foo.json work:
java -jar tika-app.jar -i /path/to/input -o /path/to/output --config=tika-config.json
Recursively unpack attachments into the output directory:
java -jar tika-app.jar -i /path/to/input -o /path/to/output -Z