Tika Command Line Interface
The tika-app command line interface is still in flux for 4.x. Options and behavior may change before the final release.
This section covers using Apache Tika from the command line via tika-app.
Overview
The Tika application (tika-app.jar) is a standalone command line utility for extracting
text content and metadata from a wide range of file formats.
Command Line Options
Help and Information
| Option | Description |
|---|---|
| | Display usage instructions |
| | Enable debug-level output |
| | Show version details |
Examples
Pipeline processing
Extract text from a remote document and search for keywords:
curl http://example.com/document.doc | java -jar tika-app.jar --text | grep -q keyword
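Because `grep -q` prints nothing and reports a match only through its exit status, the pipeline above is most useful inside a script. The following is a minimal sketch, assuming a local file is fed to tika-app on standard input in the same way as the `curl` output above. The `TIKA_JAR` variable and the `tika_text` and `doc_mentions` functions are hypothetical helpers introduced for illustration; they are not part of Tika.

```shell
#!/bin/sh
# Assumption: path to the tika-app jar; override TIKA_JAR to match your setup.
TIKA_JAR=${TIKA_JAR:-tika-app.jar}

# Hypothetical helper: read a document on stdin, write extracted text to stdout.
tika_text() {
  java -jar "$TIKA_JAR" --text
}

# Hypothetical helper: exit 0 if the text extracted from file $1
# contains the fixed string $2, non-zero otherwise.
doc_mentions() {
  tika_text < "$1" | grep -qF -- "$2"
}

# Usage: branch on the exit status rather than on printed output.
# if doc_mentions report.doc keyword; then echo "match"; fi
```

Wrapping the extraction step in its own function keeps the keyword check independent of how the text is produced, so the same check can be reused with a different extractor or exercised without a JVM.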
Batch Processing (tika-async-cli)
For processing large numbers of files, use tika-async-cli. It uses the Tika Pipes
architecture with forked JVM processes for fault tolerance.
Basic Batch Usage
java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output
This processes all files in the input directory and writes JSON metadata (RMETA format) to the output directory.
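After a run, it can be useful to confirm that every input file produced an output file. The sketch below assumes each output is written at the same relative path as its input with a `.json` suffix appended. That naming is an assumption to verify against your own output directory, and `missing_outputs` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list input files that have no matching JSON output.
# Assumption: the output for <input>/a/b.pdf lands at <output>/a/b.pdf.json.
missing_outputs() {
  in_dir=$1
  out_dir=$2
  # List inputs relative to the input directory, one path per line.
  ( cd "$in_dir" && find . -type f ) | while IFS= read -r rel; do
    rel=${rel#./}
    # Print the relative path when the expected output file is absent.
    [ -f "$out_dir/$rel.json" ] || printf '%s\n' "$rel"
  done
}

# Usage:
# missing_outputs /path/to/input /path/to/output
```

An empty result means every input has a counterpart under the assumed naming. Any path that is printed is worth checking by hand, since it may indicate a file that was skipped or a parse that did not complete.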
Batch Options
| Option | Description |
|---|---|
| `-i` | Input directory |
| `-o` | Output directory |
| `-h` | Content handler type: |
| `--concatenate` | Concatenate content from all embedded documents into a single content field |
| `--content-only` | Output only extracted content (no metadata, no JSON wrapper); implies |
| | Timeout for each parse in milliseconds |
| | Number of parallel forked processes |
| | Plugins directory |
Batch Examples
Extract markdown content only (no metadata) from all files:
java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m --content-only
This produces .md files in the output directory containing just the extracted markdown
content, with no JSON wrappers and no metadata fields.
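Since content-only output carries no metadata, an empty file is the main visible sign that nothing was extracted from a document. As a rough sketch, the zero-byte `.md` files in an output directory can be listed with `find`; `empty_markdown` is a hypothetical helper name, and files that contain only whitespace are not detected by this check.

```shell
#!/bin/sh
# Hypothetical helper: list zero-byte .md files under the given output directory.
empty_markdown() {
  find "$1" -type f -name '*.md' -size 0
}

# Usage:
# empty_markdown /path/to/output
```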
Extract text with all metadata in concatenated mode:
java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output --concatenate