Tika Command Line Interface

The tika-app command line interface is still in flux for 4.x. Options and behavior may change before the final release.

This section covers using Apache Tika from the command line via tika-app.

Overview

The Tika application (tika-app.jar) is a standalone command line utility that extracts text content and metadata from a wide range of file formats.

Basic Usage

java -jar tika-app.jar [option...] [file|port...]

Command Line Options

Help and Information

Option              Description
-? or --help        Display usage instructions
-v or --verbose     Enable debug-level output
-V or --version     Show version details

Operation Modes

Option              Description
-g or --gui         Launch the graphical interface
-s or --server      Start the web server
-f or --fork        Enable fork mode for isolated extraction

Output Formatting

Option              Description
-x or --xml         Output XHTML (default)
-h or --html        Output HTML
-t or --text        Output plain text
--md                Output Markdown
-m or --metadata    Output metadata only
-j or --json        Output JSON metadata
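When a script saves tika-app output to a file, it can help to derive the file extension from the output option. The function below is a minimal sketch of that idea. The name ext_for_option and the extensions themselves are assumptions made for illustration, not something tika-app defines.

```shell
# Map a tika-app output option to a file extension for the saved result.
# The extensions are a local naming convention assumed for this sketch.
ext_for_option() {
  case "$1" in
    -x|--xml)      echo xhtml ;;
    -h|--html)     echo html ;;
    -t|--text)     echo txt ;;
    --md)          echo md ;;
    -m|--metadata) echo meta ;;
    -j|--json)     echo json ;;
    *)             echo "unknown output option: $1" >&2; return 1 ;;
  esac
}
```

It could then be used as: java -jar tika-app.jar --md document.docx > "document.$(ext_for_option --md)"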

Examples

Extract text from a file

java -jar tika-app.jar --text document.pdf

Extract metadata as JSON

java -jar tika-app.jar --json document.docx
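To keep the metadata for later use, the JSON output can be redirected to a file. The sketch below does this for several files in a loop. TIKA_APP_JAR is an assumed environment variable, and run_tika and save_metadata are hypothetical helper names; none of them are part of tika-app.

```shell
# Thin wrapper so the jar location is set in one place.
# TIKA_APP_JAR is an assumed variable for this sketch; tika-app does not read it.
run_tika() {
  java -jar "${TIKA_APP_JAR:-tika-app.jar}" "$@"
}

# Write <file>.json next to each input file passed as an argument.
# On failure, report the file on stderr and remove the partial output.
save_metadata() {
  for f in "$@"; do
    run_tika --json "$f" > "$f.json" || {
      echo "failed: $f" >&2
      rm -f "$f.json"
    }
  done
}
```

For example: save_metadata document.docx report.pdf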

Pipeline processing

Extract text from a remote document and search for keywords:

curl http://example.com/document.doc | java -jar tika-app.jar --text | grep -q keyword
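Because grep -q prints nothing, the result of this pipeline is its exit status. The sketch below wraps the same pipeline in a function so a script can branch on it. TIKA_APP_JAR, run_tika, and stdin_mentions are hypothetical names introduced for illustration.

```shell
# Thin wrapper around the tika-app invocation.
# TIKA_APP_JAR is an assumed variable for this sketch; tika-app does not read it.
run_tika() {
  java -jar "${TIKA_APP_JAR:-tika-app.jar}" "$@"
}

# Read a document on stdin and exit 0 if its extracted text contains
# the pattern given as the first argument, non-zero otherwise.
stdin_mentions() {
  run_tika --text | grep -q -- "$1"
}
```

For example: curl http://example.com/document.doc | stdin_mentions keyword && echo "keyword found"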

Batch processing

Process entire directories by specifying input and output paths:

java -jar tika-app.jar -i /path/to/input -o /path/to/output

Extract Markdown from a file

java -jar tika-app.jar --md document.docx

Custom configuration

Use a custom configuration file:

java -jar tika-app.jar --config=tika-config.json document.pdf

Batch Processing (tika-async-cli)

For processing large numbers of files, use tika-async-cli. It uses the Tika Pipes architecture with forked JVM processes for fault tolerance.

Basic Batch Usage

java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output

This processes all files in the input directory and writes JSON metadata (RMETA format) to the output directory.
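After a batch run it can be useful to confirm that every input file produced an output file. The function below is a rough sketch of such a check. It assumes each output is written at the same relative path as its input with ".json" appended; that naming is an assumption here and should be verified against your own output directory. The name missing_outputs is hypothetical.

```shell
# Print the relative path of every input file that has no matching output.
# Assumes outputs mirror the input tree with ".json" appended to each name.
missing_outputs() {
  in_dir=$1
  out_dir=$2
  ( cd "$in_dir" && find . -type f ) | while IFS= read -r rel; do
    [ -f "$out_dir/$rel.json" ] || printf '%s\n' "${rel#./}"
  done
}
```

For example: missing_outputs /path/to/input /path/to/output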

Batch Options

Option                 Description
-i                     Input directory
-o                     Output directory
-h or --handlerType    Content handler type: t=text, h=html, x=xml, m=markdown, b=body, i=ignore (default: t)
--concatenate          Concatenate content from all embedded documents into a single content field
--content-only         Output only extracted content (no metadata, no JSON wrapper); implies --concatenate
-T or --timeoutMs      Timeout for each parse in milliseconds
-n or --numClients     Number of parallel forked processes
-p or --pluginsDir     Plugins directory
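These options can be combined in one invocation. The sketch below wraps a text extraction run that uses four forked processes and a 60-second timeout per parse. Both values are purely illustrative, and TIKA_ASYNC_JAR, run_async, and batch_text are hypothetical names introduced for this example.

```shell
# Thin wrapper around the tika-async-cli invocation.
# TIKA_ASYNC_JAR is an assumed variable for this sketch; the tool does not read it.
run_async() {
  java -jar "${TIKA_ASYNC_JAR:-tika-async-cli.jar}" "$@"
}

# Batch-extract text from input directory $1 into output directory $2,
# with 4 parallel forked processes and a 60000 ms timeout per parse.
batch_text() {
  run_async -i "$1" -o "$2" -h t -T 60000 -n 4
}
```

For example: batch_text /path/to/input /path/to/output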

Batch Examples

Extract Markdown content only (no metadata) from all files:

java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m --content-only

This produces .md files in the output directory that contain only the extracted Markdown content, with no JSON wrapper and no metadata fields.
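A quick sanity check after such a run is to look for .md outputs that are empty, which may indicate documents from which no content was extracted. A minimal sketch, using the hypothetical function name empty_markdown_outputs:

```shell
# List zero-byte .md files under the given output directory.
empty_markdown_outputs() {
  find "$1" -type f -name '*.md' -size 0
}
```

For example: empty_markdown_outputs /path/to/output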

Extract text with all metadata in concatenated mode:

java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output --concatenate