Getting Started with the Java API

Table of Contents

Before You Start
- Recommended: Use tika-server or tika-grpc
- When to Use the Java API
Using PipesForkParser (Recommended)
Without Pipes: Understanding the Risks

Before You Start

Before embedding Tika directly in your Java application, consider whether a client-server architecture would better suit your needs.

Recommended: Use tika-server or tika-grpc

For most use cases, we recommend running Tika as a separate service rather than embedding it directly:

- tika-server: REST API, language-agnostic
- tika-grpc: high-performance gRPC protocol

Why?

- Process isolation: parser crashes don’t affect your application
- Easier deployment: use the official Docker images
- Language flexibility: call from any language, not just Java
- Simpler upgrades: update Tika independently of your application

Docker images are available at Docker Hub.
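
As a rough illustration of the client-server route from Java, the sketch below sends a document to a running tika-server with the JDK's built-in HTTP client and reads back plain text. It assumes a server reachable at http://localhost:9998 (the tika-server default port) and uses the PUT /tika endpoint with an Accept: text/plain header; confirm both against the documentation for your server version. The class and method names are illustrative and not part of Tika.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative client; not part of Tika.
final class TikaServerClientSketch {

    // Builds a PUT request to <serverBase>/tika that asks for plain-text output.
    static HttpRequest buildExtractRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the file to the server and returns the extracted text.
    static String extractText(HttpClient client, URI serverBase, Path file)
            throws IOException, InterruptedException {
        HttpRequest request = buildExtractRequest(serverBase, Files.readAllBytes(file));
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaServerClientSketch <file>");
            return;
        }
        // Assumed address: tika-server on its default port on the same host.
        URI server = URI.create("http://localhost:9998");
        System.out.println(extractText(HttpClient.newHttpClient(), server, Path.of(args[0])));
    }
}
```

Because parsing happens in the server process, a parser crash surfaces here as an HTTP error or a failed connection rather than as a failure inside your own JVM.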

When to Use the Java API

The Java API is appropriate when you:

- Need tight integration with Tika internals
- Cannot use a network service
- Have specific customization requirements

Using PipesForkParser (Recommended)

If you must use Tika as a library, use PipesForkParser from the tika-pipes-fork-parser module. It provides process isolation to protect your application from parser crashes, memory leaks, and infinite loops.

Maven Dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-pipes-fork-parser</artifactId>
    <version>${tika.version}</version>
</dependency>

Basic Example

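import java.nio.file.Path;
import org.apache.tika.pipes.fork.PipesForkParser;
import org.apache.tika.pipes.fork.PipesForkResult;

// Wrapper class so the snippet compiles on its own; the class name is arbitrary.
public class BasicForkParseExample {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("/path/to/document.pdf");

        // try-with-resources closes the parser when the block exits.
        try (PipesForkParser parser = new PipesForkParser()) {
            // The parse itself runs in a forked JVM, not in this one.
            PipesForkResult result = parser.parse(file);

            if (result.isSuccess()) {
                String content = result.getContent();
                // process content...
            } else {
                // handle failure
            }
        }
    }
}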

Key Features

- Process isolation: parsing runs in a separate JVM
- Automatic restart: if the forked process crashes, it is restarted automatically
- Configurable timeouts: a parse that hangs or loops indefinitely is cut off instead of blocking your application
- Thread-safe: a single instance can be reused across multiple threads

Complete Examples

See PipesForkParserExample.java in the tika-example module for comprehensive examples including:

- Basic parsing
- Handling embedded documents
- Custom configuration
- Error handling
- Batch processing

Without Pipes: Understanding the Risks

If you choose not to use PipesForkParser and instead use Tika’s parsers directly (e.g., AutoDetectParser), you are responsible for handling the risks of parsing untrusted content.

Running parsers directly on untrusted data can cause OutOfMemoryErrors, infinite loops, and crashes that will affect your entire application.

Before proceeding without process isolation, read:

- The Robustness of Apache Tika: understanding parser risks and mitigations
- Apache Tika Security Model: trust boundaries and assumptions

If you still need to use parsers directly, your application is responsible for implementing its own process isolation so that you can:

- Set parse timeouts (Tika cannot enforce timeouts without process isolation)
- Configure memory limits (requires separate JVM)
- Kill runaway processes
- Recover from crashes
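
The four responsibilities above can be sketched with nothing but the JDK: start a child JVM with a heap cap, wait with a deadline, kill the child if the deadline passes, and report the outcome to the caller. This is a minimal sketch and not Tika's own mechanism. The worker class com.example.ParseWorker is hypothetical (you would write it yourself, for example as a small main method that runs AutoDetectParser and prints the text to standard output), and the heap size and timeout are placeholder values.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative sketch; not part of Tika.
final class IsolatedParseSketch {

    enum Outcome { OK, FAILED, TIMED_OUT }

    // Memory limit: the child JVM gets its own -Xmx cap, separate from the parent's heap.
    static List<String> childJvmCommand(String maxHeap, String classpath,
                                        String mainClass, Path input) {
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        return List.of(javaBin, "-Xmx" + maxHeap, "-cp", classpath, mainClass, input.toString());
    }

    // Runs the command; the child's stdout goes to outputFile, stderr is discarded
    // (a real application would log it instead).
    static Outcome runWithTimeout(List<String> command, Path outputFile, Duration timeout)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .redirectOutput(outputFile.toFile())
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Timeout: wait only as long as the caller allows.
        boolean finished = child.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS);
        if (!finished) {
            // Kill the runaway process and wait until it is really gone.
            child.destroyForcibly();
            child.waitFor();
            return Outcome.TIMED_OUT;
        }
        // Crash recovery: a crash or OutOfMemoryError in the child shows up here
        // as a non-zero exit code, and the parent keeps running.
        return child.exitValue() == 0 ? Outcome.OK : Outcome.FAILED;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: IsolatedParseSketch <classpath> <input-file>");
            return;
        }
        // com.example.ParseWorker is a hypothetical worker class you provide.
        List<String> command =
                childJvmCommand("512m", args[0], "com.example.ParseWorker", Path.of(args[1]));
        Path output = Files.createTempFile("extracted-", ".txt");
        Outcome outcome = runWithTimeout(command, output, Duration.ofSeconds(60));
        System.out.println(outcome + " -> " + output);
    }
}
```

Starting a new JVM per document is simple but slow. A long-lived worker that is restarted only after a failure amortizes the startup cost, at the price of more coordination code.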

Never run Tika in the same JVM as critical infrastructure.