Getting Started with the Java API

Before You Start

Before embedding Tika directly in your Java application, consider whether a client-server architecture would better suit your needs.

For most use cases, we recommend running Tika as a separate service rather than embedding it directly.

Why?

  • Process isolation - Parser crashes don’t affect your application

  • Easier deployment - Use official Docker images

  • Language flexibility - Call from any language, not just Java

  • Simpler upgrades - Update Tika independently of your application

Official Docker images are available on Docker Hub.
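As a rough illustration of the client-server approach, the sketch below sends a document to a running Tika server over HTTP and prints the extracted text. It assumes a default tika-server setup listening on http://localhost:9998 and exposing the PUT /tika endpoint; adjust both for your deployment. The class name TikaServerClient and the method buildExtractRequest are hypothetical names chosen for this example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

public class TikaServerClient {

    // Builds a PUT request to <serverBase>/tika that asks for plain-text output.
    // The endpoint path is an assumption about a default tika-server setup.
    public static HttpRequest buildExtractRequest(URI serverBase, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .timeout(Duration.ofSeconds(60)) // client-side limit on waiting for the response
                .PUT(body)
                .build();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        // Assumed default address; change it to match where your server runs.
        URI server = URI.create("http://localhost:9998");

        HttpRequest request = buildExtractRequest(server, HttpRequest.BodyPublishers.ofFile(file));
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 200) {
            System.out.println(response.body());
        } else {
            System.err.println("Tika server returned HTTP " + response.statusCode());
        }
    }
}
```

Because the parsing happens in the server process, a parser crash surfaces here as a failed HTTP call rather than as a failure inside your own JVM.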

When to Use the Java API

The Java API is appropriate when you:

  • Need tight integration with Tika internals

  • Cannot use a network service

  • Have specific customization requirements

If you must use Tika as a library, use PipesForkParser from the tika-pipes-fork-parser module. It provides process isolation to protect your application from parser crashes, memory leaks, and infinite loops.

Maven Dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-pipes-fork-parser</artifactId>
    <version>${tika.version}</version>
</dependency>

Basic Example

import java.nio.file.Path;
import org.apache.tika.pipes.fork.PipesForkParser;
import org.apache.tika.pipes.fork.PipesForkResult;

public class BasicExample {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("/path/to/document.pdf");

        // try-with-resources closes the parser once parsing is done
        try (PipesForkParser parser = new PipesForkParser()) {
            // Parsing runs in a separate, forked JVM
            PipesForkResult result = parser.parse(file);

            if (result.isSuccess()) {
                String content = result.getContent();
                // process content...
            } else {
                // handle failure
            }
        }
    }
}

Key Features

  • Process isolation - Parsing runs in a separate JVM

  • Automatic restart - If the forked process crashes, it restarts automatically

  • Configurable timeouts - Terminate parses that hang or loop indefinitely

  • Thread-safe - Reuse across multiple threads
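The thread-safety point suggests that one parser instance can be shared by a pool of worker threads instead of creating a parser per task. The sketch below shows that pattern only. To stay runnable without the Tika dependency, a Function<Path, String> stands in for the shared parser; the names SharedParserPool and parseAll are hypothetical and not part of Tika.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class SharedParserPool {

    // Submits every file to a fixed-size pool. All tasks call the same
    // sharedParser instance, which must therefore be thread-safe.
    public static List<String> parseAll(Function<Path, String> sharedParser,
                                        List<Path> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(CompletableFuture.supplyAsync(() -> sharedParser.apply(file), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join()); // results come back in input order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real parser call; replace with your shared parser.
        Function<Path, String> standIn = path -> "parsed:" + path.getFileName();
        List<String> out = parseAll(standIn, List.of(Path.of("a.pdf"), Path.of("b.docx")), 2);
        System.out.println(out); // prints [parsed:a.pdf, parsed:b.docx]
    }
}
```

In a real application the lambda would delegate to a single long-lived parser that is created once at startup and closed at shutdown.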

Complete Examples

See PipesForkParserExample.java in the tika-example module for comprehensive examples including:

  • Basic parsing

  • Handling embedded documents

  • Custom configuration

  • Error handling

  • Batch processing

Without Pipes: Understanding the Risks

If you choose not to use PipesForkParser and instead use Tika’s parsers directly (e.g., AutoDetectParser), you are responsible for handling the risks of parsing untrusted content.

Running parsers directly on untrusted data can cause OutOfMemoryErrors, infinite loops, and crashes that will affect your entire application.

Before proceeding without process isolation, read:

If you still need to use parsers directly, your application is responsible for implementing its own process isolation so that you can:

  • Set parse timeouts (Tika cannot enforce timeouts without process isolation)

  • Configure memory limits (requires separate JVM)

  • Kill runaway processes

  • Recover from crashes

Never run Tika in the same JVM as critical infrastructure.
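The responsibilities listed above (timeouts, memory limits, killing runaway processes, crash recovery) can be sketched with the standard ProcessBuilder API. This is a minimal illustration of do-it-yourself isolation, not how PipesForkParser is implemented. The child main class com.example.ParseOneFile is hypothetical: it stands for a small program of your own that parses one file, writes its output, and exits. IsolatedParseRunner, buildCommand, and run are likewise names invented for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class IsolatedParseRunner {

    public enum Outcome { SUCCESS, CRASHED, TIMED_OUT }

    // Builds the command line for a child JVM with its own heap limit.
    public static List<String> buildCommand(String mainClass, String classpath,
                                            int maxHeapMb, Path input) {
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        return List.of(javaBin, "-Xmx" + maxHeapMb + "m", "-cp", classpath,
                mainClass, input.toString());
    }

    // Runs the child, enforcing a wall-clock timeout. Output goes to a file so
    // the child cannot block on a full pipe buffer.
    public static Outcome run(List<String> command, long timeoutSeconds, File outputFile)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .redirectOutput(outputFile)
                .start();

        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly(); // kill the runaway process
            child.waitFor();         // wait until it is actually gone
            return Outcome.TIMED_OUT;
        }
        // A non-zero exit code covers OutOfMemoryError exits and other crashes.
        return child.exitValue() == 0 ? Outcome.SUCCESS : Outcome.CRASHED;
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildCommand("com.example.ParseOneFile", // hypothetical class
                System.getProperty("java.class.path"), 512, Path.of(args[0]));
        Outcome outcome = run(command, 60, new File("parse-output.txt"));
        System.out.println(outcome);
    }
}
```

The parent survives every outcome, so it can log the failure, skip or retry the file, and move on to the next document.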