Getting Started with the Java API
Before You Start
Before embedding Tika directly in your Java application, consider whether a client-server architecture would better suit your needs.
Recommended: Use tika-server or tika-grpc
For most use cases, we recommend running Tika as a separate service rather than embedding it directly:
- tika-server - REST API, language-agnostic
- tika-grpc - High-performance gRPC protocol
Why?
- Process isolation - Parser crashes don’t affect your application
- Easier deployment - Use official Docker images
- Language flexibility - Call from any language, not just Java
- Simpler upgrades - Update Tika independently of your application
Docker images are available at Docker Hub.
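As a rough illustration of the client-server approach, the sketch below sends a file to a running tika-server over HTTP using the JDK's built-in client. It assumes the server listens at its default address (http://localhost:9998) and exposes a /tika endpoint that accepts a PUT with the document body and returns plain text when asked via the Accept header; check the tika-server documentation for your version. The class and method names (TikaServerClient, buildExtractRequest, extractText) are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper; not part of Tika.
class TikaServerClient {

    // Builds a PUT request that uploads the file to the assumed /tika endpoint.
    static HttpRequest buildExtractRequest(URI serverBase, Path file) throws IOException {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .timeout(Duration.ofSeconds(60))          // client-side request timeout
                .header("Accept", "text/plain")           // ask for extracted text
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    // Sends the request and returns the response body, failing on any non-200 status.
    static String extractText(HttpClient client, URI serverBase, Path file)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildExtractRequest(serverBase, file),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A call might look like `TikaServerClient.extractText(HttpClient.newHttpClient(), URI.create("http://localhost:9998"), Path.of("/path/to/document.pdf"))`. Because parsing happens in the server process, a parser crash surfaces to the caller as an HTTP error or timeout rather than taking down the application.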
Using PipesForkParser (Recommended)
If you must use Tika as a library, use PipesForkParser from the
tika-pipes-fork-parser module. It provides process isolation to protect your
application from parser crashes, memory leaks, and infinite loops.
Maven Dependency
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-pipes-fork-parser</artifactId>
    <version>${tika.version}</version>
</dependency>
Basic Example
import java.nio.file.Path;
import org.apache.tika.pipes.fork.PipesForkParser;
import org.apache.tika.pipes.fork.PipesForkResult;

Path file = Path.of("/path/to/document.pdf");
try (PipesForkParser parser = new PipesForkParser()) {
    PipesForkResult result = parser.parse(file);
    if (result.isSuccess()) {
        String content = result.getContent();
        // process content...
    } else {
        // handle failure
    }
}
Key Features
- Process isolation - Parsing runs in a separate JVM
- Automatic restart - If the forked process crashes, it restarts automatically
- Configurable timeouts - Stop parses that hang or get stuck in infinite loops
- Thread-safe - Reuse a single instance across multiple threads
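To make the thread-safety point concrete, here is a minimal sketch of fanning a batch of files out over a thread pool while every task calls the same shared parse function. It uses only JDK classes; the `parseFn` parameter is a stand-in for whatever wraps your shared parser, and the class name SharedParserPool is hypothetical.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; not part of Tika.
class SharedParserPool {

    // Applies one shared parse function to every file on a fixed-size pool
    // and returns the results in the same order as the input list.
    static List<String> parseAll(List<Path> files, Function<Path, String> parseFn, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Path file : files) {
                // Every task calls the same parseFn instance concurrently.
                futures.add(pool.submit(() -> parseFn.apply(file)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that file is done
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting tasks; submitted ones have already finished
        }
    }
}
```

With a single PipesForkParser created as in the Basic Example, `parseFn` could be a lambda that calls `parser.parse(file)` and returns `result.getContent()`. If `parse` declares checked exceptions, the lambda would need to catch and wrap them.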
Complete Examples
See PipesForkParserExample.java in the tika-example module for comprehensive examples including:
- Basic parsing
- Handling embedded documents
- Custom configuration
- Error handling
- Batch processing
Without Pipes: Understanding the Risks
If you choose not to use PipesForkParser and instead use Tika’s parsers directly
(e.g., AutoDetectParser), you are responsible for handling the risks of parsing
untrusted content.
Warning: Running parsers directly on untrusted data can cause OutOfMemoryErrors, infinite loops, and crashes that will affect your entire application.
Before proceeding without process isolation, read:
- The Robustness of Apache Tika - Understanding parser risks and mitigations
- Apache Tika Security Model - Trust boundaries and assumptions
If you still need to use parsers directly, your application is responsible for implementing its own process isolation so that you can:
- Set parse timeouts (Tika cannot enforce timeouts without process isolation)
- Configure memory limits (requires separate JVM)
- Kill runaway processes
- Recover from crashes
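The four responsibilities above can be sketched with nothing but the JDK's process API. This is a simplified illustration, not a production design and not how PipesForkParser works internally. The class name IsolatedParseRunner is hypothetical, as is the worker main class in the usage note, which you would have to write yourself.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of Tika.
class IsolatedParseRunner {

    // Builds a command line for a child JVM whose heap is capped with -Xmx.
    static List<String> childJvmCommand(int maxHeapMb, String classpath,
                                        String mainClass, String... args) {
        List<String> command = new ArrayList<>();
        command.add(Path.of(System.getProperty("java.home"), "bin", "java").toString());
        command.add("-Xmx" + maxHeapMb + "m");   // memory limit for the child only
        command.add("-cp");
        command.add(classpath);
        command.add(mainClass);
        command.addAll(List.of(args));
        return command;
    }

    // Runs the command, waits up to timeoutSeconds, and returns the exit code.
    // On timeout the child is killed and a TimeoutException is thrown.
    static int run(List<String> command, File outputFile, long timeoutSeconds)
            throws IOException, InterruptedException, TimeoutException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        builder.redirectOutput(outputFile);      // a file cannot fill up and block the child like a pipe
        Process process = builder.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();           // kill the runaway process
            process.waitFor();                   // reap it before reporting
            throw new TimeoutException("child exceeded " + timeoutSeconds + "s");
        }
        return process.exitValue();              // non-zero means the child failed or crashed
    }
}
```

A caller might build the command with `childJvmCommand(512, "app.jar", "com.example.ParseWorker", "/path/to/document.pdf")`, where `com.example.ParseWorker` is a hypothetical main class that parses one file and writes the result to standard output. The caller then treats a TimeoutException or a non-zero exit code as a failed parse and carries on with the next file. Getting restarts, result transport, and error reporting right on top of this takes real effort, which is the main argument for using PipesForkParser instead.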
Never run Tika in the same JVM as critical infrastructure.