Tess4J OCR Parser
The Tess4JParser is an OCR parser that calls the Tesseract native library
in-process via Tess4J and JNA, rather
than spawning a tesseract child process for every image. This eliminates
per-file process-spawn overhead and can be significantly faster when
processing large batches of images.
Because the native Tesseract handle is not thread-safe, the parser
maintains a configurable pool of Tesseract instances. Multiple threads
borrow from the pool and return instances when done, so the parser is safe
for concurrent use.
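The borrow-and-return pattern described above can be sketched as follows. This is a minimal illustration, not the parser's actual implementation: `EnginePool`, `withEngine`, and `idleCount` are hypothetical names, and the type parameter `E` stands in for a native Tesseract handle.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; E stands in for a handle that is not thread-safe. */
final class EnginePool<E> {
    private final BlockingQueue<E> idle;

    EnginePool(int poolSize, Supplier<E> factory) {
        idle = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            idle.add(factory.get()); // create every instance up front
        }
    }

    /** Borrows one engine, runs the work on it, and always returns it to the pool. */
    <R> R withEngine(long timeoutSeconds, Function<E, R> work) {
        E engine;
        try {
            // Wait up to the timeout for another thread to return an engine.
            engine = idle.poll(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an engine", e);
        }
        if (engine == null) {
            throw new IllegalStateException("no engine became available in time");
        }
        try {
            return work.apply(engine); // only this thread touches the engine here
        } finally {
            idle.add(engine); // return the engine even if the work threw
        }
    }

    int idleCount() {
        return idle.size();
    }
}
```

Because each engine is held by at most one thread at a time, the handles themselves never need to be thread-safe.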
Warning: This parser loads native C/C++ libraries (Tesseract, Leptonica) into the JVM via JNA. A segfault in the native code will crash your entire JVM, and a native memory leak can eventually do the same. You should run this parser in a forked child process using tika-pipes, ideally inside a Docker container. Do not load it into a long-lived application server process unless you are comfortable with the risk.
Module dependency
The parser lives in the tika-parser-tess4j-module artifact:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parser-tess4j-module</artifactId>
<version>${tika.version}</version>
</dependency>
Prerequisites
You must have the Tesseract and Leptonica shared libraries installed on the machine where the parser runs. The tess4j jar bundles Windows DLLs only; on macOS and Linux you are responsible for installing the native libraries yourself.
- Debian / Ubuntu: apt-get install libtesseract-dev libleptonica-dev tesseract-ocr-eng
- RHEL / Fedora: dnf install tesseract-devel leptonica-devel tesseract-langpack-eng
- macOS (Homebrew): brew install tesseract
You also need the tessdata language files. The dataPath configuration
option must point to the directory containing them (e.g.,
/usr/share/tesseract-ocr/5/tessdata).
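A quick preflight check can catch a wrong dataPath before the first OCR call. The sketch below is an illustration only: `TessdataCheck` and `missingLanguages` are hypothetical names, and the check relies on Tesseract's convention of naming language files `<language>.traineddata`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Tika or Tess4J. */
final class TessdataCheck {
    /** Returns the language codes that have no matching .traineddata file in dataPath. */
    static List<String> missingLanguages(Path dataPath, String... languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages) {
            // Tesseract language data is stored as <lang>.traineddata
            if (!Files.isRegularFile(dataPath.resolve(lang + ".traineddata"))) {
                missing.add(lang);
            }
        }
        return missing;
    }
}
```

For example, calling `TessdataCheck.missingLanguages(Path.of("/usr/share/tesseract-ocr/5/tessdata"), "eng")` returns an empty list when the English data file is present in that directory.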
Native library path (jna.library.path)
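JNA must be able to find the Tesseract and Leptonica shared libraries. You can point it at them with the jna.library.path system property or with the nativeLibPath configuration option.
You are on your own here. The correct path depends entirely on your OS, distribution, and how you installed Tesseract. The configuration examples on this page use /usr/lib/x86_64-linux-gnu.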
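One way to find a suitable directory is to probe a list of candidates and take the first one that contains a Tesseract library file. This is a rough sketch under assumptions: `NativeLibLocator` is a hypothetical helper, the candidate list is yours to supply, and matching on the `libtesseract` file-name prefix is a heuristic that covers names such as libtesseract.so and libtesseract.dylib.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical helper; not part of Tika, Tess4J, or JNA. */
final class NativeLibLocator {
    /** Returns the first candidate directory that holds a file named libtesseract*. */
    static Optional<Path> firstDirWithTesseract(List<Path> candidates) {
        for (Path dir : candidates) {
            try (Stream<Path> entries = Files.list(dir)) {
                boolean found = entries.anyMatch(
                        p -> p.getFileName().toString().startsWith("libtesseract"));
                if (found) {
                    return Optional.of(dir);
                }
            } catch (IOException | UncheckedIOException e) {
                // Missing or unreadable directory: move on to the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

The result could then be passed to `System.setProperty("jna.library.path", dir.toString())`, which needs to happen before JNA first loads the library, or copied into the nativeLibPath option.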
Basic Configuration
{
"parsers": [
{
"name": "tess4j-parser",
"dataPath": "/usr/share/tesseract-ocr/5/tessdata",
"nativeLibPath": "/usr/lib/x86_64-linux-gnu",
"poolSize": 4
}
]
}
Full Configuration
{
"parsers": [
{
"name": "tess4j-parser",
"dataPath": "/usr/share/tesseract-ocr/5/tessdata",
"nativeLibPath": "/usr/lib/x86_64-linux-gnu",
"language": "eng",
"pageSegMode": 1,
"ocrEngineMode": 3,
"poolSize": 4,
"timeoutSeconds": 120,
"dpi": 300,
"minFileSizeToOcr": 0,
"maxFileSizeToOcr": 2147483647,
"skipOcr": false
}
]
}
Configuration options reference
| Property | Default | Description |
|---|---|---|
| dataPath | | Path to the tessdata directory containing language data files. Required on macOS and Linux. |
| nativeLibPath | | Path to the directory containing the native Tesseract and Leptonica shared libraries. |
| language | | Tesseract language(s) to use. More than one language may be given. |
| pageSegMode | | Page segmentation mode (0-13). 1 = automatic with OSD. |
| ocrEngineMode | | OCR engine mode. 0 = legacy, 1 = LSTM only, 2 = legacy + LSTM, 3 = default (whatever is available). |
| poolSize | | Number of Tesseract instances in the pool. |
| timeoutSeconds | | Maximum time (seconds) to wait for a pooled Tesseract instance. |
| dpi | | DPI for image rendering. |
| minFileSizeToOcr | | Minimum input file size in bytes. Smaller files are skipped. |
| maxFileSizeToOcr | | Maximum input file size in bytes. Larger files are skipped. |
| skipOcr | | Runtime kill-switch to disable the parser entirely. |
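The interaction of skipOcr with the two size limits can be read as a simple gate, sketched below. `OcrGate` and `shouldOcr` are hypothetical names, and treating both limits as inclusive is an assumption drawn from the wording "smaller files are skipped" and "larger files are skipped", not a statement about the parser's code.

```java
/** Hypothetical illustration of the size and kill-switch checks. */
final class OcrGate {
    static boolean shouldOcr(long fileSizeBytes,
                             long minFileSizeToOcr,
                             long maxFileSizeToOcr,
                             boolean skipOcr) {
        if (skipOcr) {
            return false; // kill-switch wins over everything else
        }
        // Files strictly smaller than the minimum or strictly larger than the maximum are skipped.
        return fileSizeBytes >= minFileSizeToOcr && fileSizeBytes <= maxFileSizeToOcr;
    }
}
```

With the values from the full configuration above (0 and 2147483647), every file up to 2147483647 bytes passes the gate unless skipOcr is true.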
Recommended: Docker + tika-pipes
Because this parser loads native code into the JVM, the safest deployment is a Docker container running tika-pipes with forked child processes. If the native code crashes, only the child process dies — tika-pipes will respawn it automatically.
A minimal Dockerfile:
FROM eclipse-temurin:21-jre
RUN apt-get update && \
apt-get install -y --no-install-recommends \
libtesseract-dev \
libleptonica-dev \
tesseract-ocr-eng && \
rm -rf /var/lib/apt/lists/*
# Copy your tika-pipes application and config
COPY target/tika-pipes-app.jar /app/tika-pipes-app.jar
COPY tika-config.json /app/tika-config.json
WORKDIR /app
ENTRYPOINT ["java", "-jar", "tika-pipes-app.jar"]
With the following parser configuration:
{
"parsers": [
{
"name": "tess4j-parser",
"dataPath": "/usr/share/tesseract-ocr/5/tessdata",
"nativeLibPath": "/usr/lib/x86_64-linux-gnu",
"poolSize": 4
}
]
}
Set poolSize equal to the number of forked parser threads to
maximize throughput without over-allocating native memory.
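The reasoning behind this sizing rule can be demonstrated with a small simulation. It is only a sketch: `PoolSizingDemo` is a hypothetical class, a `Semaphore` stands in for the pool of Tesseract instances, and a short sleep stands in for the OCR call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical simulation; permits stand in for pooled Tesseract instances. */
final class PoolSizingDemo {
    /** Runs the jobs and returns how many of them found the pool empty on arrival. */
    static int countJobsThatFoundPoolEmpty(int threads, int poolSize, int jobs) {
        Semaphore instances = new Semaphore(poolSize);
        AtomicInteger foundEmpty = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < jobs; i++) {
            workers.execute(() -> {
                if (!instances.tryAcquire()) {
                    foundEmpty.incrementAndGet(); // a real parser would wait here
                    return;
                }
                try {
                    Thread.sleep(2); // stands in for the OCR call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    instances.release();
                }
            });
        }
        workers.shutdown();
        try {
            if (!workers.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return foundEmpty.get();
    }
}
```

When the number of threads equals poolSize, no job ever finds the pool empty, because each thread holds at most one instance at a time. A pool larger than the thread count would leave instances idle while still holding native memory.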
Tess4J vs. TesseractOCRParser
| Aspect | TesseractOCRParser | Tess4JParser |
|---|---|---|
| How it calls Tesseract | Spawns a new tesseract child process for every image | Calls the native library in-process via JNA |
| Startup overhead | Process fork + exec per file | One-time JNA initialization; pooled thereafter |
| Thread safety | Naturally safe (separate processes) | Safe via pooled instances |
| Crash isolation | Child process crashes do not affect the JVM | A native crash will take down the JVM |
| Dependencies | The tesseract executable | Tesseract and Leptonica shared libraries, loaded via JNA |
| Best for | Safety-first deployments, light OCR workloads | High-throughput batch processing in Docker / tika-pipes |
Per-request configuration
Override configuration for a single parse call by placing a Tess4JConfig
on the ParseContext:
Tess4JConfig override = new Tess4JConfig();
override.setLanguage("fra");
override.setPageSegMode(6);
ParseContext context = new ParseContext();
context.set(Tess4JConfig.class, override);
Note: dataPath and nativeLibPath cannot be changed at parse time
(they are locked at parser initialization). Attempting to set them in a
runtime config will throw TikaConfigException.
@since Apache Tika 4.0