Timeouts
Overview
Tika Pipes uses a two-tier timeout system to handle both long-running tasks and hung parsers:
- progressTimeoutMillis: Maximum time between progress updates. If no progress is reported within this interval, the task is considered stalled and killed. Default: 60000 (1 minute).
- totalTaskTimeoutMillis: Maximum wall-clock time for an entire task. Even if the parser is making progress, the task is killed after this time. Default: 3600000 (1 hour).
Parsers that never report progress effectively get progressTimeoutMillis as their total timeout.
Parsers that do report progress (e.g., OCR processing multiple pages) can run up to totalTaskTimeoutMillis.
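The interaction of the two limits can be summarized as a small calculation. The sketch below is illustrative only: the class and method names are hypothetical and not part of Tika, and it assumes that a progress-reporting parser updates often enough that the progress limit never fires.

```java
class EffectiveTimeoutSketch {
    /**
     * Upper bound, in milliseconds, on how long a task may run.
     * Assumes a progress-reporting parser updates often enough that the
     * progress limit never fires, so only the total limit applies to it.
     */
    public static long maxRuntimeMillis(boolean reportsProgress,
                                        long progressTimeoutMillis,
                                        long totalTaskTimeoutMillis) {
        if (reportsProgress) {
            return totalTaskTimeoutMillis;
        }
        // With no updates, whichever limit expires first ends the task.
        return Math.min(progressTimeoutMillis, totalTaskTimeoutMillis);
    }

    public static void main(String[] args) {
        // Documented defaults: 60000 ms progress, 3600000 ms total.
        System.out.println(maxRuntimeMillis(false, 60_000L, 3_600_000L)); // 60000
        System.out.println(maxRuntimeMillis(true, 60_000L, 3_600_000L));  // 3600000
    }
}
```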
Configuration
Timeouts are configured via TimeoutLimits, under the timeout-limits key in the parse-context section of your JSON configuration:
{
  "parse-context": {
    "timeout-limits": {
      "totalTaskTimeoutMillis": 3600000,
      "progressTimeoutMillis": 60000
    }
  }
}
This can be combined with other configuration sections, such as pipes:
{
  "pipes": {
    "numClients": 4,
    "forkedJvmArgs": ["-Xmx1g"]
  },
  "parse-context": {
    "timeout-limits": {
      "totalTaskTimeoutMillis": 7200000,
      "progressTimeoutMillis": 120000
    }
  }
}
Per-Request Overrides
When using Tika Server with enableUnsecureFeatures: true, timeouts can be overridden per-request
by including TimeoutLimits in the ParseContext of a FetchEmitTuple:
// Override only the progress timeout (300000 ms = 5 minutes) for this request.
ParseContext parseContext = new ParseContext();
parseContext.setJsonConfig("timeout-limits",
        "{\"progressTimeoutMillis\": 300000}");

FetchEmitTuple t = new FetchEmitTuple("id",
        new FetchKey("my-fetcher", "large-document.pdf"),
        new EmitKey("my-emitter", "output-key"),
        parseContext);
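Raw millisecond values are easy to mistype. As a minimal sketch using only the JDK, the override JSON can be built from java.time.Duration values. The class and helper names here are hypothetical; the resulting string would be passed as the second argument to setJsonConfig, as in the snippet above.

```java
import java.time.Duration;

class TimeoutJsonSketch {
    /**
     * Builds the JSON body for the timeout-limits key.
     * The two field names are the ones documented in this section.
     */
    public static String timeoutLimitsJson(Duration progress, Duration total) {
        return "{\"totalTaskTimeoutMillis\": " + total.toMillis()
                + ", \"progressTimeoutMillis\": " + progress.toMillis() + "}";
    }

    public static void main(String[] args) {
        // 5 minutes between progress updates, 2 hours overall.
        System.out.println(timeoutLimitsJson(Duration.ofMinutes(5), Duration.ofHours(2)));
    }
}
```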
How It Works
Progress Tracking
When a task starts, the server creates a TikaProgressTracker and places it in the ParseContext.
Parsers that perform long-running external operations (OCR, VLM inference, etc.) call
TikaProgressTracker.update(context) after completing each unit of work:
// In a parser after completing an external process:
TikaProgressTracker.update(parseContext);
The server’s monitoring loop checks both timeouts on every heartbeat:
- Has totalTaskTimeoutMillis elapsed since the task started? → TIMEOUT
- Has progressTimeoutMillis elapsed since the last progress update? → TIMEOUT
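The two checks above can be sketched in plain Java. This is not Tika's implementation: the Tracker class is a hypothetical stand-in for the progress tracker, timestamps are passed in explicitly so the logic is easy to follow, and treating a limit as expired when the elapsed time is greater than or equal to it is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

class TimeoutMonitorSketch {
    public enum Status { RUNNING, TOTAL_TIMEOUT, PROGRESS_TIMEOUT }

    /** Hypothetical stand-in for a progress tracker: remembers the last update time. */
    public static final class Tracker {
        private final AtomicLong lastProgressMillis;

        public Tracker(long startMillis) {
            // Until the first update, the task start counts as the last progress.
            this.lastProgressMillis = new AtomicLong(startMillis);
        }

        public void update(long nowMillis) {
            lastProgressMillis.set(nowMillis);
        }

        public long lastProgress() {
            return lastProgressMillis.get();
        }
    }

    /** One heartbeat: check the total limit first, then the progress limit. */
    public static Status check(long nowMillis, long startMillis, Tracker tracker,
                               long totalTaskTimeoutMillis, long progressTimeoutMillis) {
        if (nowMillis - startMillis >= totalTaskTimeoutMillis) {
            return Status.TOTAL_TIMEOUT;
        }
        if (nowMillis - tracker.lastProgress() >= progressTimeoutMillis) {
            return Status.PROGRESS_TIMEOUT;
        }
        return Status.RUNNING;
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker(0L);
        // 90 s with no update exceeds the 60 s progress limit.
        System.out.println(check(90_000L, 0L, tracker, 3_600_000L, 60_000L));
        // An update at 80 s resets the progress clock.
        tracker.update(80_000L);
        System.out.println(check(90_000L, 0L, tracker, 3_600_000L, 60_000L));
    }
}
```

Using an AtomicLong reflects that the parser thread writes the timestamp while a separate monitoring thread reads it.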
Which Parsers Report Progress?
The following parsers call TikaProgressTracker.update() after each external operation:
- TesseractOCRParser: after each OCR invocation
- ExternalParser: after each external process completes
- GDALParser: after GDAL processing
- Tess4JParser: after each in-process OCR operation
- VLM parsers (OllamaParser, ClaudeParser, etc.): after each API call
- OpenAIImageEmbeddingParser: after each embedding call
- StringsParser: after the strings command completes
Parsers that don’t report progress (most built-in parsers) are bounded by progressTimeoutMillis alone.
Example: OCR Pipeline
For a pipeline processing scanned PDFs with hundreds of pages:
{
  "parse-context": {
    "timeout-limits": {
      "totalTaskTimeoutMillis": 7200000,
      "progressTimeoutMillis": 300000
    }
  }
}
This allows up to 2 hours total per document, but kills the task if any single OCR page takes longer than 5 minutes. A 200-page document where each page takes 30 seconds of OCR will complete successfully (~100 minutes total), while a document stuck on a single page will be killed after 5 minutes.
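The arithmetic in this example can be checked with a small simulation. The sketch below is a simplification with hypothetical names: it assumes one progress update after each page and ignores heartbeat granularity, so it only shows whether a workload fits within the two limits.

```java
import java.util.Arrays;

class OcrBudgetSketch {
    /**
     * Returns true if a task whose units of work take the given durations (ms),
     * with one progress update after each unit, finishes within both limits.
     */
    public static boolean completes(long[] unitMillis,
                                    long totalTaskTimeoutMillis,
                                    long progressTimeoutMillis) {
        long elapsed = 0;
        for (long unit : unitMillis) {
            if (unit >= progressTimeoutMillis) {
                return false; // no update arrives in time: progress timeout
            }
            elapsed += unit;
            if (elapsed >= totalTaskTimeoutMillis) {
                return false; // whole task ran too long: total timeout
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] pages = new long[200];
        Arrays.fill(pages, 30_000L); // 200 pages x 30 s = 6,000,000 ms (100 minutes)
        System.out.println(completes(pages, 7_200_000L, 300_000L)); // true

        pages[57] = 600_000L; // one page hangs for 10 minutes
        System.out.println(completes(pages, 7_200_000L, 300_000L)); // false
    }
}
```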
Example: Quick Batch Processing
For processing many small documents where you want fast failure:
{
  "parse-context": {
    "timeout-limits": {
      "totalTaskTimeoutMillis": 30000,
      "progressTimeoutMillis": 10000
    }
  }
}
CLI Usage
When using tika-app with --fork, the --fork-timeout flag sets progressTimeoutMillis:
java -jar tika-app.jar --fork --fork-timeout=120000 -i /input -o /output
Living Code Reference
- TimeoutLimits.java: Configuration class with defaults and helper methods
- TikaProgressTracker.java: Progress tracking for parsers
- TikaProgressTrackerTest.java: Unit tests for the progress tracker
- TimeoutLimitsTest.java: Unit tests for TimeoutLimits serialization
- PipesClientTest.java: Integration tests including timeout behavior