Migrating Tika Server to 4.x
Overview
Tika Server 4.x introduces pipes-based parsing, which provides process isolation for all parsing operations. This improves stability and resource management, but it also brings some breaking changes.
New /tika Endpoint Structure
The /tika endpoint has been simplified with path-based routing:
| Method | Path | Config? | Output |
|---|---|---|---|
| PUT | | - | raw XHTML |
| PUT | | - | raw text |
| PUT | | - | raw HTML |
| PUT | | - | raw XML |
| PUT | | - | JSON (text handler) |
| PUT | | - | JSON with specified handler (text, html, xml) |
| POST | | YES | raw output (multipart with optional config) |
| POST | | YES | JSON output (multipart with optional config) |
Using PUT endpoints (simple)
# Get plain text
curl -T document.pdf http://localhost:9998/tika/text
# Get JSON with metadata and text
curl -T document.pdf http://localhost:9998/tika/json
# Get JSON with HTML content
curl -T document.pdf http://localhost:9998/tika/json/html
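The same PUT calls can also be issued from client code. The following is a minimal Python sketch using only the standard library; the helper name `build_put_request` is hypothetical, and the base URL is the one used in the curl examples above. It builds the request without sending it. The explicit `application/octet-stream` content type is an assumption, added because `urllib` would otherwise label the body as form-encoded data.

```python
import urllib.request


def build_put_request(base_url: str, path: str, document: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    return urllib.request.Request(
        base_url + path,
        data=document,
        method="PUT",
        # Assumption: a generic binary type; without this header urllib
        # defaults to application/x-www-form-urlencoded for requests with data.
        headers={"Content-Type": "application/octet-stream"},
    )


# Example: ask for JSON with metadata and text, as in the second curl call above.
request = build_put_request("http://localhost:9998", "/tika/json", b"%PDF-1.7 ...")

# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         body = response.read()
```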
Using POST endpoints (with configuration)
POST endpoints accept multipart requests with a file part and optional config part:
# Parse with custom PDF parser settings
curl -X POST http://localhost:9998/tika/json \
  -F "file=@document.pdf" \
  -F "config={\"pdf-parser\":{\"ocrStrategy\":\"no_ocr\"}};type=application/json"
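For clients that do not shell out to curl, the multipart request above can be assembled by hand. The following Python sketch uses only the standard library and mirrors the curl call: a `file` part and a `config` part with type `application/json`. The function name `build_parse_request` is hypothetical, and the `application/octet-stream` type on the file part is an assumption rather than something the server is known to require. The request is built but not sent.

```python
import json
import urllib.request
import uuid


def build_parse_request(base_url, filename, file_bytes, config):
    """Build (but do not send) a multipart POST to /tika/json."""
    boundary = uuid.uuid4().hex

    # "file" part: the document to parse.
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + file_bytes + b"\r\n"

    # "config" part: per-request settings, serialized as JSON.
    config_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="config"\r\n'
        "Content-Type: application/json\r\n\r\n"
    ).encode() + json.dumps(config).encode() + b"\r\n"

    closing = f"--{boundary}--\r\n".encode()

    return urllib.request.Request(
        base_url + "/tika/json",
        data=file_part + config_part + closing,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the result to `urllib.request.urlopen` sends it; omitting the config part corresponds to a request without per-request settings.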
Breaking Changes
Removed Endpoints
/tika/main and /tika/form/main (Boilerpipe)
The Boilerpipe content extraction endpoints have been removed. These endpoints used BoilerpipeContentHandler, which is not compatible with pipes-based parsing.
Migration: Use /tika/text for plain text extraction.
Accept Header Routing Removed
The /tika endpoint no longer routes based on Accept headers. Use explicit paths instead:
- Accept: text/plain → use /tika/text
- Accept: text/html → use /tika/html
- Accept: application/json → use /tika/json
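Clients that previously relied on Accept headers can apply this mapping mechanically. The following is a minimal Python sketch; the helper name `path_for_accept` is hypothetical and not part of Tika, and it covers only the three mappings listed above.

```python
# Hypothetical client-side helper (not part of Tika): translate the Accept
# header an older client used to send into the explicit 4.x path.
ACCEPT_TO_PATH = {
    "text/plain": "/tika/text",
    "text/html": "/tika/html",
    "application/json": "/tika/json",
}


def path_for_accept(accept: str) -> str:
    """Return the explicit /tika path for a legacy Accept header value."""
    # Drop parameters such as ";q=0.9" and normalize case before the lookup.
    media_type = accept.split(";", 1)[0].strip().lower()
    if media_type not in ACCEPT_TO_PATH:
        raise ValueError(f"no explicit /tika path known for Accept: {accept!r}")
    return ACCEPT_TO_PATH[media_type]
```

Raising on an unknown media type, rather than falling back to a default path, makes unmigrated call sites fail visibly.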
Configuration Changes
Required: Pipes Configuration
All tika-server configurations must now include a pipes section and a file-system-fetcher:
{
  "fetchers": {
    "file-system-fetcher": {
      "file-system-fetcher": {
        "allowAbsolutePaths": true
      }
    }
  },
  "pipes": {
    "numClients": 2,
    "timeoutMillis": 30000
  },
  "plugin-roots": "path/to/plugins"
}
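A quick pre-flight check can catch a configuration that lacks the required sections before the server is started. The Python sketch below assumes the top-level key names shown in the example above (`pipes`, `fetchers`, and a fetcher registered as `file-system-fetcher`). The function name `missing_sections` is hypothetical, and the check is far looser than whatever validation Tika itself performs.

```python
import json


def missing_sections(config_text: str) -> list:
    """List the required sections absent from a tika-server JSON config string."""
    config = json.loads(config_text)
    missing = []
    if "pipes" not in config:
        missing.append("pipes")
    # The example above registers the fetcher under the id "file-system-fetcher".
    if "file-system-fetcher" not in config.get("fetchers", {}):
        missing.append("fetchers.file-system-fetcher")
    return missing
```

An empty list means both required sections are present; it says nothing about whether their contents are valid.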
New Features
Advanced: Shared Server Mode
For memory-constrained environments, an experimental shared server mode is available. Instead of running N separate server processes (one per client), all clients share a single server process.
This mode sacrifices reliability for reduced memory usage. One crash, OOM, or timeout affects all in-flight requests. Only use if you fully understand the tradeoffs.
{
  "pipes": {
    "numClients": 4,
    "useSharedServer": true,
    "forkedJvmArgs": ["-Xmx4g"]
  }
}
See Shared Server Mode for details.