Tika-Server Integration Testing
- Setup
- Part 1: Default Mode Tests
- Test 1: GET /version
- Test 2: PUT /detect/stream
- Test 3: PUT /tika/text
- Test 4: PUT /tika/html
- Test 5: PUT /tika/xml
- Test 6: PUT /tika/json
- Test 7: PUT /meta
- Test 8: PUT /meta/{field}
- Test 9: PUT /rmeta
- Test 10: PUT /rmeta/text
- Test 11: PUT /language/stream
- Test 12: PUT /unpack/all
- Test 13: GET /parsers
- Test 14: GET /detectors
- Test 15: GET /mime-types
- Test 16: POST /meta/form
- Test 17: POST /rmeta/form
- Test 18: Config Endpoints Blocked (Default Mode)
- Part 2: Tests with enableUnsecureFeatures
- Server Options
- Headers
- Error Handling
- Cleanup
- Usability Test Results
- Known Issues
- Quick Reference
- Implementation Notes
Integration tests for tika-server to be run from a distribution ZIP.
Setup
# Create test directory
mkdir -p /tmp/tika-server-test
cd /tmp/tika-server-test
# Copy and extract distribution
cp /path/to/tika-server-standard-4.0.0-SNAPSHOT-bin.zip .
unzip tika-server-standard-4.0.0-SNAPSHOT-bin.zip
# Copy test files
cp /path/to/test-documents/testPDF.pdf .
cp /path/to/test-documents/testHTML.html .
cp /path/to/test-documents/test_recursive_embedded.docx .
Part 1: Default Mode Tests
Start server in default mode (config endpoints disabled):
java -jar tika-server.jar --port 9998 &
sleep 8
Test 1: GET /version
curl -s http://localhost:9998/version
Expected: Tika version string.
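The fixed sleep above can be too short on a slow machine and wastes time on a fast one. A minimal sketch of a polling alternative follows; `wait_for_tika` is a hypothetical helper name, and the default attempt count and delay are arbitrary assumptions rather than values taken from tika-server.

```shell
# wait_for_tika is a hypothetical helper, not part of tika-server.
# It polls a URL until it answers successfully or the attempts run out.
wait_for_tika() {
  url=${1:-http://localhost:9998/version}
  attempts=${2:-30}   # assumed default: 30 tries
  delay=${3:-1}       # assumed default: 1 second between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: non-zero exit on HTTP errors; -s: no progress output
    if curl -sf --max-time 2 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (commented out so the block has no side effects):
# java -jar tika-server.jar --port 9998 &
# wait_for_tika || echo "server did not come up"
```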
Test 2: PUT /detect/stream
curl -s -X PUT -T testPDF.pdf http://localhost:9998/detect/stream
Expected: application/pdf
Test 3: PUT /tika/text
curl -s -X PUT -T testPDF.pdf http://localhost:9998/tika/text
Expected: Plain text content extracted from PDF.
Test 4: PUT /tika/html
curl -s -X PUT -T testPDF.pdf http://localhost:9998/tika/html
Expected: HTML with metadata in <meta> tags and content in <body>.
Test 5: PUT /tika/xml
curl -s -X PUT -T testPDF.pdf http://localhost:9998/tika/xml
Expected: XHTML content (starts with <html xmlns=…>).
Test 6: PUT /tika/json
curl -s -X PUT -T testPDF.pdf http://localhost:9998/tika/json
Expected: JSON object with metadata and X-TIKA:content field.
Test 7: PUT /meta
curl -s -X PUT -H "Accept: application/json" -T testPDF.pdf http://localhost:9998/meta
Expected: JSON object with metadata only (no content).
Test 8: PUT /meta/{field}
curl -s -X PUT -T testPDF.pdf http://localhost:9998/meta/Content-Type
Expected: Content-Type,application/pdf
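To use the single-field result in a script, the value has to be separated from the field name. The sketch below assumes the response is one `name,value` line as shown above, possibly with CSV quoting; `meta_field_value` is a hypothetical helper name.

```shell
# meta_field_value is a hypothetical helper: it reads "name,value" lines
# from stdin and prints only the value part.
meta_field_value() {
  # keep everything after the first comma, then drop quotes and carriage returns
  cut -d, -f2- | tr -d '"\r'
}

# Usage (commented out; needs a running server):
# curl -s -X PUT -T testPDF.pdf http://localhost:9998/meta/Content-Type | meta_field_value
```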
Test 9: PUT /rmeta
curl -s -X PUT -T test_recursive_embedded.docx http://localhost:9998/rmeta
Expected: JSON array with metadata for main document and all embedded documents.
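One quick way to check the shape of the /rmeta response is to count the array entries: the main document plus each embedded document should each contribute one. This sketch assumes `jq` is installed; `count_rmeta_docs` is a hypothetical helper name, and the exact count for the test file is not stated here.

```shell
# count_rmeta_docs is a hypothetical helper; it assumes jq is available.
# It reads a JSON array on stdin and prints the number of elements.
count_rmeta_docs() {
  jq 'length'
}

# Usage (commented out; needs a running server):
# curl -s -X PUT -T test_recursive_embedded.docx http://localhost:9998/rmeta | count_rmeta_docs
```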
Test 10: PUT /rmeta/text
curl -s -X PUT -T test_recursive_embedded.docx http://localhost:9998/rmeta/text
Expected: JSON array with ToTextContentHandler content.
Test 11: PUT /language/stream
curl -s -X PUT -T testPDF.pdf http://localhost:9998/language/stream
Expected: Two-letter language code (e.g., en, th).
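Since the detected language depends on the document, a script can check the format of the response rather than a specific value. A minimal sketch, with `is_lang_code` as a hypothetical helper name:

```shell
# is_lang_code is a hypothetical helper: it succeeds only when its argument
# is exactly two lowercase letters, such as "en" or "th".
is_lang_code() {
  case "$1" in
    [a-z][a-z]) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (commented out; needs a running server):
# lang=$(curl -s -X PUT -T testPDF.pdf http://localhost:9998/language/stream)
# is_lang_code "$lang" && echo "PASS ($lang)" || echo "FAIL ($lang)"
```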
Test 12: PUT /unpack/all
curl -s -X PUT -T test_recursive_embedded.docx http://localhost:9998/unpack/all -o /tmp/unpack.zip
unzip -l /tmp/unpack.zip
Expected: ZIP file containing extracted embedded files plus TEXT and METADATA files.
Test 13: GET /parsers
curl -s -H "Accept: text/plain" http://localhost:9998/parsers
Expected: Hierarchical list of available parsers.
Test 14: GET /detectors
curl -s -H "Accept: text/plain" http://localhost:9998/detectors
Expected: List of available detectors.
Test 15: GET /mime-types
curl -s -H "Accept: application/json" http://localhost:9998/mime-types
Expected: JSON object with all known MIME types.
Test 16: POST /meta/form
curl -s -X POST -F "upload=@testPDF.pdf" -H "Accept: application/json" http://localhost:9998/meta/form
Expected: JSON metadata from multipart form upload.
Test 17: POST /rmeta/form
curl -s -X POST -F "upload=@test_recursive_embedded.docx" http://localhost:9998/rmeta/form
Expected: JSON array with recursive metadata from multipart upload.
Test 18: Config Endpoints Blocked (Default Mode)
curl -s -w "\nHTTP Status: %{http_code}\n" -X POST -F "file=@testPDF.pdf" http://localhost:9998/meta/config
curl -s -w "\nHTTP Status: %{http_code}\n" -X POST -F "file=@testPDF.pdf" http://localhost:9998/rmeta/config
curl -s -w "\nHTTP Status: %{http_code}\n" -X POST -F "file=@testPDF.pdf" http://localhost:9998/tika/config
curl -s -w "\nHTTP Status: %{http_code}\n" -X POST -F "file=@testPDF.pdf" http://localhost:9998/unpack/config
Expected: All return HTTP 403 with message: "Config endpoints are disabled. Set enableUnsecureFeatures=true in server config."
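The four checks above can be turned into pass/fail lines by comparing the status code that curl reports with the expected 403. The sketch below is one way to do that; `expect_status` is a hypothetical helper name, and the loop is left commented out because it needs the default-mode server to be running.

```shell
# expect_status is a hypothetical helper: it compares an actual HTTP status
# with the expected one and prints a PASS or FAIL line.
expect_status() {
  expected=$1
  actual=$2
  label=$3
  if [ "$actual" = "$expected" ]; then
    echo "PASS $label ($actual)"
  else
    echo "FAIL $label (expected $expected, got $actual)"
    return 1
  fi
}

# Usage against a running default-mode server:
# for ep in meta rmeta tika unpack; do
#   code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
#     -F "file=@testPDF.pdf" "http://localhost:9998/$ep/config")
#   expect_status 403 "$code" "/$ep/config"
# done
```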
Part 2: Tests with enableUnsecureFeatures
Stop the default server and create a config file:
pkill -f "tika-server.jar"
cat > tika-config-unsecure.json << 'EOF'
{
  "server": {
    "port": 9998,
    "host": "localhost",
    "enableUnsecureFeatures": true
  },
  "parsers": [
    {"default-parser": {}}
  ],
  "plugin-roots": "/tmp/tika-server-test/plugins"
}
EOF
java -jar tika-server.jar -c tika-config-unsecure.json &
sleep 10
curl -s http://localhost:9998/version
Test 19: POST /meta/config
curl -s -X POST -F "file=@testPDF.pdf" -H "Accept: application/json" http://localhost:9998/meta/config
Expected: JSON metadata.
Test 20: POST /meta/config with custom parser config
curl -s -X POST -F "file=@testPDF.pdf" \
-F 'config={"parsers":[{"pdf-parser":{"ocrStrategy":"NO_OCR"}}]}' \
-H "Accept: application/json" \
http://localhost:9998/meta/config
Expected: JSON metadata with custom PDF parser config applied.
Server Options
Headers
Usability Test Results
The following endpoints were tested and verified working:
Default Mode (enableUnsecureFeatures=false)
| Endpoint | Method | Status |
|---|---|---|
| | GET | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | PUT | PASS |
| | GET | PASS |
| | GET | PASS |
| | GET | PASS |
| | POST | PASS |
| | POST | PASS |
| | POST | BLOCKED (403) - Expected |
| | POST | BLOCKED (403) - Expected |
| | POST | BLOCKED (403) - Expected |
| | POST | BLOCKED (403) - Expected |
Quick Reference
Basic Parsing
# Text output
curl -X PUT -T file.pdf http://localhost:9998/tika/text
# HTML output
curl -X PUT -T file.pdf http://localhost:9998/tika/html
# JSON output (metadata + content)
curl -X PUT -T file.pdf http://localhost:9998/tika/json
Implementation Notes
Automatic Component Configuration
The server automatically configures the required fetcher and emitter for pipes-based parsing:
- tika-server-fetcher: A file-system-fetcher with basePath pointing to a dedicated temp directory for input files. This enables the /tika, /rmeta, and /meta endpoints to work with uploaded files.
- unpack-emitter: A file-system-emitter with basePath pointing to a dedicated temp directory for unpacked files. This is only created when the /unpack endpoint is enabled (default). This enables the /unpack/all endpoint to return embedded files as a ZIP.
Both temp directories are cleaned up on server shutdown.
If a user config file does not include plugin-roots, the server automatically adds a default value pointing to a plugins directory in the current working directory.
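As an illustration of the plugin-roots default, a config file that omits the key might look like the sketch below. The file name `tika-config-minimal.json` is an arbitrary choice for this example; the keys are the ones used in the Part 2 config. Per the note above, a server started with this file from /tmp/tika-server-test would be expected to look for plugins in /tmp/tika-server-test/plugins.

```shell
# Write a config without "plugin-roots" (the file name is an assumption for
# this example). The server is described as filling in a default that points
# to a "plugins" directory in the current working directory.
cat > tika-config-minimal.json << 'EOF'
{
  "server": {
    "port": 9998,
    "host": "localhost"
  },
  "parsers": [
    {"default-parser": {}}
  ]
}
EOF

# Start the server from the directory that contains plugins/ (commented out):
# java -jar tika-server.jar -c tika-config-minimal.json &
```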