Tika-App Integration Testing
- Setup
- Test Cases
- Test 1: Basic Text Extraction
- Test 2: Metadata Extraction
- Test 3: JSON Output with Pretty Print
- Test 4: File Type Detection
- Test 5: Non-existent File Handling
- Test 6: Recursive JSON Output
- Test 7: Stdin Input
- Test 8: Extract Attachments (-z)
- Test 9: Recursive Extract (-Z)
- Test 10: Batch Mode (Simple)
- Test 10b: Batch Mode with Output Options
- Test 11: Version Check
- Test 12: List Parsers
- Test 13: Language Detection
- Test 14: Digest Computation
- Test 15: URL Input
- Test 16: XMP Output
- Test 17: Boilerpipe Main Content
- Test 18: Depth Limiting
- Test 19: GUI Mode
- Advanced Tests: Custom Config
- Known Issues
Integration tests for tika-app to be run from a distribution ZIP.
Setup
# Create test directory
mkdir -p /tmp/tika-app-test
cd /tmp/tika-app-test
# Copy and extract distribution
cp /path/to/tika-app-4.0.0-SNAPSHOT.zip .
unzip tika-app-4.0.0-SNAPSHOT.zip
cd tika-app-4.0.0-SNAPSHOT
# Get test files
cp /path/to/tika-main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testPDF.pdf .
cp /path/to/tika-main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/test_recursive_embedded.docx .
cp /path/to/tika-main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testHTML.html .
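Because every test below assumes the jar and the three sample documents sit in the current directory, a small pre-flight check can catch a bad copy step early. The following is a minimal sketch; `require_files` is a hypothetical helper name, not part of tika-app.

```shell
# Hypothetical pre-flight check: verify that each named file exists
# before running any test. Returns non-zero if any file is missing.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (file names taken from the setup steps above):
# require_files tika-app.jar testPDF.pdf test_recursive_embedded.docx testHTML.html
```

The helper only reports missing files; it does not try to copy them, since the source paths differ per checkout.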
Test Cases
Test 1: Basic Text Extraction
java -jar tika-app.jar --text testPDF.pdf
Expected: Outputs extracted text from PDF.
Test 2: Metadata Extraction
java -jar tika-app.jar --metadata testPDF.pdf
Expected: Outputs key=value metadata pairs.
Test 3: JSON Output with Pretty Print
java -jar tika-app.jar --json --pretty-print testPDF.pdf
Expected: Clean, readable JSON output with metadata.
Test 4: File Type Detection
java -jar tika-app.jar --detect testPDF.pdf
Expected: Returns application/pdf
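Exact-match expectations such as this one lend themselves to a scripted check. The sketch below assumes a hypothetical helper, `expect_output`, that runs a command, captures its stdout, and compares it with an expected string; stderr is discarded on the assumption that any warnings go there.

```shell
# Hypothetical helper: run a command and compare its stdout with an
# expected string. Usage: expect_output EXPECTED COMMAND [ARGS...]
expect_output() {
    expected="$1"
    shift
    # Capture stdout only; ignore stderr and the command's exit status.
    actual="$("$@" 2>/dev/null)" || true
    if [ "$actual" = "$expected" ]; then
        echo "PASS: $*"
    else
        echo "FAIL: $* (expected '$expected', got '$actual')"
        return 1
    fi
}

# Example call for this test:
# expect_output "application/pdf" java -jar tika-app.jar --detect testPDF.pdf
```

Command substitution strips trailing newlines, so the comparison is against the bare media type string.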
Test 5: Non-existent File Handling
java -jar tika-app.jar --text nonexistent_file.pdf
Expected: A clear error message. Currently the output is a confusing "MalformedURLException: no protocol" (see Known Issues).
Test 6: Recursive JSON Output
java -jar tika-app.jar --jsonRecursive test_recursive_embedded.docx
Expected: JSON array with metadata and content for main doc and all embedded documents.
Test 7: Stdin Input
echo "Hello World" | java -jar tika-app.jar --text
Expected: Outputs "Hello World"
Test 8: Extract Attachments (-z)
mkdir -p /tmp/tika-app-test/extract-out
java -jar tika-app.jar -z --extract-dir=/tmp/tika-app-test/extract-out test_recursive_embedded.docx
ls /tmp/tika-app-test/extract-out
Expected: Creates a .json metadata file and extracts embedded files to the extract-out directory.
Test 9: Recursive Extract (-Z)
mkdir -p /tmp/tika-app-test/extract-recursive
java -jar tika-app.jar -Z --extract-dir=/tmp/tika-app-test/extract-recursive test_recursive_embedded.docx
ls -R /tmp/tika-app-test/extract-recursive
Expected: Extracts all nested embedded documents recursively.
Test 10: Batch Mode (Simple)
mkdir -p /tmp/tika-app-test/batch-input
mkdir -p /tmp/tika-app-test/batch-output
cp testPDF.pdf testHTML.html /tmp/tika-app-test/batch-input/
java -jar tika-app.jar /tmp/tika-app-test/batch-input /tmp/tika-app-test/batch-output
ls /tmp/tika-app-test/batch-output
Expected: Creates a .json file for each input file in the output directory.
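The one-output-per-input expectation can be checked mechanically rather than by reading the `ls` listing. This is a minimal sketch with a hypothetical helper, `check_batch_outputs`. It assumes each output is named by appending `.json` to the input file name (for example `testPDF.pdf.json`); the text above does not state the naming scheme, so adjust it if your build differs.

```shell
# Hypothetical check: every regular file in the input directory should have
# a non-empty "<name>.json" in the output directory.
# ASSUMPTION: outputs are named by appending ".json" to the input file name.
check_batch_outputs() {
    in_dir="$1"
    out_dir="$2"
    rc=0
    for src in "$in_dir"/*; do
        [ -f "$src" ] || continue
        name="$(basename "$src")"
        if [ ! -s "$out_dir/$name.json" ]; then
            echo "no output for $name" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example call for this test:
# check_batch_outputs /tmp/tika-app-test/batch-input /tmp/tika-app-test/batch-output
```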
Test 10b: Batch Mode with Output Options
mkdir -p /tmp/tika-app-test/batch-output2
java -jar tika-app.jar -J -t /tmp/tika-app-test/batch-input /tmp/tika-app-test/batch-output2
ls /tmp/tika-app-test/batch-output2
Expected: Creates .json files with text content (X-TIKA:content_handler should be ToTextContentHandler).
Test 12: List Parsers
java -jar tika-app.jar --list-parsers
Expected: Hierarchical list of available parsers.
Test 13: Language Detection
java -jar tika-app.jar --language testPDF.pdf
Expected: Returns detected language code.
Test 14: Digest Computation
java -jar tika-app.jar --digest=md5 --json testPDF.pdf
Expected: JSON output includes X-TIKA:digest:MD5 field.
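Beyond confirming that the field is present, the value can be cross-checked against an independently computed hash. The sketch below uses a hypothetical helper, `digest_matches`, and assumes that GNU `md5sum` is installed and that the digest is written as a hex string. It only checks that the hash appears somewhere in the JSON text, not that it sits in the X-TIKA:digest:MD5 field specifically.

```shell
# Hypothetical check: the MD5 that md5sum computes for a file should appear
# in the JSON text produced for the same file.
# ASSUMPTIONS: GNU md5sum is available; the digest is a hex string
# (compared case-insensitively here).
digest_matches() {
    file="$1"
    json="$2"
    expected="$(md5sum "$file" | cut -d' ' -f1)"
    printf '%s\n' "$json" | grep -qi "$expected"
}

# Example call for this test:
# digest_matches testPDF.pdf "$(java -jar tika-app.jar --digest=md5 --json testPDF.pdf)"
```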
Test 15: URL Input
java -jar tika-app.jar --detect https://www.apache.org/
Expected: Returns text/html
Test 16: XMP Output
java -jar tika-app.jar --xmp testPDF.pdf
Expected: Valid XMP metadata in RDF/XML format.
Test 17: Boilerpipe Main Content
java -jar tika-app.jar --text-main testHTML.html
Expected: Returns only main content, not boilerplate.
Advanced Tests: Custom Config
These tests require a custom Tika JSON config file (my-config.json, created in Test 20).
Test 20: Create Custom Config File
Create /tmp/tika-app-test/my-config.json:
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": 100000,
      "throwOnWriteLimitReached": false
    }
  },
  "parsers": [
    {
      "default-parser": {}
    },
    {
      "pdf-parser": {
        "extractActions": true,
        "extractInlineImages": true,
        "ocrStrategy": "NO_OCR"
      }
    },
    {
      "ooxml-parser": {
        "includeDeletedContent": true,
        "includeMoveFromContent": true,
        "extractMacros": true
      }
    }
  ],
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/tmp/tika-app-test/batch-input",
        "extractFileSystemMetadata": true
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "/tmp/tika-app-test/config-output",
        "fileExtension": "json",
        "onExists": "REPLACE"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/tmp/tika-app-test/batch-input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "numClients": 2,
    "timeoutMillis": 60000
  },
  "plugin-roots": "/tmp/tika-app-test/plugins"
}
Test 21: Run with Custom Config
mkdir -p /tmp/tika-app-test/config-output
java -jar tika-app.jar /tmp/tika-app-test/my-config.json
ls /tmp/tika-app-test/config-output
Expected: Processes all files in batch-input using custom parser settings.
Test 22: Async Mode with Config Flag
java -jar tika-app.jar -a --config=/tmp/tika-app-test/my-config.json
Expected: Same result as Test 21, but using the explicit async flag (-a).
Test 23: Unpack with Frictionless Format
mkdir -p /tmp/tika-app-test/frictionless-out
java -jar tika-app.jar -Z --extract-dir=/tmp/tika-app-test/frictionless-out --unpack-format=FRICTIONLESS --unpack-include-metadata test_recursive_embedded.docx
ls /tmp/tika-app-test/frictionless-out
Expected: Extracts embedded files in Frictionless data package format with metadata.json.
Test 24: Unpack to Directory (not zipped)
mkdir -p /tmp/tika-app-test/unpack-dir-out
java -jar tika-app.jar -Z --extract-dir=/tmp/tika-app-test/unpack-dir-out --unpack-mode=DIRECTORY test_recursive_embedded.docx
ls -R /tmp/tika-app-test/unpack-dir-out
Expected: Extracts embedded files into a directory structure instead of a zip archive.
Test 25: Batch with Multiple Workers
mkdir -p /tmp/tika-app-test/multi-worker-out
java -jar tika-app.jar -n 4 /tmp/tika-app-test/batch-input /tmp/tika-app-test/multi-worker-out
Expected: Processes files using 4 parallel forked clients.
Known Issues
Issue 1: Confusing "no protocol" Error
When a file doesn’t exist, the error message is misleading:
MalformedURLException: no protocol: nonexistent_file.pdf
The message should say "File not found" instead.
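Until the message is fixed, a test script can guard the call itself. The following is a sketch of one possible workaround, not tika-app behavior: `tika_text` is a hypothetical wrapper that rejects a missing path before Java starts, while letting anything that looks like a URL through (URL input is valid, as in Test 15). `TIKA_CMD` is an assumed override hook so the wrapper can be exercised without Java, and the exit code 2 is an arbitrary choice.

```shell
# Hypothetical wrapper: fail early with a clear message when the argument is
# neither an existing path nor something that looks like a URL.
# TIKA_CMD is an assumed override; it defaults to the command used above.
tika_text() {
    input="$1"
    case "$input" in
        *://*)
            ;;  # looks like a URL; pass it through unchanged
        *)
            if [ ! -e "$input" ]; then
                echo "File not found: $input" >&2
                return 2
            fi
            ;;
    esac
    # Left unquoted on purpose so the default splits into command and arguments.
    ${TIKA_CMD:-java -jar tika-app.jar} --text "$input"
}

# Example call:
# tika_text nonexistent_file.pdf
```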