Migrating to Tika 4.x
This guide covers the changes required when upgrading from Apache Tika 3.x to 4.x.
See the Roadmap for version timelines and support schedules.
Configuration: XML to JSON
Tika 4.x uses JSON configuration files instead of XML. The legacy tika-config.xml format
is no longer supported.
Automatic Conversion
Tika provides a conversion tool in tika-app to help migrate your XML configuration:
java -jar tika-app.jar --convert-config-xml-to-json=tika-config.xml,tika-config.json
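If the conversion has to run from a build or migration script, the command above can be assembled programmatically. The following is a minimal sketch rather than part of Tika: the class name `ConvertConfigCommand` and its `build` helper are hypothetical, and only the `--convert-config-xml-to-json` flag and its comma-separated argument come from the command shown above.

```java
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that assembles the converter command line shown above. */
final class ConvertConfigCommand {

    // The flag takes "<input.xml>,<output.json>", so neither path may contain a comma.
    static List<String> build(Path tikaAppJar, Path xmlIn, Path jsonOut) {
        return List.of(
                "java", "-jar", tikaAppJar.toString(),
                "--convert-config-xml-to-json=" + xmlIn + "," + jsonOut);
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = build(Path.of("tika-app.jar"),
                Path.of("tika-config.xml"), Path.of("tika-config.json"));
        // inheritIO() forwards the converter's output to this console.
        int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println("converter exited with " + exit);
    }
}
```

Passing the arguments as a list avoids shell quoting problems when the paths contain spaces.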
The converter currently supports:
- Parsers section - parser declarations with parameters and exclusions
- Parameter types - bool, int, long, double, float, string, list, and map
- Special handling - TesseractOCR’s otherTesseractSettings list is automatically converted to the otherTesseractConfig map format
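To make the list-to-map conversion concrete, here is a rough sketch of the idea. It is not the converter's actual code: the class `TesseractSettingsSketch` is hypothetical, and it assumes each entry of the old list has the form "key value" separated by whitespace.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: models the otherTesseractSettings list to otherTesseractConfig map change. */
final class TesseractSettingsSketch {

    /**
     * Turns entries such as "textord_initialx_ile 0.75" into a key/value map.
     * Assumption: each entry is "<key><whitespace><value>".
     */
    static Map<String, String> toConfigMap(List<String> otherTesseractSettings) {
        Map<String, String> config = new LinkedHashMap<>(); // keeps the original order
        for (String entry : otherTesseractSettings) {
            // Split on the first run of whitespace only, so the value is kept whole.
            String[] parts = entry.trim().split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'key value' but got: " + entry);
            }
            config.put(parts[0], parts[1]);
        }
        return config;
    }

    public static void main(String[] args) {
        System.out.println(toConfigMap(List.of(
                "textord_initialx_ile 0.75",
                "textord_noise_hfract 0.15625")));
    }
}
```

In the JSON configuration the resulting pairs appear as string values under otherTesseractConfig.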
Example Conversion
XML Format (3.x):
<properties>
<parsers>
<parser class="org.apache.tika.parser.pdf.PDFParser">
<params>
<param name="sortByPosition" type="bool">true</param>
<param name="maxMainMemoryBytes" type="long">1000000</param>
</params>
</parser>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
</parser>
</parsers>
</properties>
JSON Format (4.x):
{
"parsers": [
{
"pdf-parser": {
"sortByPosition": true,
"maxMainMemoryBytes": 1000000
}
}
]
}
When you configure a parser with specific settings in JSON, the loader automatically
excludes it from SPI loading. The parser (e.g., pdf-parser) is not even instantiated in
default-parser if there’s a definition for it in the tika-config.json. Explicit _exclude
directives are only needed when you want to disable a parser entirely without providing
custom configuration.
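The exclusion rule described here can be pictured as a set difference. The sketch below is a simplified model and not the loader's implementation: the class `DefaultParserSelectionSketch` and the name `some-other-parser` are hypothetical, and parsers are represented only by their component names.

```java
import java.util.Set;
import java.util.TreeSet;

/** Simplified model of which SPI-discovered parsers the default parser still loads. */
final class DefaultParserSelectionSketch {

    /**
     * spiDiscovered: component names found via SPI.
     * configured:    names that have their own entry in the JSON config.
     * excluded:      names disabled explicitly, without custom configuration.
     */
    static Set<String> remainingDefaults(Set<String> spiDiscovered,
                                         Set<String> configured,
                                         Set<String> excluded) {
        Set<String> remaining = new TreeSet<>(spiDiscovered);
        remaining.removeAll(configured); // implicit: configured parsers are skipped
        remaining.removeAll(excluded);   // explicit: disabled parsers are skipped too
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> spi = Set.of("pdf-parser", "tesseract-ocr-parser", "some-other-parser");
        // pdf-parser has its own JSON entry, so it is left out of the default set.
        System.out.println(remainingDefaults(spi, Set.of("pdf-parser"), Set.of()));
    }
}
```

Under this model, an explicit exclusion only matters for a parser that has no JSON entry of its own.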
Key Differences
| Aspect | XML (3.x) | JSON (4.x) |
|---|---|---|
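| Class references | Full class name (e.g., org.apache.tika.parser.pdf.PDFParser) | Kebab-case component name (e.g., pdf-parser) |
| Parameters | <param> elements with name and type attributes | Direct key-value pairs |
| Exclusions | <parser-exclude> element | _exclude directive |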
Limitations
The automatic converter has some limitations:
- Only the parsers section is currently converted
- Detectors and other sections require manual migration
- Custom or third-party parsers not in the registry will use kebab-case name conversion
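As an illustration of what a kebab-case name conversion might look like, the sketch below derives a component name from a class name. This is an assumption, not the converter's actual rule: the class `KebabCaseSketch` is hypothetical, and the regular expressions were chosen only because they reproduce the names used in this guide (pdf-parser, tesseract-ocr-parser, default-parser).

```java
import java.util.Locale;

/** Illustrative kebab-case conversion; the real converter may apply different rules. */
final class KebabCaseSketch {

    /** Converts a simple class name such as "PDFParser" to "pdf-parser". */
    static String toKebabCase(String simpleClassName) {
        return simpleClassName
                // Split an acronym from the word after it: "PDFParser" -> "PDF-Parser".
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                // Split a lower-case letter or digit from the capital after it.
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }

    /** Drops the package prefix, if any, before converting. */
    static String fromClassName(String className) {
        return toKebabCase(className.substring(className.lastIndexOf('.') + 1));
    }

    public static void main(String[] args) {
        System.out.println(fromClassName("org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

After converting a configuration that references custom parsers, check the generated names against the names your components are actually registered under.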
Parser Configuration Changes
The configuration options for PDFParser and TesseractOCRParser have changed
significantly in 4.x. The automatic converter will migrate your parameter names, but you
should review the updated documentation to ensure your configuration is optimal.
See the Configuration section for full details.
For the general serialization model and how JSON configuration works, see Serialization and Configuration.
Full Configuration Example
Below is a complete example of a Tika 4.x JSON configuration file with commonly configured parsers:
{
"parsers": [
{
"pdf-parser": {
"extractInlineImages": true,
"extractUniqueInlineImagesOnly": true,
"sortByPosition": true,
"maxMainMemoryBytes": 1000000000
}
},
{
"tesseract-ocr-parser": {
"language": "eng+fra",
"pageSegMode": "1",
"timeoutSeconds": 300,
"otherTesseractConfig": {
"textord_initialx_ile": "0.75",
"textord_noise_hfract": "0.15625"
}
}
},
{
"default-parser": {}
}
]
}
This example shows common options. See the individual parser configuration pages for complete documentation of all available options.
Metadata Key Changes
Tika 4.x prefixes all "user generated" metadata keys to prevent overwrites and improve namespace clarity.
See Metadata Changes in 4.x for complete details, including a full table of changes and code migration examples.
API Changes
TikaConfig replaced by TikaLoader
TikaConfig has been removed. Use TikaLoader from tika-serialization instead.
3.x:
TikaConfig config = new TikaConfig(getClass().getClassLoader());
Parser parser = config.getParser();
Detector detector = config.getDetector();
AutoDetectParser autoDetect = new AutoDetectParser(config);
4.x:
// Default configuration (SPI-discovered components)
TikaLoader loader = TikaLoader.loadDefault(getClass().getClassLoader());
// Or, instead of the default configuration, load from a JSON config file:
// TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
// Access components
Parser parser = loader.loadParsers();
Detector detector = loader.loadDetectors();
Parser autoDetect = loader.loadAutoDetectParser();
ParseContext context = loader.loadParseContext();
TikaLoader is in the tika-serialization module. Add tika-serialization
as a dependency if you were previously only depending on tika-core.
See Serialization and Configuration for
the full TikaLoader API.
For simple use cases, the Tika facade and DefaultParser still work without
TikaLoader:
// Simple facade (unchanged from 3.x)
Tika tika = new Tika();
String text = tika.parseToString(file);
// Direct parser use (unchanged from 3.x)
Parser parser = new DefaultParser();
ExternalParser
The legacy ExternalParser and CompositeExternalParser have been removed.
External parsers must now be explicitly configured via JSON. See
External Parser Configuration
for details.
Deprecations and Removals
- TikaConfig: replaced by TikaLoader
- CompositeExternalParser: external parsers now require explicit JSON configuration
- ExternalParsersFactory and XML-based external parser auto-discovery
- DOM-based OOXML extractors (XWPFWordExtractorDecorator, XSLFPowerPointExtractorDecorator): SAX-based extractors are now the only implementation