Migrating to Tika 4.x

This guide covers the changes required when upgrading from Apache Tika 3.x to 4.x.

See the Roadmap for version timelines and support schedules.

Requirements

  • Java 17 or later (upgraded from Java 11 in 3.x)

Configuration: XML to JSON

Tika 4.x uses JSON configuration files instead of XML. The legacy tika-config.xml format is no longer supported.

Automatic Conversion

Tika provides a conversion tool in tika-app to help migrate your XML configuration:

java -jar tika-app.jar --convert-config-xml-to-json=tika-config.xml,tika-config.json

The converter currently supports:

  • Parsers section - parser declarations with parameters and exclusions

  • Parameter types - bool, int, long, double, float, string, list, and map

  • Special handling - TesseractOCR’s otherTesseractSettings list is automatically converted to the otherTesseractConfig map format
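
To make the last item concrete, the sketch below turns a 3.x-style list of settings into a 4.x-style map. It is an illustration only, not the converter's code: the class and method names are hypothetical, and it assumes each list entry is a single string of the form "key value" separated by whitespace.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: illustrates a list-to-map conversion.
final class TesseractSettingsSketch {

    // Assumption: every entry looks like "key value" (whitespace-separated).
    static Map<String, String> toConfigMap(List<String> otherTesseractSettings) {
        Map<String, String> config = new LinkedHashMap<>(); // preserves entry order
        for (String entry : otherTesseractSettings) {
            String[] parts = entry.trim().split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'key value' but got: " + entry);
            }
            config.put(parts[0], parts[1]);
        }
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> config = toConfigMap(List.of(
                "textord_initialx_ile 0.75",
                "textord_noise_hfract 0.15625"));
        System.out.println(config);
    }
}
```

The resulting keys and values correspond to the entries of an otherTesseractConfig object in the JSON file.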

Example Conversion

XML Format (3.x):

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
        <param name="maxMainMemoryBytes" type="long">1000000</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
  </parsers>
</properties>

JSON Format (4.x):

{
  "parsers": [
    {
      "pdf-parser": {
        "sortByPosition": true,
        "maxMainMemoryBytes": 1000000
      }
    }
  ]
}

When you configure a parser with specific settings in JSON, the loader automatically excludes it from SPI loading. A parser such as pdf-parser is not instantiated inside default-parser if tika-config.json contains a definition for it. Explicit _exclude directives are needed only when you want to disable a parser entirely without providing custom configuration.
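
The exclusion rule described above can be pictured as a set difference. The following is a minimal model in plain Java, not the loader's real code: the class name, the method name, and the html-parser sample name are illustrative assumptions.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the rule: parsers with their own JSON entry,
// or listed in _exclude, are left out of default-parser.
final class AutoExclusionSketch {

    // spiDiscovered: component names found through SPI
    // configured:    names that have their own entry in tika-config.json
    // excluded:      names listed in an explicit _exclude directive
    static Set<String> defaultParserMembers(List<String> spiDiscovered,
                                            Set<String> configured,
                                            Set<String> excluded) {
        Set<String> members = new LinkedHashSet<>(spiDiscovered);
        members.removeAll(configured); // configured separately, so skipped here
        members.removeAll(excluded);   // disabled entirely
        return members;
    }

    public static void main(String[] args) {
        // "html-parser" is only a sample name for this sketch.
        List<String> discovered = List.of("pdf-parser", "tesseract-ocr-parser", "html-parser");
        Set<String> members = defaultParserMembers(discovered, Set.of("pdf-parser"), Set.of());
        System.out.println(members);
    }
}
```

In this model, giving pdf-parser its own entry has the same effect on default-parser as excluding it, which is why no explicit _exclude is needed in that case.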

Key Differences

Aspect            XML (3.x)                                               JSON (4.x)
Class references  Full class name (org.apache.tika.parser.pdf.PDFParser)  Kebab-case component name (pdf-parser)
Parameters        <param name="…" type="…">value</param>                  Direct key-value pairs
Exclusions        <parser-exclude class="…"/>                             "_exclude": ["component-name"] (only needed to disable a parser entirely)

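The Parameters difference comes down to where the type lives: XML states it in the type attribute and stores the value as text, while JSON expresses it through the literal itself. As a rough sketch, the hypothetical helper below maps the scalar type names from the converter list to Java values; list and map are omitted, and none of this is Tika's own code.

```java
// Hypothetical helper, not part of Tika: interprets a 3.x <param> text value
// according to its declared type attribute.
final class ParamTypeSketch {

    static Object toTypedValue(String type, String text) {
        String value = text.trim();
        switch (type) {
            case "bool":
                return Boolean.parseBoolean(value);
            case "int":
                return Integer.parseInt(value);
            case "long":
                return Long.parseLong(value);
            case "float":
                return Float.parseFloat(value);
            case "double":
                return Double.parseDouble(value);
            case "string":
                return text; // strings are kept verbatim
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    public static void main(String[] args) {
        // Values taken from the XML example in this guide.
        System.out.println(toTypedValue("bool", "true"));
        System.out.println(toTypedValue("long", "1000000"));
    }
}
```

A JSON writer would then emit these values as unquoted true and 1000000, matching the converted example earlier in this guide.
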
Limitations

The automatic converter has some limitations:

  • Only the parsers section is currently converted

  • Detectors and other sections require manual migration

  • For custom or third-party parsers that are not in the registry, the converter derives the component name by converting the class name to kebab-case
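
As an illustration of the kebab-case conversion, the sketch below derives a component name from a class name. The exact rule Tika applies is not documented in this guide, so treat this as an assumption: it reproduces the names that appear here (pdf-parser, tesseract-ocr-parser, default-parser), and the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: one plausible kebab-case rule.
final class KebabCaseSketch {

    static String toComponentName(String className) {
        // Drop the package prefix, if any.
        String simpleName = className.substring(className.lastIndexOf('.') + 1);
        return simpleName
                // Split an acronym from the following word: "PDFParser" -> "PDF-Parser"
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                // Split a lowercase letter or digit from the next capital: "DefaultParser" -> "Default-Parser"
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toComponentName("org.apache.tika.parser.pdf.PDFParser"));
        System.out.println(toComponentName("org.apache.tika.parser.DefaultParser"));
    }
}
```

Under this rule a hypothetical com.example.MyCustomParser would become my-custom-parser. Check the name the converter actually generates for your own parsers before relying on it.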

Parser Configuration Changes

The configuration options for PDFParser and TesseractOCRParser have changed significantly in 4.x. The automatic converter migrates your parameter names, but you should review the updated documentation to confirm that the migrated settings still do what you intend.

See the Configuration section for full details.

For the general serialization model and how JSON configuration works, see Serialization and Configuration.

Full Configuration Example

Below is a complete example of a Tika 4.x JSON configuration file with commonly configured parsers:

{
  "parsers": [
    {
      "pdf-parser": {
        "extractInlineImages": true,
        "extractUniqueInlineImagesOnly": true,
        "sortByPosition": true,
        "maxMainMemoryBytes": 1000000000
      }
    },
    {
      "tesseract-ocr-parser": {
        "language": "eng+fra",
        "pageSegMode": "1",
        "timeoutSeconds": 300,
        "otherTesseractConfig": {
          "textord_initialx_ile": "0.75",
          "textord_noise_hfract": "0.15625"
        }
      }
    },
    {
      "default-parser": {}
    }
  ]
}

This example shows common options. See the individual parser configuration pages for complete documentation of all available options.

Metadata Key Changes

Tika 4.x prefixes all "user generated" metadata keys to prevent overwrites and improve namespace clarity.

See Metadata Changes in 4.x for complete details, including a full table of changes and code migration examples.

API Changes

TikaConfig replaced by TikaLoader

TikaConfig has been removed. Use TikaLoader from tika-serialization instead.

3.x:

TikaConfig config = new TikaConfig(getClass().getClassLoader());
Parser parser = config.getParser();
Detector detector = config.getDetector();
AutoDetectParser autoDetect = new AutoDetectParser(config);

4.x:

// Default configuration (SPI-discovered components)
TikaLoader loader = TikaLoader.loadDefault(getClass().getClassLoader());

// Or from a JSON config file (use this instead of loadDefault above)
// TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));

// Access components
Parser parser = loader.loadParsers();
Detector detector = loader.loadDetectors();
Parser autoDetect = loader.loadAutoDetectParser();
ParseContext context = loader.loadParseContext();

TikaLoader is in the tika-serialization module. If you previously depended only on tika-core, add tika-serialization as a dependency. See Serialization and Configuration for the full TikaLoader API.

For simple use cases, the Tika facade and DefaultParser still work without TikaLoader:

// Simple facade (unchanged from 3.x)
Tika tika = new Tika();
String text = tika.parseToString(file);

// Direct parser use (unchanged from 3.x)
Parser parser = new DefaultParser();

ExternalParser

The legacy ExternalParser and CompositeExternalParser have been removed. External parsers must now be explicitly configured via JSON. See External Parser Configuration for details.

Deprecations and Removals

  • TikaConfig — replaced by TikaLoader

  • CompositeExternalParser — external parsers now require explicit JSON configuration

  • ExternalParsersFactory and XML-based external parser auto-discovery

  • DOM-based OOXML extractors (XWPFWordExtractorDecorator, XSLFPowerPointExtractorDecorator) — SAX-based extractors are now the only implementation