Serialization in Tika 4.x

This document describes the JSON serialization design and implementation details for Apache Tika 4.x.

High-Level Goals

Jackson Framework Integration

Use Jackson as much as possible with as few custom serializers and as few annotations as possible. Jackson dependencies are kept out of core modules to maintain flexibility.

Friendly Naming Conventions

The implementation uses friendly names such as pdf-parser rather than fully qualified class names. Friendly names are assigned to the components being configured, not to their configuration classes.

Discovering the friendly name for a component

The 4.x JSON config refers to parsers, detectors, fetchers, emitters, and other components by their friendly name (e.g., pdf-parser, file-system-fetcher). To map a Java class to its friendly name (or vice versa), use any of:

  1. tika-app --list-parser-names / --list-detector-names prints each registered class with its friendly name, one pair per line, as tab-separated class<TAB>friendly-name:

    java -jar tika-app.jar --list-parser-names
    # org.apache.tika.parser.pdf.PDFParser     pdf-parser
    # org.apache.tika.parser.html.JSoupParser  jsoup-parser
    # ...

    The mapping comes from the META-INF/tika/parsers.idx / detectors.idx files generated at compile time by the @TikaComponent annotation processor. The underlying lookup is o.a.t.config.loader.ComponentRegistry.getFriendlyName(Class).

  2. Per-parser configuration pages under Configuration show the friendly name in their page title and JSON examples.

  3. The naming convention — when @TikaComponent has no explicit name, the friendly name is derived from the class’s simple name via the kebab-case rule in o.a.t.config.loader.KebabCaseConverter. Examples:

    Class                 Friendly name
    PDFParser             pdf-parser
    TesseractOCRParser    tesseract-ocr-parser
    AutoDetectParser      auto-detect-parser
    FileSystemFetcher     file-system-fetcher
    SolrEmitter           solr-emitter
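
The examples above can be reproduced with a small conversion routine. The following is a minimal sketch, not the code of o.a.t.config.loader.KebabCaseConverter: the class name KebabCaseSketch and the two regular expressions are assumptions chosen to match the listed examples. It does not cover components whose @TikaComponent annotation sets an explicit name.

```java
import java.util.Locale;

// Hypothetical helper: approximates the kebab-case rule for the examples
// listed above. It is NOT Tika's KebabCaseConverter.
final class KebabCaseSketch {
    private KebabCaseSketch() {}

    static String toKebabCase(String simpleName) {
        return simpleName
                // lower-case letter or digit followed by upper-case: "tP" -> "t-P"
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                // acronym followed by a capitalized word: "PDFParser" -> "PDF-Parser"
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }
}
```

Calling KebabCaseSketch.toKebabCase("TesseractOCRParser") yields tesseract-ocr-parser under these rules. For authoritative names, prefer the --list-*-names output over any reimplementation of the rule.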

The --list-parsers, --list-detectors, and --list-parser-details commands print the hierarchical, human-oriented view (class names with composite parsers indented). Use the --list-*-names variants when you want a machine-readable mapping.
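
Because the --list-*-names output is tab-separated, it is straightforward to load into a lookup table. The sketch below assumes one class<TAB>friendly-name pair per line, as described above; the class FriendlyNameListing and its parse method are hypothetical names, not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns captured "--list-parser-names" output lines
// into a class-name -> friendly-name map.
final class FriendlyNameListing {
    private FriendlyNameListing() {}

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> classToName = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            // Skip blank lines and lines without both a class and a name.
            if (tab <= 0 || tab == line.length() - 1) {
                continue;
            }
            String className = line.substring(0, tab).trim();
            String friendlyName = line.substring(tab + 1).trim();
            classToName.put(className, friendlyName);
        }
        return classToName;
    }
}
```

The lines could come from a saved file or from the process output of the tika-app command; inverting the map gives the friendly-name to class direction.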

Custom Class Support

The design permits users to add custom classes through Jackson’s polymorphic handling:

  • org.apache.tika patterns are allowed by default

  • Users can define additional inclusion patterns; inclusion is restricted for security

Configuration Consistency

The approach seeks to make initialization and runtime configuration look exactly the same and use the same underlying code where possible. However, security constraints may require differences in which fields are modifiable at runtime.

Configuration Objects Over Annotations

Config objects are preferred over field annotations to support multithreading. Parsers retrieve their settings from the ParseContext at runtime.
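
The pattern can be illustrated with a simplified stand-in. In the sketch below, ContextSketch is a minimal class-keyed map that only imitates the role ParseContext plays here, and PdfConfigSketch and ParserSketch are hypothetical types; none of them are Tika classes. The point is that the parser keeps no mutable per-request state, so one instance can serve many threads.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a class-keyed context; NOT Tika's ParseContext.
final class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

// Hypothetical immutable config object.
final class PdfConfigSketch {
    final boolean extractInlineImages;

    PdfConfigSketch(boolean extractInlineImages) {
        this.extractInlineImages = extractInlineImages;
    }
}

// Hypothetical parser: holds only immutable defaults and reads
// per-request settings from the context passed to each call.
final class ParserSketch {
    private final PdfConfigSketch defaults = new PdfConfigSketch(false);

    boolean wouldExtractImages(ContextSketch context) {
        PdfConfigSketch config = context.get(PdfConfigSketch.class, defaults);
        return config.extractInlineImages;
    }
}
```

A caller that wants different behavior for one request puts its own PdfConfigSketch into that request's context; other requests continue to see the defaults.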

Cross-System Configuration Flow

Configuration must pass seamlessly from:

  1. User clients

  2. Through tika-server REST APIs

  3. Into tika-pipes infrastructure

Initialization Structure

Tier 1 Objects

ID Objects

Fetchers and emitters: components with unique identifiers.

Composite Objects

Parsers and detectors: components that aggregate other components.

Single Objects

Pipes, gRPC, and server configurations.

Tier 2 Objects

Components declared in an other-config section, which are resolved by the friendly names provided through their @TikaComponent annotations.

Runtime Patterns

Backwards Compatibility

The design maintains backwards compatibility by continuing to allow objects to be added to the ParseContext, with the interface serving as the key.

Partial Configuration Updates

Users can supply partial JSON objects containing only the changes to the initialization configuration, rather than a complete configuration document.
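
One way to picture a partial update is as a recursive merge of the update onto the initialization configuration. The sketch below works on plain nested maps and is only an illustration of that idea: ConfigMergeSketch is a hypothetical class, the keys optionA and optionB are invented, and Tika's actual merge logic may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: overlays a partial update onto a base configuration.
final class ConfigMergeSketch {
    private ConfigMergeSketch() {}

    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> update) {
        // Copy the base so the original configuration is left untouched.
        Map<String, Object> result = new LinkedHashMap<>(base);
        for (Map.Entry<String, Object> entry : update.entrySet()) {
            Object existing = result.get(entry.getKey());
            Object incoming = entry.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                // Both sides are objects: merge field by field.
                result.put(entry.getKey(),
                        merge((Map<String, Object>) existing, (Map<String, Object>) incoming));
            } else {
                // Scalars, lists, and new keys: the update wins.
                result.put(entry.getKey(), incoming);
            }
        }
        return result;
    }
}
```

With a base of {"pdf-parser": {"optionA": 1, "optionB": 2}} and an update of {"pdf-parser": {"optionB": 3}}, the merged result keeps optionA and replaces optionB.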

Self-Configuring Components in Pipes

In the pipes infrastructure, objects should configure themselves to avoid classloading dependencies on components like PDFParser.

Security Considerations

  • Configuration files at initialization are treated as trusted sources

  • Runtime serialization/deserialization uses an allowlist of permitted packages

  • Custom components can register patterns in META-INF/tika-serialization-allowlist.txt
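
The allowlist check can be sketched as a prefix test on the class name. This is an assumption about the matching rule: the source only says that packages are allowlisted and that org.apache.tika is permitted by default, so the prefix semantics, the AllowlistSketch class, and the com.example. prefix below are all hypothetical. The format of META-INF/tika-serialization-allowlist.txt is not described here and is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical allowlist: a class is accepted only if its fully qualified
// name starts with a permitted package prefix.
final class AllowlistSketch {
    private final List<String> prefixes;

    AllowlistSketch(List<String> extraPrefixes) {
        List<String> all = new ArrayList<>();
        // Default entry; the trailing dot keeps "org.apache.tikaevil" out.
        all.add("org.apache.tika.");
        all.addAll(extraPrefixes);
        this.prefixes = List.copyOf(all);
    }

    boolean isAllowed(String className) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

A check of this kind would run before any class named in runtime JSON is instantiated, so that untrusted input cannot select arbitrary classes.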

See Design Notes for 4.x for additional architectural context.