Serialization in Tika 4.x

This document describes the JSON serialization design and implementation details for Apache Tika 4.x.

High-Level Goals

Jackson Framework Integration

Use Jackson's standard behavior wherever possible, with as few custom serializers and annotations as possible. Jackson dependencies are kept out of the core modules to maintain flexibility.

Friendly Naming Conventions

The implementation uses friendly names such as pdf-parser rather than fully qualified class names. A friendly name identifies the configured component itself (for example, the parser), not its configuration class.
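As a rough illustration of the idea, a friendly name can be thought of as a key in a lookup table that resolves to a class name. The sketch below is a minimal stand-in written against the JDK only; FriendlyNameRegistry and its methods are hypothetical names invented for this example and are not Tika's actual registry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for illustration; not the Tika component registry.
final class FriendlyNameRegistry {
    private final Map<String, String> nameToClass = new ConcurrentHashMap<>();

    // Associates a friendly name with a class name; conflicting re-registration is an error.
    void register(String friendlyName, String className) {
        String previous = nameToClass.putIfAbsent(friendlyName, className);
        if (previous != null && !previous.equals(className)) {
            throw new IllegalArgumentException("Friendly name already in use: " + friendlyName);
        }
    }

    // Resolves a friendly name to the registered class name, failing loudly if unknown.
    String resolve(String friendlyName) {
        String className = nameToClass.get(friendlyName);
        if (className == null) {
            throw new IllegalArgumentException("Unknown friendly name: " + friendlyName);
        }
        return className;
    }
}
```

With such a table, a configuration file only needs to mention pdf-parser, and the mapping to a concrete class stays in one place.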

Custom Class Support

The design permits users to add custom classes through Jackson’s polymorphic handling:

  • org.apache.tika patterns are allowed by default

  • Users can define additional inclusion patterns; the restriction to listed patterns exists for security
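The allowlist idea can be sketched as a package-prefix check. This is a simplified illustration using only the JDK: PackageAllowlist is a hypothetical class, and treating a pattern as a plain package prefix is an assumption, since the pattern syntax is not specified here. Jackson itself offers BasicPolymorphicTypeValidator for this kind of check; whether Tika builds on it is not stated in this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: patterns are assumed to be simple package prefixes.
final class PackageAllowlist {
    private final List<String> prefixes = new ArrayList<>();

    PackageAllowlist() {
        // Default taken from the text: org.apache.tika is allowed.
        prefixes.add("org.apache.tika.");
    }

    // Adds a user-defined prefix, normalized to end with a dot so that
    // "com.example" does not accidentally match "com.examplemalicious".
    void allow(String packagePrefix) {
        prefixes.add(packagePrefix.endsWith(".") ? packagePrefix : packagePrefix + ".");
    }

    // Returns true only if the class name starts with one of the allowed prefixes.
    boolean isAllowed(String className) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

A deserializer would consult a check like isAllowed before instantiating a class named in incoming JSON and refuse anything that does not match.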

Configuration Consistency

The approach seeks to make initialization and runtime configuration look exactly the same and use the same underlying code where possible. However, security constraints may require differences in which fields are modifiable at runtime.

Configuration Objects Over Annotations

Config objects are preferred over field annotations to support multithreading. Parsers retrieve their settings from the ParseContext at runtime.
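One way to read this: a shared parser instance holds no per-request settings, and each request carries its own config object in a class-keyed context. The sketch below uses simplified stand-ins (SketchContext, SketchPdfConfig, SketchParser, and the extractInlineImages field are all hypothetical) rather than the real ParseContext or parser classes, so it shows the pattern only, not Tika's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for a class-keyed context; not the real ParseContext.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new ConcurrentHashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the value stored under the key, or the default if none was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

// Hypothetical immutable config object; the field name is illustrative only.
final class SketchPdfConfig {
    final boolean extractInlineImages;

    SketchPdfConfig(boolean extractInlineImages) {
        this.extractInlineImages = extractInlineImages;
    }
}

// The parser keeps only immutable defaults, so one instance can serve many threads.
final class SketchParser {
    private static final SketchPdfConfig DEFAULTS = new SketchPdfConfig(false);

    String parse(SketchContext context) {
        // Per-request settings come from the context, not from parser fields.
        SketchPdfConfig config = context.get(SketchPdfConfig.class, DEFAULTS);
        return config.extractInlineImages ? "text+images" : "text";
    }
}
```

Because the settings travel with the request, two threads can call the same parser with different configurations without interfering with each other.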

Cross-System Configuration Flow

Configuration must pass seamlessly through each of the following, in order:

  1. User clients

  2. tika-server REST APIs

  3. tika-pipes infrastructure

Initialization Structure

Tier 1 Objects

ID Objects

Fetchers, emitters - components with unique identifiers

Composite Objects

Parsers, detectors - components that aggregate other components

Single Objects

Pipes, gRPC, server configurations

Tier 2 Objects

Components that can be read via friendly names using @TikaComponent annotations in an other-config section.

Runtime Patterns

Backwards Compatibility

The design maintains backwards compatibility by continuing to allow objects to be added to the ParseContext, with the interface serving as the key.

Partial Configuration Updates

Users can supply only the changes to the initialization configuration, as partial JSON objects, rather than a complete configuration document.
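The effect of a partial update can be sketched as a recursive merge of the update onto the initialization configuration. The example below models JSON objects as nested java.util.Map instances to stay within the JDK; PartialConfigMerge is a hypothetical helper, and the exact merge semantics used by Tika (for instance, how arrays or nulls are treated) are not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: JSON objects are modeled as nested maps.
final class PartialConfigMerge {

    // Returns a new map in which values from update override those in base.
    // Nested objects are merged key by key; all other values are replaced.
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> update) {
        Map<String, Object> result = new LinkedHashMap<>(base);
        for (Map.Entry<String, Object> entry : update.entrySet()) {
            Object existing = result.get(entry.getKey());
            Object incoming = entry.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                result.put(entry.getKey(),
                        merge((Map<String, Object>) existing, (Map<String, Object>) incoming));
            } else {
                result.put(entry.getKey(), incoming);
            }
        }
        return result;
    }
}
```

An update that names a single setting under pdf-parser changes only that setting; every other value from initialization is carried over, and the base map itself is left unmodified.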

Self-Configuring Components in Pipes

In the pipes infrastructure, objects should configure themselves to avoid classloading dependencies on components like PDFParser.

Security Considerations

  • Configuration files at initialization are treated as trusted sources

  • Runtime serialization/deserialization uses an allowlist of permitted packages

  • Custom components can register patterns in META-INF/tika-serialization-allowlist.txt
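To make the registration step concrete, the sketch below parses the text of an allowlist file into a list of patterns. The file format is not described here, so the sketch assumes one pattern per line, with blank lines and lines starting with # ignored; both the format and the AllowlistFileParser class are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser; assumes one pattern per line and '#' comment lines.
final class AllowlistFileParser {

    // Extracts the non-empty, non-comment lines from the file content.
    static List<String> parsePatterns(String fileContent) {
        List<String> patterns = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blanks and comments (assumed convention)
            }
            patterns.add(trimmed);
        }
        return patterns;
    }
}
```

In practice, the content would come from every META-INF/tika-serialization-allowlist.txt resource found on the classpath (for example via ClassLoader.getResources), and the resulting patterns would be added to the defaults.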

See Design Notes for 4.x for additional architectural context.