Design Notes for Tika 4.x

This document captures the design decisions and architectural changes in Apache Tika 4.x.

Metadata Keys

Tika 4.x namespaces metadata keys to address a security concern: without namespaces, user-controlled data could overwrite existing values in the Metadata object.
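
The collision that namespacing prevents can be sketched with a plain map. This is a minimal illustration, not Tika's Metadata API: the "user:" prefix, the class name, and the key names are hypothetical stand-ins for whatever namespaces Tika 4.x actually assigns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NamespacedKeysSketch {

    // Hypothetical namespace for values that come from the document itself.
    static final String USER_PREFIX = "user:";

    // Unsafe: a user-controlled key is written as-is and may replace a parser-set value.
    static void putRaw(Map<String, String> metadata, String userKey, String value) {
        metadata.put(userKey, value);
    }

    // Safer: the user-controlled key is moved into its own namespace first.
    static void putNamespaced(Map<String, String> metadata, String userKey, String value) {
        metadata.put(USER_PREFIX + userKey, value);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf"); // set by the parser

        // A custom property inside the file happens to be named "Content-Type".
        putNamespaced(metadata, "Content-Type", "text/html");

        System.out.println(metadata.get("Content-Type"));      // application/pdf
        System.out.println(metadata.get("user:Content-Type")); // text/html
    }
}
```

With putRaw in place of putNamespaced, the second write would replace the parser's value, which is the failure mode the namespaced keys are meant to rule out.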

See Migrating to Tika 4.x for details on specific metadata key changes.

Fat Jars and Maven Shade Strategy

Tika 4.x moves away from fat jar/shaded artifacts. tika-app and tika-server now ship with separate lib and plugins directories alongside the jar file, which allows standard java -jar execution.

Plugins and PF4J Framework

Plugin Packaging

PF4J plugins are packaged exclusively as zips (not jars) to align with the move away from fat jars. Custom code addresses race conditions during the unzipping process across threads and processes.
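
One common way to make extraction safe across threads and processes is to unpack into a private temporary directory and then publish it with an atomic rename. The sketch below assumes that technique. It is not Tika's actual code, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class SafeUnzip {

    // Extracts zipFile into targetDir; concurrent callers never see a half-written targetDir.
    static Path extractOnce(Path zipFile, Path targetDir) throws IOException {
        Path target = targetDir.toAbsolutePath().normalize();
        if (Files.isDirectory(target)) {
            return target; // an earlier caller already published the directory
        }
        // Each caller unpacks into its own temp directory next to the target.
        Path tmp = Files.createTempDirectory(target.getParent(), "unzip-");
        try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = tmp.resolve(entry.getName()).normalize();
                if (!out.startsWith(tmp)) { // reject entries such as ../../evil
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zin, out);
                }
            }
        }
        try {
            // Publish in one step; readers see either nothing or the complete directory.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            deleteRecursively(tmp);
            if (!Files.isDirectory(target)) {
                throw e; // a real failure, not a lost race
            }
        }
        return target;
    }

    // Deletes children before parents by visiting paths in reverse order.
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

A caller that loses the race discards its own copy and uses the directory published by the winner, so the work may be duplicated but the result is never partial.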

Classloader Management

PF4J’s default classpath loading is disabled to avoid complexity in unit tests. A configured plugins directory is now required.

Keeping a strict boundary between the core and plugin classloaders prevents issues when the same library is loaded separately on each side. For example, JSON strings are passed instead of JsonNode objects, because a plugin may load its own copy of Jackson independently.

The plugins are designed to carry as few Tika dependencies as possible.

Serialization Architecture

Design Principles

  • Maximize Jackson usage while minimizing custom serialization code

  • Exclude Jackson from tika-core and tika-parsers-standard-modules dependencies

  • Enable runtime configuration updates via Jackson’s readerForUpdating

Security Model

Configuration files at initialization are treated as trusted sources. Runtime serialization/deserialization uses an allowlist of permitted packages via PolymorphicObjectMapperFactory.

Custom components can add patterns to META-INF/tika-serialization-allowlist.txt.
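
A package allowlist of this kind can be sketched as a prefix check on the class name. The file format assumed here (one package prefix per line, with blank lines and # comments ignored) is a guess for illustration, and the class below is hypothetical; it is not the logic inside PolymorphicObjectMapperFactory.

```java
import java.util.ArrayList;
import java.util.List;

class AllowlistSketch {

    private final List<String> allowedPrefixes = new ArrayList<>();

    // Assumed format: one package prefix per line; blank lines and '#' comments are skipped.
    AllowlistSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Force a trailing dot so "com.example.parsers" cannot match "com.example.parsersevil".
            allowedPrefixes.add(trimmed.endsWith(".") ? trimmed : trimmed + ".");
        }
    }

    // Returns true only if the fully qualified class name sits under an allowed package.
    boolean isAllowed(String className) {
        for (String prefix : allowedPrefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AllowlistSketch allowlist = new AllowlistSketch(List.of(
                "# packages trusted for polymorphic deserialization",
                "org.apache.tika.",
                "com.example.parsers"));
        System.out.println(allowlist.isAllowed("org.apache.tika.parser.pdf.PDFParser")); // true
        System.out.println(allowlist.isAllowed("com.example.parsersevil.Gadget"));       // false
        System.out.println(allowlist.isAllowed("javax.naming.InitialContext"));          // false
    }
}
```

Anything not explicitly listed is rejected, so a class name arriving in runtime input cannot select an arbitrary type on the classpath.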

Implementation Challenges

  • Converted code to true Java beans with matching getters/setters

  • Used ObjectMapper.DefaultTyping.OBJECT_AND_NON_CONCRETE for polymorphic typing

  • Replaced generic collections (List, Set) with concrete types (ArrayList, HashSet)

  • Converted Path fields to String due to Jackson constraints

  • Avoided Java records to enable readerForUpdating functionality
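
Jackson's readerForUpdating applies incoming values to an existing object instead of constructing a new one, which fits mutable beans with setters and does not fit immutable records. The stdlib-only sketch below mimics that update-in-place behavior with reflection so the idea can be seen without Jackson on the classpath. It is not Jackson's implementation, and ParserConfig and its properties are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.Map;

class BeanUpdateSketch {

    // Hypothetical config bean with matching getters and setters.
    public static class ParserConfig {
        private int maxDepth = 10;
        private String outputDir = "out";

        public int getMaxDepth() { return maxDepth; }
        public void setMaxDepth(int maxDepth) { this.maxDepth = maxDepth; }
        public String getOutputDir() { return outputDir; }
        public void setOutputDir(String outputDir) { this.outputDir = outputDir; }
    }

    // Calls setX(value) for each supplied property x; other properties keep their current values.
    static void applyOverrides(Object bean, Map<String, Object> overrides)
            throws ReflectiveOperationException {
        for (Method m : bean.getClass().getMethods()) {
            String name = m.getName();
            if (name.length() > 3 && name.startsWith("set") && m.getParameterCount() == 1) {
                String property = Character.toLowerCase(name.charAt(3)) + name.substring(4);
                if (overrides.containsKey(property)) {
                    m.invoke(bean, overrides.get(property));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ParserConfig config = new ParserConfig();
        Map<String, Object> overrides = Map.of("maxDepth", 3);
        applyOverrides(config, overrides);
        System.out.println(config.getMaxDepth());  // 3
        System.out.println(config.getOutputDir()); // out
    }
}
```

A record has no setters to call, so an update like this would have to rebuild the whole object, which is the constraint the last bullet refers to.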

Annotations System

The @TikaComponent annotation handles:

  • Automatic service file generation at build time

  • Creation of META-INF/tika/*.idx mapping files

  • Kebab-case conversion of class names to friendly identifiers (e.g., PDFParser → pdf-parser)

  • Manual name overrides via the name attribute

  • Optional spi=false setting for non-service-file registration
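
The kebab-case conversion in the list above can be approximated with two regular-expression passes. Only the PDFParser to pdf-parser mapping comes from these notes; the acronym-splitting rules and the class name KebabNames are assumptions, and the real annotation processor may treat edge cases differently.

```java
import java.util.Locale;

class KebabNames {

    // Converts a simple class name to a kebab-case identifier.
    static String toKebabCase(String simpleName) {
        return simpleName
                // Split an acronym from the capitalized word after it: PDFParser -> PDF-Parser
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                // Split a lowercase letter or digit from the capital after it: HtmlParser -> Html-Parser
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toKebabCase("PDFParser"));  // pdf-parser
        System.out.println(toKebabCase("HtmlParser")); // html-parser
    }
}
```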

Migration Strategy

The plan is to stabilize the 4.x structures first, then backport capabilities to 3.x and deprecate TikaConfig and tika-config.xml.

A converter tool for transforming tika-config.xml to tika-config.json is planned, with support focused on components in tika-parsers-standard-modules.

Development Tips

Common Issues

  • Plugin directories and @TikaComponent annotations becoming out of sync across modules

  • IntelliJ conflicts with command-line builds

  • Checkstyle running before Spotless, causing preventable failures

For faster builds during development:

mvn clean install -am -pl :tika-app -Pfast

To apply formatting and build:

mvn clean spotless:apply install

Outstanding Tasks

  • Implement flexible component loading without @TikaComponent requirements

  • Enable friendly name usage throughout the codebase

  • Resolve gRPC issues

  • Fix external renderer byte-passing in open containers

  • Simplify and strengthen serialization code

  • Consider relocating TikaConfig and ForkParser to a legacy module