Design Notes for Tika 4.x
This document captures the design decisions and architectural changes in Apache Tika 4.x.
Metadata Keys
The design addresses security concerns by implementing namespaced metadata keys. This prevents user-controlled data from potentially overwriting existing metadata values in the Metadata object.
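The idea can be illustrated with a minimal sketch. This is not Tika's Metadata API: the class name, the method names, and the "user:" prefix are all hypothetical and only show how a namespace keeps user-supplied keys from colliding with keys set by the framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; Tika's real Metadata class and key names differ.
final class NamespacedMetadata {
    // Assumed prefix for user-controlled keys (not taken from Tika).
    static final String USER_PREFIX = "user:";

    private final Map<String, String> values = new LinkedHashMap<>();

    // Keys written by trusted framework code are stored as-is.
    void setInternal(String key, String value) {
        values.put(key, value);
    }

    // Keys that originate from user input are always prefixed,
    // so they can never overwrite an internal key.
    void setUserSupplied(String key, String value) {
        values.put(USER_PREFIX + key, value);
    }

    String get(String key) {
        return values.get(key);
    }
}
```

With this shape, a user-supplied Content-Type lands under user:Content-Type and the value detected by the framework stays intact.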
See Migrating to Tika 4.x for details on specific metadata key changes.
Fat Jars and Maven Shade Strategy
Tika 4.x moves away from fat jar/shaded artifacts. The tika-app and tika-server now use
separate lib and plugins directories alongside the jar file, enabling standard java -jar
execution.
Plugins and PF4J Framework
Plugin Packaging
PF4J plugins are packaged exclusively as zips (not jars) to align with the move away from fat jars. Custom code addresses race conditions during the unzipping process across threads and processes.
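One common way to guard extraction against both kinds of race is sketched below. It is not the code used in Tika: the class name, the marker file name, and the lock file location are assumptions. A JVM-wide monitor serializes threads, a file lock serializes processes, and a marker file records that extraction finished.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch; names and file layout are assumptions, not Tika's implementation.
final class PluginUnzipper {
    // File locks are held per JVM, so threads in one JVM need their own guard.
    private static final Object JVM_WIDE_LOCK = new Object();
    private static final String MARKER = ".extracted";

    static Path extractOnce(Path zip, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Path marker = root.resolve(MARKER);
        if (Files.exists(marker)) {
            return root; // fast path: a previous caller already finished
        }
        Files.createDirectories(root);
        Path lockFile = root.resolveSibling(root.getFileName() + ".lock");
        synchronized (JVM_WIDE_LOCK) {
            try (FileChannel channel = FileChannel.open(lockFile,
                         StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock ignored = channel.lock()) { // blocks other processes
                if (Files.exists(marker)) {
                    return root; // another thread or process won the race
                }
                try (ZipInputStream in = new ZipInputStream(Files.newInputStream(zip))) {
                    for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                        Path out = root.resolve(e.getName()).normalize();
                        if (!out.startsWith(root)) { // reject entries that escape the target
                            throw new IOException("Entry escapes target: " + e.getName());
                        }
                        if (e.isDirectory()) {
                            Files.createDirectories(out);
                        } else {
                            Files.createDirectories(out.getParent());
                            Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                        }
                    }
                }
                Files.createFile(marker); // written last, so it implies a complete extraction
            }
        }
        return root;
    }
}
```

Writing the marker last means a crash during extraction leaves no marker, and the next caller simply extracts again over the partial output.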
Classloader Management
The team disabled PF4J’s default classpath loading to avoid complexity in unit tests. A configured plugins directory is now required.
This strict classloader boundary prevents issues when components are loaded separately. For example,
JSON strings are passed instead of JsonNode objects to avoid problems when plugins load Jackson independently.
Note: We tried to have as few Tika dependencies in the plugins as possible.
Serialization Architecture
Design Principles
- Maximize Jackson usage while minimizing custom serialization code
- Exclude Jackson from tika-core and tika-parsers-standard-modules dependencies
- Enable runtime configuration updates via Jackson’s readerForUpdating
Security Model
Configuration files at initialization are treated as trusted sources. Runtime
serialization/deserialization uses an allowlist of permitted packages via
PolymorphicObjectMapperFactory.
Custom components can add patterns to META-INF/tika-serialization-allowlist.txt.
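The package check could look roughly like the sketch below. The file format assumed here (one package prefix per line, blank lines and # comments ignored) and the class and method names are hypothetical; the actual pattern syntax accepted by PolymorphicObjectMapperFactory may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an allowlist check; not Tika's PolymorphicObjectMapperFactory.
final class PackageAllowlist {
    private final List<String> prefixes = new ArrayList<>();

    // Assumed format: one package prefix per line; '#' starts a comment line.
    static PackageAllowlist parse(String fileContent) {
        PackageAllowlist allowlist = new PackageAllowlist();
        for (String line : fileContent.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // A trailing dot stops "org.apache.tika" from matching "org.apache.tikaevil".
            allowlist.prefixes.add(trimmed.endsWith(".") ? trimmed : trimmed + ".");
        }
        return allowlist;
    }

    // True only if the class lives in one of the allowed packages or a subpackage.
    boolean isAllowed(String className) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

A deserializer would consult isAllowed before instantiating any type named in untrusted input and reject everything else.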
Implementation Challenges
- Converted code to true Java beans with matching getters/setters
- Used ObjectMapper.DefaultTyping.OBJECT_AND_NON_CONCRETE for polymorphic typing
- Replaced generic collections (List, Set) with concrete types (ArrayList, HashSet)
- Converted Path fields to String due to Jackson constraints
- Avoided Java records to enable readerForUpdating functionality
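A configuration class that follows these conventions might be shaped like the sketch below. FetcherSettings and its fields are hypothetical and not taken from the Tika codebase; the point is the mutable bean with a no-arg constructor, a concrete collection type, and a path stored as a String.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;

// Hypothetical configuration bean; not a class from the Tika codebase.
final class FetcherSettings {
    // Stored as String rather than Path, per the list above.
    private String basePath = "";
    // Concrete ArrayList rather than the List interface.
    private ArrayList<String> extensions = new ArrayList<>();

    // No-arg constructor plus matching getters and setters make this a plain,
    // mutable bean, which lets an existing instance be updated in place.
    public FetcherSettings() {
    }

    public String getBasePath() {
        return basePath;
    }

    public void setBasePath(String basePath) {
        this.basePath = basePath;
    }

    public ArrayList<String> getExtensions() {
        return extensions;
    }

    public void setExtensions(ArrayList<String> extensions) {
        this.extensions = extensions;
    }

    // Convenience conversion for callers; not named get*, so it is not a bean property.
    public Path resolveBasePath() {
        return Paths.get(basePath);
    }
}
```

A record would not fit here because its fields are final, so there is nothing for an in-place update to set.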
Annotations System
The @TikaComponent annotation handles:
- Automatic service file generation at build time
- Creation of META-INF/tika/*.idx mapping files
- Kebab-case conversion of class names to friendly identifiers (e.g., PDFParser → pdf-parser)
- Manual name overrides via the name attribute
- Optional spi=false setting for non-service-file registration
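The kebab-case conversion can be sketched with two regular expressions. This is an illustrative approach with hypothetical names, not the annotation processor's actual code, and the real rules may treat edge cases (digits, consecutive acronyms) differently.

```java
import java.util.Locale;

// Hypothetical helper; the real @TikaComponent processor may use different rules.
final class FriendlyNames {
    private FriendlyNames() {
    }

    static String toKebabCase(String simpleClassName) {
        return simpleClassName
                // lowercase letter or digit followed by uppercase: "HtmlParser" -> "Html-Parser"
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                // acronym followed by a capitalized word: "PDFParser" -> "PDF-Parser"
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }
}
```

For example, FriendlyNames.toKebabCase("PDFParser") yields pdf-parser, matching the example in the list above.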
Migration Strategy
The plan is to stabilize 4.x structures before backporting capabilities to 3.x and deprecating
TikaConfig and tika-config.xml.
A converter tool for transforming tika-config.xml to tika-config.json is planned, with
support focused on components in tika-parsers-standard-modules.
Development Tips
Outstanding Tasks
- Implement flexible component loading without @TikaComponent requirements
- Enable friendly name usage throughout the codebase
- Resolve gRPC issues
- Fix external renderer byte-passing in open containers
- Simplify and strengthen serialization code
- Consider relocating TikaConfig and ForkParser to legacy module