Serialization in Tika 4.x
This document describes the JSON serialization design and implementation details for Apache Tika 4.x.
High-Level Goals
Jackson Framework Integration
Use Jackson as much as possible with as few custom serializers and as few annotations as possible. Jackson dependencies are kept out of core modules to maintain flexibility.
Friendly Naming Conventions
The implementation uses friendly names such as pdf-parser rather than fully qualified class names. These friendly
names identify the configured items themselves, not their configuration classes.
Custom Class Support
The design permits users to add custom classes through Jackson’s polymorphic handling:
- org.apache.tika patterns are allowed by default
- Users can define additional inclusion patterns for security
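The allowlist idea above can be sketched in plain Java. This is an illustration only, not Tika's implementation: the class name PackageAllowlist, its methods, and the assumption that a pattern is a simple package prefix are all hypothetical. Jackson itself offers prefix-based rules of this kind through BasicPolymorphicTypeValidator, but whether Tika wires it that way is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a package allowlist; not a Tika class.
final class PackageAllowlist {
    // Assumption: each pattern is a plain package prefix ending in '.'.
    private final List<String> prefixes = new ArrayList<>();

    PackageAllowlist() {
        // org.apache.tika is allowed by default, as described in the text.
        prefixes.add("org.apache.tika.");
    }

    // Register an additional inclusion pattern, e.g. "com.example.parsers".
    void allow(String prefix) {
        if (prefix == null || prefix.trim().isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        String trimmed = prefix.trim();
        // Normalize to a trailing dot so "org.apache.tika" cannot match
        // a look-alike package such as "org.apache.tikaevil".
        prefixes.add(trimmed.endsWith(".") ? trimmed : trimmed + ".");
    }

    // True only if the class name starts with one of the allowed prefixes.
    boolean isAllowed(String className) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, a class under org.apache.tika passes immediately, while a class in a user package is rejected until that package has been registered through allow.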
Configuration Consistency
The approach seeks to make initialization and runtime configuration look exactly the same and use the same underlying code where possible. However, security constraints may require differences in which fields are modifiable at runtime.
Initialization Structure
Runtime Patterns
Backwards Compatibility
The design maintains backwards compatibility by still allowing objects to be added to the ParseContext, with the
interface serving as the key.
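To make "the interface serves as the key" concrete, here is a minimal stand-in container. It is modeled on the set(Class, value) and get(Class) shape of ParseContext, but ContextSketch, PasswordSource, and FixedPassword are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a ParseContext-style container; not the Tika class.
final class ContextSketch {
    // Entries are stored under the name of the key class.
    private final Map<String, Object> entries = new HashMap<>();

    // Store a value under the interface (or class) it should be looked up by.
    <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    // Return the value registered for the key, or null if none was set.
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key.getName()));
    }
}

// Hypothetical interface used as the lookup key.
interface PasswordSource {
    String password();
}

// Hypothetical implementation; callers never need to know this type.
final class FixedPassword implements PasswordSource {
    private final String value;

    FixedPassword(String value) {
        this.value = value;
    }

    @Override
    public String password() {
        return value;
    }
}
```

Calling ctx.set(PasswordSource.class, new FixedPassword("secret")) and later ctx.get(PasswordSource.class) retrieves the object by its interface, so the concrete implementation can change without affecting code that reads the context.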
Security Considerations
- Configuration files at initialization are treated as trusted sources
- Runtime serialization/deserialization uses an allowlist of permitted packages
- Custom components can register patterns in META-INF/tika-serialization-allowlist.txt
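As a rough sketch of how the contents of such an allowlist file might be read, the snippet below parses text into a list of patterns. The file format is an assumption made for illustration (one pattern per line, with blank lines and lines starting with # ignored); the document does not specify it, and AllowlistFileSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for allowlist file contents; the format is assumed.
final class AllowlistFileSketch {
    // Assumed format: one pattern per line; blank lines and '#' comments are skipped.
    static List<String> parse(String content) {
        List<String> patterns = new ArrayList<>();
        // "\\R" matches any line break (LF, CRLF, and so on).
        for (String line : content.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            patterns.add(trimmed);
        }
        return patterns;
    }
}
```

In a real component the text would presumably come from the classpath, for example by reading each resource returned by ClassLoader.getResources for the path above, since several jars may each contribute their own file.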
See Design Notes for 4.x for additional architectural context.