Serialization in Tika 4.x
This document describes the JSON serialization design and implementation details for Apache Tika 4.x.
High-Level Goals
Jackson Framework Integration
Use Jackson as much as possible with as few custom serializers and as few annotations as possible. Jackson dependencies are kept out of core modules to maintain flexibility.
Friendly Naming Conventions
The implementation uses friendly names such as pdf-parser rather than fully qualified class names. These friendly
names identify the configured components themselves, not their configuration classes.
Discovering the friendly name for a component
The 4.x JSON config refers to parsers, detectors, fetchers, emitters, and other components
by their friendly name (e.g., pdf-parser, file-system-fetcher). To map a Java class
to its friendly name (or vice versa), use any of:
-
tika-app --list-parser-names / --list-detector-names emits each registered class with its friendly name as tab-separated class<TAB>friendly-name:
    java -jar tika-app.jar --list-parser-names
    # org.apache.tika.parser.pdf.PDFParser       pdf-parser
    # org.apache.tika.parser.html.JSoupParser    jsoup-parser
    # ...
The mapping comes from the META-INF/tika/parsers.idx / detectors.idx files generated at compile time by the @TikaComponent annotation processor. The underlying lookup is o.a.t.config.loader.ComponentRegistry.getFriendlyName(Class).
-
Per-parser configuration pages under Configuration show the friendly name in their page title and JSON examples.
-
The naming convention: when @TikaComponent has no explicit name, the friendly name is derived from the class's simple name via the kebab-case rule in o.a.t.config.loader.KebabCaseConverter. Examples:
    Class                 Friendly name
    PDFParser             pdf-parser
    TesseractOCRParser    tesseract-ocr-parser
    AutoDetectParser      auto-detect-parser
    FileSystemFetcher     file-system-fetcher
    SolrEmitter           solr-emitter
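The kebab-case rule can be approximated in a few lines. The sketch below is not the source of KebabCaseConverter; it is an assumed two-step rule (split an acronym from the capitalized word that follows it, then split a lowercase letter or digit from the next capital) that happens to reproduce the example names listed above. The class name KebabCaseSketch is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative only: an assumed kebab-case rule, not Tika's KebabCaseConverter. */
class KebabCaseSketch {
    // Acronym followed by a capitalized word, e.g. "PDF" + "Parser".
    private static final Pattern ACRONYM_TO_WORD = Pattern.compile("([A-Z]+)([A-Z][a-z])");
    // Lowercase letter or digit followed by a capital, e.g. "o" + "D" in "AutoDetect".
    private static final Pattern LOWER_TO_UPPER = Pattern.compile("([a-z0-9])([A-Z])");

    static String toKebabCase(String simpleName) {
        // Insert hyphens at both kinds of word boundary, then lowercase everything.
        String s = ACRONYM_TO_WORD.matcher(simpleName).replaceAll("$1-$2");
        s = LOWER_TO_UPPER.matcher(s).replaceAll("$1-$2");
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] names = {
            "PDFParser", "TesseractOCRParser", "AutoDetectParser",
            "FileSystemFetcher", "SolrEmitter"
        };
        for (String name : names) {
            System.out.println(name + "\t" + toKebabCase(name));
        }
    }
}
```

Components that declare an explicit name in @TikaComponent bypass any such derivation, so the registry lookup remains the authoritative source.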
The --list-parsers, --list-detectors, and --list-parser-details commands
print the hierarchical, human-oriented view (class names with composite parsers
indented). Use the --list-*-names variants when you want a machine-readable mapping.
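As a minimal sketch of consuming the machine-readable form, the following parses class<TAB>friendly-name lines into a map. It assumes the command's output has already been captured as a string (for example by redirecting it to a file); the class FriendlyNameListing and its parse method are hypothetical helpers, not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses "class<TAB>friendly-name" lines into a map. */
class FriendlyNameListing {

    /** Returns class name -> friendly name, preserving the order of the input. */
    static Map<String, String> parse(String output) {
        Map<String, String> byClass = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            int tab = line.indexOf('\t');
            // Skip blank or malformed lines that lack a class or a name.
            if (tab <= 0 || tab == line.length() - 1) {
                continue;
            }
            byClass.put(line.substring(0, tab).trim(), line.substring(tab + 1).trim());
        }
        return byClass;
    }

    public static void main(String[] args) {
        // Sample input taken from the example output shown in this document.
        String sample = "org.apache.tika.parser.pdf.PDFParser\tpdf-parser\n"
                + "org.apache.tika.parser.html.JSoupParser\tjsoup-parser\n";
        parse(sample).forEach((cls, name) -> System.out.println(name + " <- " + cls));
    }
}
```

Inverting the map gives the friendly-name-to-class direction when that lookup is needed instead.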
Custom Class Support
The design permits users to add custom classes through Jackson’s polymorphic handling:
-
org.apache.tika patterns are allowed by default
-
Users can define additional inclusion patterns for security
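The allowlist idea can be illustrated without any Jackson dependency. The sketch below assumes that a "pattern" is a plain package prefix; the pattern syntax Tika actually accepts is not specified here, and the class ClassNameAllowlist is hypothetical. It only shows the shape of the check: org.apache.tika is permitted by default, and further prefixes are opt-in.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical prefix-based allowlist; the real pattern syntax may differ. */
class ClassNameAllowlist {
    private final List<String> prefixes = new ArrayList<>();

    ClassNameAllowlist() {
        // Default described in the text: org.apache.tika classes are allowed.
        prefixes.add("org.apache.tika.");
    }

    /** Adds a user-defined package prefix and returns this for chaining. */
    ClassNameAllowlist allow(String packagePrefix) {
        // A trailing dot keeps "com.example" from also matching "com.examplex".
        prefixes.add(packagePrefix.endsWith(".") ? packagePrefix : packagePrefix + ".");
        return this;
    }

    boolean isAllowed(String className) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ClassNameAllowlist allowlist = new ClassNameAllowlist().allow("com.example.parsers");
        System.out.println(allowlist.isAllowed("org.apache.tika.parser.pdf.PDFParser"));
        System.out.println(allowlist.isAllowed("com.example.parsers.MyParser"));
        System.out.println(allowlist.isAllowed("org.apache.tikaevil.Gadget"));
    }
}
```

Jackson ships a comparable mechanism, BasicPolymorphicTypeValidator, whose builder accepts allowed subtype prefixes; whether Tika wires its allowlist through that class is not stated in this document.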
Configuration Consistency
The approach seeks to make initialization and runtime configuration look exactly the same and use the same underlying code where possible. However, security constraints may require differences in which fields are modifiable at runtime.
Initialization Structure
Runtime Patterns
Backwards Compatibility
The design maintains backwards compatibility by allowing ParseContext additions where the
interface serves as the key.
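The "interface as key" pattern is a type-keyed container: a value is stored under the Class object of the interface it implements and retrieved with the same key. The stand-in below mirrors the set/get shape of ParseContext but is not Tika code; ContextSketch and the Greeter interface are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interface used as a lookup key in the example. */
interface Greeter {
    String greet();
}

/** Stand-in for a context object keyed by interface type; not Tika's ParseContext. */
class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    /** Stores the value under its interface; a null value removes the entry. */
    <T> void set(Class<T> key, T value) {
        if (value == null) {
            entries.remove(key);
        } else {
            entries.put(key, value);
        }
    }

    /** Returns the value registered for the interface, or null if none is set. */
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }

    public static void main(String[] args) {
        ContextSketch context = new ContextSketch();
        context.set(Greeter.class, () -> "hello");
        System.out.println(context.get(Greeter.class).greet());
    }
}
```

Because the key is the interface rather than the concrete class, callers can swap implementations without changing the lookup code.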
Security Considerations
-
Configuration files at initialization are treated as trusted sources
-
Runtime serialization/deserialization uses an allowlist of permitted packages
-
Custom components can register patterns in
META-INF/tika-serialization-allowlist.txt
See Design Notes for 4.x for additional architectural context.