Serialization and Configuration
Tika 4.x uses JSON-based configuration and serialization throughout the system. This document explains how the serialization system works and how to create components that integrate with it.
Overview
Tika’s serialization system provides:
- JSON Configuration: Configure Tika components using JSON files
- Friendly Names: Reference components by name (e.g., pdf-parser) instead of class names
- ParseContext Serialization: Send per-request configuration via FetchEmitTuple
- Security: Only registered components can be instantiated from JSON
The system is built on Jackson with custom serializers/deserializers in the
tika-serialization module.
JSON Configuration Format
Tika uses a compact format for component configuration:
{
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "commons-digester-factory": {
      "digests": [
        { "algorithm": "MD5" },
        { "algorithm": "SHA256" }
      ]
    }
  }
}
Components can be specified as:
- String: "pdf-parser" - creates instance with defaults
- Object: {"pdf-parser": {"ocrStrategy": "AUTO"}} - creates configured instance
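The two forms can be normalized into one shape before a component is created. The following is a minimal sketch only, not the code in tika-serialization: the class ComponentSpecs and its parse method are hypothetical, and the JSON is assumed to have already been read into plain String and Map values.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper, not part of Tika: normalizes the two spec forms.
final class ComponentSpecs {

    /**
     * Returns (friendlyName, settings) for either form:
     * a String such as "pdf-parser", or a single-key Map such as
     * {"pdf-parser": {"ocrStrategy": "AUTO"}}.
     */
    static Map.Entry<String, Map<String, Object>> parse(Object node) {
        if (node instanceof String) {
            // String form: no settings, so the component keeps its defaults.
            return Map.entry((String) node, Collections.<String, Object>emptyMap());
        }
        if (node instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) node;
            if (map.size() != 1) {
                throw new IllegalArgumentException("Expected exactly one component name");
            }
            Map.Entry<?, ?> only = map.entrySet().iterator().next();
            if (!(only.getKey() instanceof String) || !(only.getValue() instanceof Map)) {
                throw new IllegalArgumentException("Expected {\"name\": {settings}}");
            }
            @SuppressWarnings("unchecked")
            Map<String, Object> settings = (Map<String, Object>) only.getValue();
            return Map.entry((String) only.getKey(), settings);
        }
        throw new IllegalArgumentException("Unsupported component spec: " + node);
    }
}
```

Treating the string form as an object form with empty settings lets the rest of the loading code handle a single case.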
The @TikaComponent Annotation
The @TikaComponent annotation is required for any class that should be
configurable via JSON. It serves multiple purposes:
- Registration: Registers the class with a friendly name
- Index Generation: Creates lookup files for name-to-class resolution
- SPI Registration: Optionally registers for Java ServiceLoader
- Security: Acts as an allowlist for deserialization
Basic Usage
@TikaComponent
public class MyCustomParser implements Parser {
    // Parser implementation
}
This automatically:
- Generates friendly name my-custom-parser from the class name
- Adds to META-INF/tika/parsers.idx for name lookup
- Adds to META-INF/services/org.apache.tika.parser.Parser for SPI
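The name derivation can be pictured as a camel-case to kebab-case conversion. This is a sketch under assumptions: FriendlyNames is a hypothetical class, and the treatment of acronyms is inferred only from the names used in this document (PDFParser and pdf-parser), not from the annotation processor itself.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: derives a kebab-case name.
final class FriendlyNames {

    static String toFriendlyName(String simpleClassName) {
        return simpleClassName
                // Split an acronym from the word that follows it: PDFParser -> PDF-Parser
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                // Split at each lower-to-upper boundary: MyCustom -> My-Custom
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }
}
```

Under these rules MyCustomParser becomes my-custom-parser and CommonsDigesterFactory becomes commons-digester-factory, matching the names used elsewhere in this document.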
Annotation Attributes
| Attribute | Default | Description |
|---|---|---|
| | (auto-generated) | Custom friendly name instead of deriving from class name |
| | | Whether to register in |
| | (auto-detected) | Class to use as ParseContext key (rarely needed) |
| | (none) | Marks as default implementation for an interface |
Context Key Detection
When storing components in ParseContext, Tika needs to know which class
to use as the lookup key. For example, CommonsDigesterFactory should be
retrievable via parseContext.get(DigesterFactory.class).
Automatic Detection
Tika automatically detects the context key by checking if your class implements one of these known interfaces:
- Parser, Detector, EncodingDetector
- MetadataFilter, Translator, Renderer
- DigesterFactory, ContentHandlerFactory
- EmbeddedDocumentExtractorFactory, MetadataWriteLimiterFactory
@TikaComponent
public class CommonsDigesterFactory implements DigesterFactory {
    // Context key automatically detected as DigesterFactory.class
}
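Detection of this kind can be sketched as a scan over a fixed list of known interfaces. The code below is illustrative only: ContextKeys is a hypothetical class, the nested interfaces are empty stand-ins for the real Tika types so that the example compiles on its own, and falling back to the component's own class is an assumption rather than documented behavior.

```java
import java.util.List;

// Hypothetical sketch, not part of Tika: resolves a ParseContext key.
final class ContextKeys {

    // Empty stand-ins for the real Tika interfaces, declared here only for the example.
    interface DigesterFactory {}
    interface ContentHandlerFactory {}

    static class CommonsDigesterFactory implements DigesterFactory {}
    static class Unrelated {}

    private static final List<Class<?>> KNOWN_INTERFACES =
            List.of(DigesterFactory.class, ContentHandlerFactory.class);

    /** Returns the first known interface the component implements. */
    static Class<?> contextKeyFor(Class<?> component) {
        for (Class<?> known : KNOWN_INTERFACES) {
            if (known.isAssignableFrom(component)) {
                return known;
            }
        }
        // Assumption: with no known interface, the class itself is the key.
        return component;
    }
}
```

With this lookup, a CommonsDigesterFactory instance would be stored under DigesterFactory.class, which is what makes parseContext.get(DigesterFactory.class) find it.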
Service Interface Categories
First-Class Service Interfaces
These are loaded via SPI and have dedicated index files:
| Interface | Index File |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
ParseContext Components
Components not implementing first-class interfaces go to parse-context.idx:
- DigesterFactory - Digest/checksum calculation
- ContentHandlerFactory - SAX content handler creation
- EmbeddedDocumentExtractorFactory - Embedded document handling
- MetadataWriteLimiterFactory - Metadata write limiting
Self-Configuring Components
Components implementing SelfConfiguring handle their own configuration
at runtime rather than during initial loading:
@TikaComponent
public class PDFParser extends AbstractParser implements SelfConfiguring {
    private PDFParserConfig defaultConfig = new PDFParserConfig();

    @Override
    public void configure(ParseContext parseContext) {
        PDFParserConfig config = ParseContextConfig.getConfig(
                parseContext, "pdf-parser", PDFParserConfig.class, defaultConfig);
        // Use config...
    }
}
Benefits:
- Per-request configuration via ParseContext
- Lazy loading - config only parsed when needed
- Merging with defaults handled automatically
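The merge with defaults can be pictured as per-request settings layered over a copy of the default settings. This is a simplified sketch, assuming a shallow key-by-key merge over plain maps. ConfigMerge is a hypothetical class, and the real system works with typed config objects such as PDFParserConfig rather than maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of Tika: layers overrides on defaults.
final class ConfigMerge {

    /** Returns a new map: defaults first, then per-request overrides on top. */
    static Map<String, Object> merge(Map<String, Object> defaults,
                                     Map<String, Object> overrides) {
        Map<String, Object> merged = new LinkedHashMap<>(defaults);
        if (overrides != null) {
            // Keys present in the request replace the defaults; all others are kept.
            merged.putAll(overrides);
        }
        return merged;
    }
}
```

A request that sets only ocrStrategy would therefore keep the default value of extractInlineImages, and the defaults themselves are never modified.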
ParseContext Serialization
ParseContext can be serialized to JSON for transmission (e.g., in FetchEmitTuple):
{
  "parseContext": {
    "pdf-parser": {
      "ocrStrategy": "AUTO",
      "extractInlineImages": true
    },
    "commons-digester-factory": {
      "digests": [{"algorithm": "SHA256"}]
    }
  }
}
Security Model
The serialization system implements a security allowlist:
- @TikaComponent Required: Only annotated classes are registered
- Registry Lookup: Deserialization only instantiates registered classes
- No Arbitrary Classes: Unknown class names cause errors, not instantiation
This prevents attacks where malicious JSON specifies dangerous classes for instantiation.
// This will FAIL - class not registered
{
  "parse-context": {
    "java.lang.Runtime": {}  // Error: Unknown component
  }
}
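The allowlist idea can be sketched as a registry that creates objects only from names it already knows. The code below is an illustration, not Tika's registry: ComponentRegistry and its methods are hypothetical, and the error text simply echoes the "Unknown component name" message mentioned under Troubleshooting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch, not part of Tika: name-based allowlist.
final class ComponentRegistry {

    private final Map<String, Supplier<?>> factories = new HashMap<>();

    /** Registers a friendly name, as the generated index files would. */
    void register(String friendlyName, Supplier<?> factory) {
        factories.put(friendlyName, factory);
    }

    boolean isRegistered(String name) {
        return factories.containsKey(name);
    }

    /** Instantiates a registered component and rejects everything else. */
    Object instantiate(String name) {
        Supplier<?> factory = factories.get(name);
        if (factory == null) {
            // The name is never passed to Class.forName, so untrusted JSON
            // cannot select an arbitrary class.
            throw new IllegalArgumentException("Unknown component name: " + name);
        }
        return factory.get();
    }
}
```

The key design point is that the JSON supplies only a lookup key. The set of classes that can be instantiated is fixed when the registry is built, not by the input.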
Creating a Custom Component
Complete example of a custom metadata filter:
package com.example.tika;

import java.util.Locale;

import org.apache.tika.config.TikaComponent;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.filter.MetadataFilter;

@TikaComponent
public class UpperCaseFilter implements MetadataFilter {
    // Metadata field to upper-case; configurable from JSON via "fieldName"
    private String fieldName = "title";

    public void setFieldName(String fieldName) {
        this.fieldName = fieldName;
    }

    public String getFieldName() {
        return fieldName;
    }

    @Override
    public void filter(Metadata metadata) throws TikaException {
        String value = metadata.get(fieldName);
        if (value != null) {
            // Locale.ROOT keeps the result independent of the JVM's default locale
            metadata.set(fieldName, value.toUpperCase(Locale.ROOT));
        }
    }
}
Configure in JSON:
{
  "metadata-filters": [
    {"upper-case-filter": {"fieldName": "dc:title"}}
  ]
}
Or with defaults:
{
  "metadata-filters": ["upper-case-filter"]
}
Troubleshooting
"Unknown component name" Error
- Ensure class has @TikaComponent annotation
- Verify annotation processing ran during compilation
- Check that META-INF/tika/*.idx file exists in JAR
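The last check can be automated by listing the index entries inside a JAR. This is a small sketch that uses only the JDK's java.util.jar API. IndexFileCheck is a hypothetical helper, and the path pattern is taken from the META-INF/tika/*.idx location named above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tika: finds index files in a JAR.
final class IndexFileCheck {

    /** True for entries such as META-INF/tika/parsers.idx. */
    static boolean isIndexEntry(String entryName) {
        return entryName.startsWith("META-INF/tika/") && entryName.endsWith(".idx");
    }

    /** Lists the index entries in the given JAR, sorted by name. */
    static List<String> indexEntries(String jarPath) {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                    .map(JarEntry::getName)
                    .filter(IndexFileCheck::isIndexEntry)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An empty result for a JAR that contains annotated classes suggests that annotation processing did not run when the JAR was built.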