Class TikaLoader
Usage:
TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
Parser parser = loader.loadParsers();
Detector detector = loader.loadDetectors();
JSON configuration format:
{
  "parsers": [
    {
      "pdf-parser": {
        "_mime-include": ["application/pdf"],
        "_mime-exclude": ["application/pdf+fdf"],
        "ocrStrategy": "AUTO",
        "extractInlineImages": true
      }
    }
  ],
  "detectors": [
    { "mime-magic-detector": { ... } }
  ]
}
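The `_mime-include` and `_mime-exclude` keys in the example read as a filter on the media types routed to a parser. The following is a minimal sketch of that idea in plain Java, assuming that a non-empty include list replaces the parser's default types and that the exclude list is then subtracted. The names `MimeFilterSketch` and `effectiveTypes` are hypothetical and not part of Tika, whose actual resolution logic may differ.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Tika's implementation.
final class MimeFilterSketch {

    // Assumption: a non-empty include list replaces the default types,
    // and the exclude list is removed from whatever remains.
    static Set<String> effectiveTypes(Set<String> defaults,
                                      List<String> include,
                                      List<String> exclude) {
        Set<String> result = include.isEmpty()
                ? new LinkedHashSet<>(defaults)
                : new LinkedHashSet<>(include);
        result.removeAll(exclude);
        return result;
    }

    public static void main(String[] args) {
        Set<String> defaults = new LinkedHashSet<>(
                List.of("application/pdf", "application/pdf+fdf"));
        Set<String> types = effectiveTypes(
                defaults,
                List.of("application/pdf"),
                List.of("application/pdf+fdf"));
        System.out.println(types);
    }
}
```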
-
Method Summary
Modifier and Type | Method | Description
<T> T | get(Class<T> componentClass) | Gets a component by its class type.
<T> T | get(String jsonField) | Gets a component by its JSON field name.
 | getClassLoader() | Gets the class loader used for loading components.
 | getConfig() | Gets the underlying JSON configuration.
 | getGlobalSettings() | Gets the global settings if they have been loaded.
static MediaTypeRegistry | getMediaTypeRegistry() | Gets the media type registry.
static MimeTypes | getMimeTypes() |
static TikaLoader | load(Path configPath) | Loads a Tika configuration from a file.
static TikaLoader | load(Path configPath, ClassLoader classLoader) | Loads a Tika configuration from a file with a specific class loader.
 | loadAutoDetectParser() | Loads and returns an AutoDetectParser configured with this loader's parsers and detectors.
<T> T | loadConfig(Class<T> clazz, T defaults) | Loads a configuration object from the "parse-context" section, merging with defaults.
<T> T | loadConfig(String key, Class<T> clazz, T defaults) | Loads a configuration object from the "parse-context" section by explicit key, merging with defaults.
 | loadContentHandlerFactory() | Loads and returns the content handler factory.
static TikaLoader | loadDefault() | Creates a default Tika loader with no configuration file.
static TikaLoader | loadDefault(ClassLoader classLoader) | Creates a default Tika loader with no configuration file and a specific class loader.
 | loadDetectors() | Loads and returns all detectors.
 | loadEncodingDetectors() | Loads and returns all encoding detectors.
 | loadGlobalSettings() | Loads global configuration settings from the JSON config.
 | loadMetadataFilters() | Loads and returns all metadata filters.
 | loadParseContext() | Loads and returns a ParseContext populated with components from the "parse-context" section.
 | loadParsers() | Loads and returns all parsers.
 | loadRenderers() | Loads and returns all renderers.
 | loadTranslator() | Loads and returns the translator.
void | save | Saves the current configuration to a JSON file (pretty-printed).
void | save(OutputStream outputStream) | Saves the current configuration to an output stream (pretty-printed).
 | toJson() | Converts the current configuration to a JSON string (pretty-printed).
-
Method Details
-
load
Loads a Tika configuration from a file. Global settings are automatically loaded and applied during initialization.
- Parameters:
configPath - the path to the JSON configuration file
- Returns:
- the Tika loader
- Throws:
TikaConfigException - if loading or parsing fails
IOException
-
load
public static TikaLoader load(Path configPath, ClassLoader classLoader) throws TikaConfigException, IOException
Loads a Tika configuration from a file with a specific class loader. Global settings are automatically loaded and applied during initialization.
- Parameters:
configPath - the path to the JSON configuration file
classLoader - the class loader to use for loading components
- Returns:
- the Tika loader
- Throws:
TikaConfigException - if loading or parsing fails
IOException
-
loadDefault
Creates a default Tika loader with no configuration file. All components (parsers, detectors, etc.) are loaded through SPI. Returns a cached instance if one has already been created.
- Returns:
- the Tika loader
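The "cached instance" behavior described for the no-argument `loadDefault` can be pictured as a lazily created static instance. This is only a sketch of that pattern in plain Java; `CachedDefaultSketch` is a hypothetical stand-in, and the source does not say how TikaLoader implements its cache.

```java
// Hypothetical stand-in for a loader whose default instance is created once and reused.
final class CachedDefaultSketch {

    private static CachedDefaultSketch defaultInstance;

    private CachedDefaultSketch() {
    }

    // synchronized so that two threads cannot both create the default instance.
    static synchronized CachedDefaultSketch loadDefault() {
        if (defaultInstance == null) {
            defaultInstance = new CachedDefaultSketch();
        }
        return defaultInstance;
    }

    public static void main(String[] args) {
        // Both calls return the same object.
        System.out.println(loadDefault() == loadDefault()); // prints "true"
    }
}
```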
-
loadDefault
Creates a default Tika loader with no configuration file and a specific class loader. All components (parsers, detectors, etc.) are loaded through SPI.
- Parameters:
classLoader - the class loader to use for loading components
- Returns:
- the Tika loader
-
loadParsers
Loads and returns all parsers. Syntactic sugar for get(Parser.class). Results are cached; subsequent calls return the same instance.
- Returns:
- the parser (typically a CompositeParser internally)
- Throws:
TikaConfigException - if loading fails
-
loadDetectors
Loads and returns all detectors. Syntactic sugar for get(Detector.class). Results are cached; subsequent calls return the same instance.
- Returns:
- the detector (typically a CompositeDetector internally)
- Throws:
TikaConfigException - if loading fails
-
loadEncodingDetectors
Loads and returns all encoding detectors. Syntactic sugar for get(EncodingDetector.class). Results are cached; subsequent calls return the same instance.
- Returns:
- the encoding detector (typically a CompositeEncodingDetector internally)
- Throws:
TikaConfigException - if loading fails
-
loadMetadataFilters
Loads and returns all metadata filters. Syntactic sugar for get(MetadataFilter.class). Results are cached; subsequent calls return the same instance.
- Returns:
- the metadata filter (typically a CompositeMetadataFilter internally)
- Throws:
TikaConfigException - if loading fails
-
loadContentHandlerFactory
Loads and returns the content handler factory. If the config contains a "content-handler-factory" section, that factory is used. If the section is missing, a default BasicContentHandlerFactory with the MARKDOWN handler is returned. Results are cached; subsequent calls return the same instance.
Example JSON:
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "HTML",
      "writeLimit": 100000
    }
  }
}
- Returns:
- the content handler factory
- Throws:
TikaConfigException - if loading fails
-
loadRenderers
Loads and returns all renderers. Syntactic sugar for get(Renderer.class). Results are cached; subsequent calls return the same instance.
- Returns:
- the renderer (typically a CompositeRenderer internally)
- Throws:
TikaConfigException - if loading fails
-
loadTranslator
Loads and returns the translator. Syntactic sugar for get(Translator.class). Results are cached; subsequent calls return the same instance.
- Returns:
- the translator
- Throws:
TikaConfigException - if loading fails
-
loadAutoDetectParser
Loads and returns an AutoDetectParser configured with this loader's parsers and detectors. Results are cached; subsequent calls return the same instance.
- Returns:
- the auto-detect parser
- Throws:
TikaConfigException - if loading fails
IOException - if loading AutoDetectParserConfig fails
-
loadParseContext
Loads and returns a ParseContext populated with components from the "parse-context" section.
This method deserializes the parse-context JSON and resolves all component references using the component registry. Components are looked up by their friendly names (e.g., "embedded-limits", "pdf-parser-config") and deserialized to their appropriate types.
Use this method when you need a pre-configured ParseContext for parsing operations.
Example usage:
    TikaLoader loader = TikaLoader.load(configPath);
    Parser parser = loader.loadAutoDetectParser();
    ParseContext context = loader.loadParseContext();
    Metadata metadata = Metadata.newInstance(context);
    parser.parse(stream, handler, metadata, context);
- Returns:
- a ParseContext populated with configured components
- Throws:
TikaConfigException - if loading fails
-
loadConfig
Loads a configuration object from the "parse-context" section, merging with defaults.
This method is useful when you have a base configuration (e.g., from code defaults or a previous load) and want to overlay values from the JSON config. Properties not specified in the JSON retain their default values.
The original defaults object is NOT modified; a new instance is returned.
Example usage for PDFParserConfig:
    // Load base config from tika-config.json at init time
    TikaLoader loader = TikaLoader.load(configPath);
    PDFParserConfig baseConfig = loader.loadConfig(PDFParserConfig.class, new PDFParserConfig());

    // At runtime, create per-request overrides
    PDFParserConfig requestConfig = new PDFParserConfig();
    requestConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);

    // Merge: base config values + request overrides
    // (Note: for runtime merging, use JsonMergeUtils directly or loadConfig on a runtime loader)
- Type Parameters:
T - the configuration type
- Parameters:
clazz - the class to deserialize into
defaults - the default values to use for properties not in the JSON config
- Returns:
- a new instance with defaults merged with JSON config, or the original defaults if not configured
- Throws:
TikaConfigException - if loading fails
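The merge semantics described for loadConfig (JSON values overlay the defaults, unspecified properties keep their defaults, and the defaults object is left untouched) can be sketched with plain maps. This is an illustration under stated assumptions rather than Tika's implementation: `MergeDefaultsSketch` and `merge` are hypothetical names, and real configuration objects are typed classes rather than maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "defaults overlaid by configured values".
final class MergeDefaultsSketch {

    // Returns a new map: defaults first, then overrides on top.
    // Neither argument is modified.
    static Map<String, Object> merge(Map<String, Object> defaults,
                                     Map<String, Object> overrides) {
        Map<String, Object> merged = new LinkedHashMap<>(defaults);
        merged.putAll(overrides);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("ocrStrategy", "AUTO");
        defaults.put("extractInlineImages", false);

        // Only one property is set in the (hypothetical) JSON config.
        Map<String, Object> fromJson = new LinkedHashMap<>();
        fromJson.put("extractInlineImages", true);

        Map<String, Object> merged = merge(defaults, fromJson);
        System.out.println(merged);   // {ocrStrategy=AUTO, extractInlineImages=true}
        System.out.println(defaults); // {ocrStrategy=AUTO, extractInlineImages=false}
    }
}
```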
-
loadConfig
Loads a configuration object from the "parse-context" section by explicit key, merging with defaults.
This method is useful when the JSON key doesn't match the class name's kebab-case conversion, or when you want to load from a specific key.
- Type Parameters:
T - the configuration type
- Parameters:
key - the JSON key in the "parse-context" section
clazz - the class to deserialize into
defaults - the default values to use for properties not in the JSON config
- Returns:
- a new instance with defaults merged with JSON config, or the original defaults if not configured
- Throws:
TikaConfigException - if loading fails
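For context on the "kebab-case conversion" mentioned above, here is one plausible way a class name such as PDFParserConfig could map to a key such as "pdf-parser-config". The exact rules Tika applies are not stated in this documentation, so treat this as a sketch under that assumption; `KebabCaseSketch` and `toKebabCase` are hypothetical names.

```java
// Hypothetical illustration; Tika's own conversion rules may differ.
final class KebabCaseSketch {

    // Inserts a hyphen before an upper-case letter that either follows a
    // lower-case letter or digit, or ends a run of capitals (an acronym),
    // then lower-cases everything.
    static String toKebabCase(String simpleName) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < simpleName.length(); i++) {
            char c = simpleName.charAt(i);
            if (Character.isUpperCase(c) && i > 0) {
                char prev = simpleName.charAt(i - 1);
                boolean afterLowerOrDigit =
                        Character.isLowerCase(prev) || Character.isDigit(prev);
                boolean endOfAcronym = Character.isUpperCase(prev)
                        && i + 1 < simpleName.length()
                        && Character.isLowerCase(simpleName.charAt(i + 1));
                if (afterLowerOrDigit || endOfAcronym) {
                    out.append('-');
                }
            }
            out.append(Character.toLowerCase(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toKebabCase("PDFParserConfig")); // pdf-parser-config
        System.out.println(toKebabCase("EmbeddedLimits"));  // embedded-limits
    }
}
```

When a class name does not convert to the key used in the JSON, the explicit-key overload avoids relying on any such conversion.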
-
getConfig
Gets the underlying JSON configuration.
- Returns:
- the JSON configuration
-
getClassLoader
Gets the class loader used for loading components.
- Returns:
- the class loader
-
getMediaTypeRegistry
Gets the media type registry. Lazily loads the default registry if not already set. This is a static singleton shared across all TikaLoader instances.
- Returns:
- the media type registry
-
getMimeTypes
-
loadGlobalSettings
Loads global configuration settings from the JSON config. These settings are applied to Tika's static configuration when loaded.
Settings include:
- metadata-list - Jackson StreamReadConstraints for JsonMetadata/JsonMetadataList serialization
- service-loader - Service loader configuration
- xml-reader-utils - XML parser security settings
Example JSON:
{
  "metadata-list": {
    "maxStringLength": 50000000,
    "maxNestingDepth": 10,
    "maxNumberLength": 500
  },
  "xml-reader-utils": {
    "maxEntityExpansions": 1000,
    "maxNumReuses": 100,
    "poolSize": 10
  }
}
- Returns:
- the global settings, or an empty object if no settings are configured
- Throws:
TikaConfigException - if loading fails
IOException
-
getGlobalSettings
Gets the global settings if they have been loaded.
- Returns:
- the global settings, or null if not yet loaded
-
get
Gets a component by its class type. Components are loaded lazily and cached.
- Parameters:
componentClass - the component class (e.g., Parser.class, Detector.class)
- Returns:
- the loaded component
- Throws:
TikaConfigException - if loading fails
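"Loaded lazily and cached" means the component is built on the first request and the same instance is handed back afterwards. A minimal sketch of that pattern, keyed by class, follows. It uses only the JDK; `LazyComponentCacheSketch` and its `register` method are hypothetical and do not describe TikaLoader's internals.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical illustration of a class-keyed, lazily populated component cache.
final class LazyComponentCacheSketch {

    private final Map<Class<?>, Supplier<?>> factories = new ConcurrentHashMap<>();
    private final Map<Class<?>, Object> cache = new ConcurrentHashMap<>();

    // Registers how to build a component; nothing is built yet.
    <T> void register(Class<T> type, Supplier<? extends T> factory) {
        factories.put(type, factory);
    }

    // Builds the component on first use, then returns the cached instance.
    <T> T get(Class<T> type) {
        Object component = cache.computeIfAbsent(type, key -> {
            Supplier<?> factory = factories.get(key);
            if (factory == null) {
                throw new IllegalStateException(
                        "No factory registered for " + key.getName());
            }
            return factory.get();
        });
        return type.cast(component);
    }

    public static void main(String[] args) {
        LazyComponentCacheSketch loader = new LazyComponentCacheSketch();
        loader.register(StringBuilder.class, StringBuilder::new);
        StringBuilder first = loader.get(StringBuilder.class);
        StringBuilder second = loader.get(StringBuilder.class);
        System.out.println(first == second); // prints "true"
    }
}
```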
-
get
Gets a component by its JSON field name. Components are loaded lazily and cached.
- Parameters:
jsonField - the JSON field name (e.g., "parsers", "detectors")
- Returns:
- the loaded component
- Throws:
TikaConfigException - if loading fails
-
save
Saves the current configuration to a JSON file (pretty-printed).
- Throws:
IOException
-
save
Saves the current configuration to an output stream (pretty-printed).
- Throws:
IOException
-
toJson
Converts the current configuration to a JSON string (pretty-printed).
- Throws:
IOException
-