Class TikaLoader

java.lang.Object
org.apache.tika.config.loader.TikaLoader

public class TikaLoader extends Object
Main entry point for loading Tika components from JSON configuration. Component types are loaded lazily: classes are loaded only when they are first requested.

Usage:

 TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
 Parser parser = loader.loadParsers();
 Detector detector = loader.loadDetectors();
 

JSON configuration format:

 {
   "parsers": [
     {
       "pdf-parser": {
         "_mime-include": ["application/pdf"],
         "_mime-exclude": ["application/pdf+fdf"],
         "ocrStrategy": "AUTO",
         "extractInlineImages": true
       }
     }
   ],
   "detectors": [
     { "mime-magic-detector": { ... } }
   ]
 }
 
  • Method Details

    • load

      public static TikaLoader load(Path configPath) throws TikaConfigException, IOException
      Loads a Tika configuration from a file. Global settings are automatically loaded and applied during initialization.
      Parameters:
      configPath - the path to the JSON configuration file
      Returns:
      the Tika loader
      Throws:
      TikaConfigException - if loading or parsing fails
      IOException - if the configuration file cannot be read
    • load

      public static TikaLoader load(Path configPath, ClassLoader classLoader) throws TikaConfigException, IOException
      Loads a Tika configuration from a file with a specific class loader. Global settings are automatically loaded and applied during initialization.
      Parameters:
      configPath - the path to the JSON configuration file
      classLoader - the class loader to use for loading components
      Returns:
      the Tika loader
      Throws:
      TikaConfigException - if loading or parsing fails
      IOException - if the configuration file cannot be read
    • loadDefault

      public static TikaLoader loadDefault()
      Creates a default Tika loader with no configuration file. All components (parsers, detectors, etc.) will be loaded from SPI. Returns a cached instance if already created.
      Returns:
      the Tika loader
    • loadDefault

      public static TikaLoader loadDefault(ClassLoader classLoader)
      Creates a default Tika loader with no configuration file and a specific class loader. All components (parsers, detectors, etc.) will be loaded from SPI.
      Parameters:
      classLoader - the class loader to use for loading components
      Returns:
      the Tika loader
    • loadParsers

      public Parser loadParsers() throws TikaConfigException
      Loads and returns all parsers. Syntactic sugar for get(Parser.class). Results are cached - subsequent calls return the same instance.
      Returns:
      the parser (typically a CompositeParser internally)
      Throws:
      TikaConfigException - if loading fails
    • loadDetectors

      public Detector loadDetectors() throws TikaConfigException
      Loads and returns all detectors. Syntactic sugar for get(Detector.class). Results are cached - subsequent calls return the same instance.
      Returns:
      the detector (typically a CompositeDetector internally)
      Throws:
      TikaConfigException - if loading fails
    • loadEncodingDetectors

      public EncodingDetector loadEncodingDetectors() throws TikaConfigException
      Loads and returns all encoding detectors. Syntactic sugar for get(EncodingDetector.class). Results are cached - subsequent calls return the same instance.
      Returns:
      the encoding detector (typically a CompositeEncodingDetector internally)
      Throws:
      TikaConfigException - if loading fails
    • loadMetadataFilters

      public MetadataFilter loadMetadataFilters() throws TikaConfigException
      Loads and returns all metadata filters. Syntactic sugar for get(MetadataFilter.class). Results are cached - subsequent calls return the same instance.
      Returns:
      the metadata filter (typically a CompositeMetadataFilter internally)
      Throws:
      TikaConfigException - if loading fails
    • loadContentHandlerFactory

      public ContentHandlerFactory loadContentHandlerFactory() throws TikaConfigException
      Loads and returns the content handler factory. If a "content-handler-factory" section exists in the config, that factory is used. If the section is missing, a default BasicContentHandlerFactory with the MARKDOWN handler is returned. Results are cached - subsequent calls return the same instance.

      Example JSON:

       {
         "content-handler-factory": {
           "basic-content-handler-factory": {
             "type": "HTML",
             "writeLimit": 100000
           }
         }
       }
       
      Returns:
      the content handler factory
      Throws:
      TikaConfigException - if loading fails
    • loadRenderers

      public Renderer loadRenderers() throws TikaConfigException
      Loads and returns all renderers. Syntactic sugar for get(Renderer.class). Results are cached - subsequent calls return the same instance.
      Returns:
      the renderer (typically a CompositeRenderer internally)
      Throws:
      TikaConfigException - if loading fails
    • loadTranslator

      public Translator loadTranslator() throws TikaConfigException
      Loads and returns the translator. Syntactic sugar for get(Translator.class). Results are cached - subsequent calls return the same instance.
      Returns:
      the translator
      Throws:
      TikaConfigException - if loading fails
    • loadAutoDetectParser

      public Parser loadAutoDetectParser() throws TikaConfigException, IOException
      Loads and returns an AutoDetectParser configured with this loader's parsers and detectors. Results are cached - subsequent calls return the same instance.
      Returns:
      the auto-detect parser
      Throws:
      TikaConfigException - if loading fails
      IOException - if loading AutoDetectParserConfig fails
    • loadParseContext

      public ParseContext loadParseContext() throws TikaConfigException
      Loads and returns a ParseContext populated with components from the "parse-context" section.

      This method deserializes the parse-context JSON and resolves all component references using the component registry. Components are looked up by their friendly names (e.g., "embedded-limits", "pdf-parser-config") and deserialized to their appropriate types.

      Use this method when you need a pre-configured ParseContext for parsing operations.

      Example usage:

       // configPath, stream, and handler are assumed to be supplied by the caller
       TikaLoader loader = TikaLoader.load(configPath);
       Parser parser = loader.loadAutoDetectParser();
       ParseContext context = loader.loadParseContext();
       Metadata metadata = Metadata.newInstance(context);
       parser.parse(stream, handler, metadata, context);
       
      Returns:
      a ParseContext populated with configured components
      Throws:
      TikaConfigException - if loading fails
    • loadConfig

      public <T> T loadConfig(Class<T> clazz, T defaults) throws TikaConfigException
      Loads a configuration object from the "parse-context" section, merging with defaults.

      This method is useful when you have a base configuration (e.g., from code defaults or a previous load) and want to overlay values from the JSON config. Properties not specified in the JSON retain their default values.

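      The original defaults object is never modified. When the JSON config supplies values, a new instance is returned; otherwise the original defaults object is returned unchanged.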

      Example usage for PDFParserConfig:

       // Load base config from tika-config.json at init time
       TikaLoader loader = TikaLoader.load(configPath);
       PDFParserConfig baseConfig = loader.loadConfig(PDFParserConfig.class, new PDFParserConfig());
      
       // At runtime, create per-request overrides
       PDFParserConfig requestConfig = new PDFParserConfig();
       requestConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
      
       // Merge: base config values + request overrides
       // (Note: for runtime merging, use JsonMergeUtils directly or loadConfig on a runtime loader)
       
      Type Parameters:
      T - the configuration type
      Parameters:
      clazz - the class to deserialize into
      defaults - the default values to use for properties not in the JSON config
      Returns:
      a new instance with defaults merged with JSON config, or the original defaults if not configured
      Throws:
      TikaConfigException - if loading fails
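      The overlay behaviour described above can be pictured with plain maps. This is a minimal sketch, not Tika's implementation: the class DefaultsMergeSketch and its merge method are hypothetical, and the keys are borrowed from the configuration example at the top of this page purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not how Tika merges configuration internally.
final class DefaultsMergeSketch {

    // Returns a new map holding the defaults overlaid with the configured values.
    // If nothing is configured, the original defaults map is returned unchanged.
    static Map<String, Object> merge(Map<String, Object> defaults, Map<String, Object> configured) {
        if (configured == null || configured.isEmpty()) {
            return defaults;
        }
        Map<String, Object> merged = new LinkedHashMap<>(defaults); // copy, so defaults stay untouched
        merged.putAll(configured);                                   // configured values win
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("ocrStrategy", "AUTO");
        defaults.put("extractInlineImages", false);

        // Stand-in for the values found in the JSON config
        Map<String, Object> fromJson = Map.of("extractInlineImages", true);

        Map<String, Object> merged = merge(defaults, fromJson);
        System.out.println(merged.get("ocrStrategy"));           // kept from the defaults
        System.out.println(merged.get("extractInlineImages"));   // overridden by the config
        System.out.println(defaults.get("extractInlineImages")); // defaults are not modified
    }
}
```

      Copying before overlaying is what keeps the caller's defaults object safe to reuse across requests.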
    • loadConfig

      public <T> T loadConfig(String key, Class<T> clazz, T defaults) throws TikaConfigException
      Loads a configuration object from the "parse-context" section by explicit key, merging with defaults.

      This method is useful when the JSON key doesn't match the class name's kebab-case conversion, or when you want to load from a specific key.

      Type Parameters:
      T - the configuration type
      Parameters:
      key - the JSON key in the "parse-context" section
      clazz - the class to deserialize into
      defaults - the default values to use for properties not in the JSON config
      Returns:
      a new instance with defaults merged with JSON config, or the original defaults if not configured
      Throws:
      TikaConfigException - if loading fails
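      To make the kebab-case mapping concrete, the sketch below derives a key such as "pdf-parser-config" from a class name such as PDFParserConfig. The helper KebabCaseSketch is hypothetical, and the exact conversion rules Tika applies are an assumption here; when the derived key would not match the key in your JSON, pass the key explicitly.

```java
import java.util.Locale;

// Hypothetical helper; the exact conversion Tika applies is an assumption.
final class KebabCaseSketch {

    // Converts a simple class name such as "PDFParserConfig" to "pdf-parser-config".
    static String toKebabCase(String simpleName) {
        return simpleName
                // split an acronym from a following capitalized word: "PDFParser" -> "PDF-Parser"
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                // split a lower-case letter or digit from a following capital: "rC" -> "r-C"
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toKebabCase("PDFParserConfig")); // prints "pdf-parser-config"
        System.out.println(toKebabCase("EmbeddedLimits"));  // prints "embedded-limits"
    }
}
```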
    • getConfig

      public TikaJsonConfig getConfig()
      Gets the underlying JSON configuration.
      Returns:
      the JSON configuration
    • getClassLoader

      public ClassLoader getClassLoader()
      Gets the class loader used for loading components.
      Returns:
      the class loader
    • getMediaTypeRegistry

      public static MediaTypeRegistry getMediaTypeRegistry()
      Gets the media type registry. Lazily loads the default registry if not already set. This is a static singleton shared across all TikaLoader instances.
      Returns:
      the media type registry
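      The "lazily loaded static singleton" pattern mentioned above can be sketched as follows. SharedRegistrySketch is a hypothetical stand-in, with Object in place of the real registry type; it illustrates the general pattern and is not Tika's implementation.

```java
// Hypothetical illustration of a lazily initialized value shared by all callers.
final class SharedRegistrySketch {

    private static Object registry; // Object stands in for the real registry type

    // Creates the default on first access; every later call returns the same object.
    static synchronized Object getRegistry() {
        if (registry == null) {
            registry = new Object(); // placeholder for loading the default registry
        }
        return registry;
    }

    public static void main(String[] args) {
        Object first = getRegistry();
        Object second = getRegistry();
        System.out.println(first == second); // prints "true"
    }
}
```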
    • getMimeTypes

      public static MimeTypes getMimeTypes()
    • loadGlobalSettings

      public GlobalSettings loadGlobalSettings() throws IOException, TikaConfigException
      Loads global configuration settings from the JSON config. These settings are applied to Tika's static configuration when loaded.

      Settings include:

      • metadata-list - Jackson StreamReadConstraints for JsonMetadata/JsonMetadataList serialization
      • service-loader - Service loader configuration
      • xml-reader-utils - XML parser security settings

      Example JSON:

       {
         "metadata-list": {
           "maxStringLength": 50000000,
           "maxNestingDepth": 10,
           "maxNumberLength": 500
         },
         "xml-reader-utils": {
           "maxEntityExpansions": 1000,
           "maxNumReuses": 100,
           "poolSize": 10
         }
       }
       
      Returns:
      the global settings, or an empty object if no settings are configured
      Throws:
      TikaConfigException - if loading fails
      IOException - if an I/O error occurs while loading the settings
    • getGlobalSettings

      public GlobalSettings getGlobalSettings()
      Gets the global settings if they have been loaded.
      Returns:
      the global settings, or null if not yet loaded
    • get

      public <T> T get(Class<T> componentClass) throws TikaConfigException
      Gets a component by its class type. Components are loaded lazily and cached.
      Parameters:
      componentClass - the component class (e.g., Parser.class, Detector.class)
      Returns:
      the loaded component
      Throws:
      TikaConfigException - if loading fails
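      "Loaded lazily and cached" can be illustrated with a small class-keyed cache. This is a sketch under assumptions: LazyComponentCacheSketch and its factory parameter are hypothetical and only model the behaviour described above, not the way TikaLoader resolves components.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical illustration of lazy, cached lookup by class; not Tika's implementation.
final class LazyComponentCacheSketch {

    private final Map<Class<?>, Object> cache = new ConcurrentHashMap<>();

    // Builds the component on the first request, then returns the cached instance.
    <T> T get(Class<T> type, Supplier<? extends T> factory) {
        Object component = cache.computeIfAbsent(type, key -> factory.get());
        return type.cast(component);
    }

    public static void main(String[] args) {
        LazyComponentCacheSketch cache = new LazyComponentCacheSketch();
        StringBuilder first = cache.get(StringBuilder.class, StringBuilder::new);
        StringBuilder second = cache.get(StringBuilder.class, StringBuilder::new);
        System.out.println(first == second); // prints "true": the second call hits the cache
    }
}
```

      With this shape, nothing is constructed for component types that are never requested, which mirrors the lazy behaviour documented for this method.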
    • get

      public <T> T get(String jsonField) throws TikaConfigException
      Gets a component by its JSON field name. Components are loaded lazily and cached.
      Parameters:
      jsonField - the JSON field name (e.g., "parsers", "detectors")
      Returns:
      the loaded component
      Throws:
      TikaConfigException - if loading fails
    • save

      public void save(File file) throws IOException
      Saves the current configuration to a JSON file (pretty-printed).
      Parameters:
      file - the file to write the JSON configuration to
      Throws:
      IOException - if writing to the file fails
    • save

      public void save(OutputStream outputStream) throws IOException
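      Saves the current configuration to an output stream (pretty-printed).
      Parameters:
      outputStream - the stream to write the JSON configuration to
      Throws:
      IOException - if writing to the stream fails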
    • toJson

      public String toJson() throws IOException
      Converts the current configuration to a JSON string (pretty-printed).
      Returns:
      the current configuration as a pretty-printed JSON string
      Throws:
      IOException - if serialization fails