Configuration

This section covers configuring Apache Tika.

Overview

Tika 4.x uses JSON configuration files. Configuration controls parsers, detectors, content handlers, server behavior, and the Tika Pipes pipeline.

Tika 3.x and earlier used XML configuration (tika-config.xml). See the Migration Guide for details on converting to JSON.

Top-level JSON structure

A tika-config.json file is a single JSON object whose keys are the top-level sections shown below. Every section is optional: omit any you don’t need, and defaults are used for it. The /* ... */ comments in the skeleton are placeholders that describe what each section holds, not literal values.

{
  "parsers": [ /* parser declarations */ ],
  "detectors": [ /* detector declarations */ ],
  "encoding-detectors": [ /* encoding detector declarations */ ],
  "content-handler-factory": { /* handler type for emitted content */ },
  "parse-context": {
    "timeout-limits": { /* progress + total task timeouts */ },
    "unpack-config": { /* embedded-byte extraction */ }
    /* other SelfConfiguring components by component name */
  },
  "server": { /* tika-server options: enableUnsecureFeatures, cors, ... */ },
  "pipes": { /* Pipes process management: numClients, parseMode, ... */ },
  "fetchers": { /* named fetcher instances */ },
  "emitters": { /* named emitter instances */ },
  "pipes-iterator": { /* iterator (one per pipeline) */ },
  "pipes-reporters": { /* per-document status reporters */ },
  "plugin-roots": "/path/to/plugins"
}

Per-section documentation:

  • parsers, detectors, encoding-detectors, content-handler-factory, parse-context — covered below under Topics.

  • server — see Tika Server.

  • pipes, fetchers, emitters, pipes-iterator, pipes-reporters, plugin-roots — see Pipes Configuration and Tika Pipes.
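Because every section is optional, a misspelled top-level key can be easy to miss. The following is a minimal sketch of a standalone pre-flight check, not part of Tika itself. It assumes the set of section names is exactly the one shown in the skeleton above; the helper name `unknown_sections` is hypothetical, and Tika may accept keys that are not listed here.

```python
import json

# Top-level section names copied from the skeleton above. Treating this set
# as complete is an assumption of this sketch, not a schema published by Tika.
KNOWN_SECTIONS = {
    "parsers", "detectors", "encoding-detectors", "content-handler-factory",
    "parse-context", "server", "pipes", "fetchers", "emitters",
    "pipes-iterator", "pipes-reporters", "plugin-roots",
}


def unknown_sections(config_text):
    """Return the sorted top-level keys that are not in KNOWN_SECTIONS."""
    config = json.loads(config_text)
    if not isinstance(config, dict):
        raise ValueError("tika-config.json must be a single JSON object")
    return sorted(set(config) - KNOWN_SECTIONS)


# "fetcher" (singular) is a deliberate typo for "fetchers".
example = '{"parsers": [], "fetcher": {}, "plugin-roots": "/path/to/plugins"}'
print(unknown_sections(example))  # ['fetcher']
```

The check uses Python's standard `json` module, which rejects `/* ... */` comments, so it works on a real configuration file and not on the annotated skeleton.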

Topics

Parser Configuration

Other Configuration