Configuring Encoding Detectors

Table of Contents

Default Detection Chain
Available Detectors
Configuration Examples

Tika uses a chain of encoding detectors to determine the character encoding of plain text and HTML content. DefaultEncodingDetector discovers detectors via the Java service-provider interface (SPI, META-INF/services).

The chain runs in one of two modes:

collect-all — when a MetaEncodingDetector is present (the 4.x default includes one), every base detector runs and contributes candidate encodings, then the meta detector picks the best one by decode quality. Registration order does not matter.
first-match-wins — otherwise, detectors run in registration order and the first non-null result is used.

Default Detection Chain

The stock 4.x distribution registers five detectors:

Detector	Module	Role
`bom-detector`	`tika-core`	Emits a candidate from a leading byte-order mark.
`metadata-charset-detector`	`tika-core`	Emits a candidate from declarative hints (`Content-Type` charset, `Content-Encoding`) in the `Metadata` object.
`html-encoding-detector`	`tika-encoding-detector-html`	Emits a candidate from an HTML `<meta charset>` / `http-equiv` tag (lenient regex over a curated subset of WHATWG label aliases).
`mojibuster-encoding-detector`	`tika-encoding-detector-mojibuster`	Byte-bigram Naive Bayes classifier plus structural detectors for UTF-32 and UTF-16 and a UTF-8 grammar gate.
`junk-filter-encoding-detector`	`tika-ml-junkdetect`	`MetaEncodingDetector` that picks among the other detectors' candidates by script-aware decode quality. Always runs last.

Because junk-filter-encoding-detector is a MetaEncodingDetector, the chain runs collect-all: detector order is irrelevant, and a declaration (a BOM or a <meta charset> tag) does not automatically win. The junk filter will override a declaration — or even a BOM — when the byte evidence strongly contradicts it.

This is a behaviour change from 3.x, whose default chain was html / universal / icu4j with first-match-wins (a declaration always won). universal-encoding-detector and icu4j-encoding-detector are no longer in the default distribution; see Restore the 3.x detection chain (universal + icu4j).

Available Detectors

All detectors implement org.apache.tika.detect.EncodingDetector and can be referenced by their SPI name in JSON configuration.

Name	Module	Description
`bom-detector`	`tika-core`	Reads a leading byte-order mark. In the default chain.
`metadata-charset-detector`	`tika-core`	Reads declarative hints (`Content-Type` charset, `Content-Encoding`) from the `Metadata` object. In the default chain.
`html-encoding-detector`	`tika-encoding-detector-html`	Fast lenient regex matcher for `<meta charset>` / `http-equiv` tags. In the default chain.
`mojibuster-encoding-detector`	`tika-encoding-detector-mojibuster`	Byte-bigram Naive Bayes classifier with structural UTF-32/UTF-16 detectors and a UTF-8 grammar gate. In the default chain.
`junk-filter-encoding-detector`	`tika-ml-junkdetect`	Text-quality arbitrator (`MetaEncodingDetector`). In the default chain; runs last.
`standard-html-encoding-detector`	`tika-encoding-detector-html`	Spec-strict WHATWG prescan algorithm. Not in the default chain — opt in if you need strict WHATWG tokenisation (e.g. ignoring charset declarations inside HTML comments).
`universal-encoding-detector`	`tika-encoding-detector-universal`	State-machine structural prober (juniversalchardet fork). Not bundled and not auto-discovered; add the jar and configure it explicitly to use it.
`icu4j-encoding-detector`	`tika-encoding-detector-icu4j`	Wraps ICU4J’s `CharsetDetector`. Not bundled and not auto-discovered; add the jar and configure it explicitly to use it.

Configuration Examples

Exclude a detector from the default chain

Use default-encoding-detector with an exclude list to drop one or more auto-registered detectors:

{
  "encoding-detectors": [
    {
      "default-encoding-detector": {
        "exclude": ["html-encoding-detector"]
      }
    }
  ]
}
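
Because exclude is a list, it can name more than one detector. The following is a minimal sketch, assuming you want to drop both declaration-reading detectors (for example, because declared charsets in your corpus are unreliable); the choice of detectors is illustrative only:

```json
{
  "encoding-detectors": [
    {
      "default-encoding-detector": {
        "exclude": ["html-encoding-detector", "metadata-charset-detector"]
      }
    }
  ]
}
```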

Do not combine default-encoding-detector with other explicit detector entries in the same list. When combined, the loader wraps everything in an outer composite that has no MetaEncodingDetector at its top level, so collect-all arbitration is silently lost and the explicit detectors are never reached. Use an explicit chain (see below) when you need to configure individual detectors.
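
For illustration, the kind of configuration this warning rules out looks roughly like the sketch below. The empty options object for default-encoding-detector is an assumption made for the example. Do not use this shape:

```json
{
  "encoding-detectors": [
    {"default-encoding-detector": {}},
    {"html-encoding-detector": {"markLimit": 131072}}
  ]
}
```

Here the second entry sits next to default-encoding-detector in the same list, which is exactly the combination described above.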

Specify the chain explicitly

To replace the SPI-discovered chain, provide an explicit ordered list. Include junk-filter-encoding-detector (last) to keep collect-all arbitration; omit it for first-match-wins:

{
  "encoding-detectors": [
    {"html-encoding-detector": {}},
    {"mojibuster-encoding-detector": {}},
    {"junk-filter-encoding-detector": {}}
  ]
}
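
As a counterpart, a first-match-wins variant might look like the sketch below. It assumes the same two base detectors and simply omits junk-filter-encoding-detector, so the list order now decides which result is used:

```json
{
  "encoding-detectors": [
    {"html-encoding-detector": {}},
    {"mojibuster-encoding-detector": {}}
  ]
}
```

With this ordering, a charset found by html-encoding-detector is used as-is, and mojibuster-encoding-detector is consulted only when the HTML detector returns no result.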

Configure the HTML detector’s read limit

html-encoding-detector reads up to 65,536 bytes by default when scanning for the <meta charset> tag. Raise this limit if your documents embed large <script> blocks before the meta tag (TIKA-2485). (mojibuster-encoding-detector reads a larger content probe, so in the default chain this limit matters mainly for very large preambles.)

To configure markLimit, specify the full chain explicitly. An explicit list that includes junk-filter-encoding-detector keeps collect-all arbitration; the configured html-encoding-detector participates as a base detector alongside Mojibuster, and the junk filter arbitrates as usual:

{
  "encoding-detectors": [
    {"html-encoding-detector": {"markLimit": 131072}},
    {"mojibuster-encoding-detector": {}},
    {"junk-filter-encoding-detector": {}}
  ]
}

Use the spec-strict WHATWG HTML detector

If your input HTML has charset declarations inside comments (or other contexts where the lenient regex would false-match), opt in to the spec-strict prescan:

{
  "encoding-detectors": [
    {"standard-html-encoding-detector": {}},
    {"mojibuster-encoding-detector": {}},
    {"junk-filter-encoding-detector": {}}
  ]
}

Restore the 3.x detection chain (universal + icu4j)

The 4.x default no longer bundles or auto-registers universal-encoding-detector and icu4j-encoding-detector. To get the legacy 3.x behaviour (html / universal / icu4j, first-match-wins) you must do both:

Add the jars to the classpath. They are no longer in the tika-app / tika-server-standard packages, so supply tika-encoding-detector-universal and tika-encoding-detector-icu4j yourself (for example via -Dtika.extras.dir — see the configuration overview).
Configure the chain explicitly. An explicit chain with no MetaEncodingDetector runs first-match-wins:

{
  "encoding-detectors": [
    {"html-encoding-detector": {}},
    {"universal-encoding-detector": {}},
    {"icu4j-encoding-detector": {}}
  ]
}

Dropping the jars on the classpath alone is not enough: unlike the other detectors, these two are config-only and are not auto-discovered via SPI.