Configuring Encoding Detectors
Tika uses a chain of encoding detectors to determine the character encoding
of plain text and HTML content. DefaultEncodingDetector loads detectors via
the Java service-provider interface (SPI) and runs them in registration order;
the first non-null result wins.
The default chain is html-encoding-detector, universal-encoding-detector,
and icu4j-encoding-detector.
Default Detection Chain
With the stock dependencies on the classpath (the modules
tika-encoding-detector-html, tika-encoding-detector-universal, and
tika-encoding-detector-icu4j):
| Step | Detector | Returns non-null when… |
|---|---|---|
| 1 | html-encoding-detector | An HTML … |
| 2 | universal-encoding-detector | A state-machine structural prober (juniversalchardet fork) recognises the byte pattern as a known encoding (UTF-8, GB18030, Big5, EUC-JP, several ISO-8859 variants, etc.). |
| 3 | icu4j-encoding-detector | ICU4J’s … |
The chain is permissive: the first detector to return a result wins, so a
declared charset (e.g. from a <meta charset> tag) takes precedence over the
later structural and statistical detectors.
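The first-non-null-wins rule can be sketched in a few lines of Java. This is a minimal illustration, not Tika's implementation: the class FirstMatchChainSketch, the two toy detectors, and the use of Function<byte[], Charset> in place of org.apache.tika.detect.EncodingDetector are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a detector chain; not Tika's EncodingDetector API.
class FirstMatchChainSketch {

    // Runs the detectors in list order and returns the first non-null result,
    // or null if every detector abstains.
    static Charset detect(List<Function<byte[], Charset>> chain, byte[] input) {
        for (Function<byte[], Charset> detector : chain) {
            Charset result = detector.apply(input);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    // Toy "declared charset" detector: recognises only a literal UTF-8 meta tag.
    static Charset declared(byte[] input) {
        String head = new String(input, StandardCharsets.ISO_8859_1);
        return head.contains("<meta charset=\"utf-8\">") ? StandardCharsets.UTF_8 : null;
    }

    // Toy fallback that always answers, standing in for a later detector.
    static Charset fallback(byte[] input) {
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        List<Function<byte[], Charset>> chain =
                Arrays.asList(FirstMatchChainSketch::declared, FirstMatchChainSketch::fallback);
        byte[] html = "<meta charset=\"utf-8\"><p>hi</p>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] text = "plain text".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(chain, html)); // declared charset wins
        System.out.println(detect(chain, text)); // falls through to the fallback
    }
}
```

Because the loop returns on the first non-null answer, reordering the list changes the outcome, which is why the order of entries in an explicit configuration matters.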
Available Detectors
All detectors implement org.apache.tika.detect.EncodingDetector and can be
referenced by their SPI name in JSON configuration.
| Name | Module | Description |
|---|---|---|
| html-encoding-detector | tika-encoding-detector-html | Fast lenient regex matcher for … |
| universal-encoding-detector | tika-encoding-detector-universal | State-machine structural prober (juniversalchardet fork). Auto-registered (in default chain). |
| icu4j-encoding-detector | tika-encoding-detector-icu4j | Wraps ICU4J’s … |
| standard-html-encoding-detector | | Spec-strict WHATWG prescan algorithm. Not in the default chain; opt in explicitly if you need strict WHATWG tokenisation (e.g. ignoring charset declarations inside HTML comments or other contexts the lenient regex may match). |
| mojibuster-encoding-detector | tika-encoding-detector-mojibuster | Byte-bigram Naive Bayes classifier plus structural detectors for UTF-32 and UTF-16 and a UTF-8 grammar gate. Not in the default chain; opt in explicitly. |
| junk-filter-encoding-detector | | Text-quality arbitrator (… |
| | | Reads the first 4 bytes for BOM signatures. Helper component, used internally by … |
| | | Reads declarative hints (… |
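The BOM check mentioned in the table (reading the first 4 bytes for a signature) can be sketched as follows. The class BomSniffSketch and its sniff method are hypothetical names for illustration; this is not the helper's actual code, only the standard Unicode BOM byte patterns applied in an order that avoids ambiguity.

```java
// Illustrative BOM sniffer; class and method names are hypothetical.
class BomSniffSketch {

    // Returns the encoding name implied by a leading BOM, or null if none is present.
    static String sniff(byte[] data) {
        int[] b = new int[4];
        for (int i = 0; i < 4; i++) {
            // Missing bytes become -1 so they never match a signature byte.
            b[i] = i < data.length ? data[i] & 0xFF : -1;
        }
        // Check the 4-byte UTF-32 signatures first: FF FE 00 00 also starts
        // with the UTF-16LE signature FF FE.
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE";
        if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE";
        if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
        if (b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h'};
        System.out.println(sniff(utf8WithBom));
        System.out.println(sniff(new byte[] {'h', 'i'}));
    }
}
```

The sketch resolves the FF FE 00 00 ambiguity in favour of UTF-32LE; a UTF-16LE stream that begins with U+0000 after its BOM would be misread, which is a trade-off a real detector has to decide on its own terms.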
Configuration Examples
Exclude a detector from the default chain
Use default-encoding-detector with an exclude list to drop one or more
auto-registered detectors:
{
  "encoding-detectors": [
    {
      "default-encoding-detector": {
        "exclude": ["icu4j-encoding-detector"]
      }
    }
  ]
}
Specify the chain explicitly
To replace the SPI-discovered chain with an explicit ordered list:
{
  "encoding-detectors": [
    {"html-encoding-detector": {}},
    {"universal-encoding-detector": {}}
  ]
}
Configure the HTML detector’s read limit
html-encoding-detector reads up to 65 536 bytes by default when scanning
for the <meta charset> tag. Raise it if your documents embed large
<script> blocks before the meta tag (TIKA-2485):
{
  "encoding-detectors": [
    {
      "html-encoding-detector": {
        "markLimit": 131072
      }
    },
    {"universal-encoding-detector": {}},
    {"icu4j-encoding-detector": {}}
  ]
}
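To see why the read limit matters, the following Java sketch scans only the first markLimit bytes of a page for a charset declaration. The class MarkLimitSketch, its regex, and the helper pageWithLeadingScript are assumptions made for this example; they approximate the idea of a bounded lenient scan and are not the detector's real code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative bounded scan; not the html-encoding-detector implementation.
class MarkLimitSketch {

    // Lenient pattern for <meta ... charset=NAME>; an assumption for this sketch.
    static final Pattern META_CHARSET =
            Pattern.compile("(?i)<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_-]+)");

    // Looks for a declared charset in at most markLimit leading bytes.
    static String declaredCharset(byte[] content, int markLimit) {
        byte[] head = Arrays.copyOf(content, Math.min(content.length, markLimit));
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Builds a page whose <script> body of the given size precedes the meta tag.
    static byte[] pageWithLeadingScript(int scriptBytes) {
        StringBuilder sb = new StringBuilder("<html><head><script>");
        for (int i = 0; i < scriptBytes; i++) {
            sb.append(' ');
        }
        sb.append("</script><meta charset=\"utf-8\"></head></html>");
        return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] page = pageWithLeadingScript(70000);
        System.out.println(declaredCharset(page, 65536));  // meta tag lies past the limit
        System.out.println(declaredCharset(page, 131072)); // meta tag is within the limit
    }
}
```

With a 70 000-byte script ahead of the meta tag, the scan at the 65 536-byte limit finds nothing and the chain would fall through to the next detector; doubling the limit brings the declaration back into view.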
Use the spec-strict WHATWG HTML detector
If your input HTML has charset declarations inside comments (or other contexts where the lenient regex would false-match), opt in to the spec-strict prescan:
{
  "encoding-detectors": [
    {"standard-html-encoding-detector": {}},
    {"universal-encoding-detector": {}},
    {"icu4j-encoding-detector": {}}
  ]
}
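The kind of false match that motivates the strict detector can be shown with a small Java sketch. Both methods below are illustrative assumptions: lenient is a simple regex search, and commentAware merely strips HTML comments before searching. The real WHATWG prescan is a byte-level algorithm with more rules than comment skipping, so treat this as a demonstration of the difference in behaviour only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: neither method reproduces Tika's detectors or the full WHATWG prescan.
class CommentAwareSketch {

    // Lenient pattern for <meta ... charset=NAME>; an assumption for this sketch.
    static final Pattern META_CHARSET =
            Pattern.compile("(?i)<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_-]+)");

    // Matches an HTML comment, including ones that span several lines.
    static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");

    // Returns the first charset-looking declaration anywhere in the markup.
    static String lenient(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Drops comments first, so declarations inside them are ignored.
    static String commentAware(String html) {
        return lenient(COMMENT.matcher(html).replaceAll(""));
    }

    public static void main(String[] args) {
        String html = "<!-- <meta charset=\"shift_jis\"> -->\n<meta charset=\"utf-8\">";
        System.out.println(lenient(html));      // picks up the commented-out declaration
        System.out.println(commentAware(html)); // sees only the live declaration
    }
}
```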
Add the Mojibuster + JunkFilter chain (opt-in)
The byte-bigram NB classifier (mojibuster-encoding-detector) and the
text-quality arbitrator (junk-filter-encoding-detector) are available as
opt-in components. They require the tika-encoding-detector-mojibuster
and tika-ml-junkdetect modules on the classpath:
{
  "encoding-detectors": [
    {"html-encoding-detector": {}},
    {"mojibuster-encoding-detector": {}},
    {"junk-filter-encoding-detector": {}}
  ]
}
junk-filter-encoding-detector is a MetaEncodingDetector — it collects
candidates from the other detectors and picks the cleanest decoding via a
script-aware text-quality model. It must run last.
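The idea of arbitrating between candidates by decoding quality can be sketched in Java. The scoring here is a deliberately crude substitute for the script-aware text-quality model: it counts U+FFFD replacement characters and C1 control characters in each decoding. The class CandidateArbiterSketch and its methods are hypothetical and do not reflect the MetaEncodingDetector API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Crude stand-in for a text-quality arbitrator; names and scoring are assumptions.
class CandidateArbiterSketch {

    // Counts characters that usually indicate a wrong decoding:
    // U+FFFD replacement characters and C1 controls (U+0080..U+009F).
    static int junkScore(String decoded) {
        int junk = 0;
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            boolean replacement = c == '\uFFFD';
            boolean c1Control = c >= 0x80 && c <= 0x9F;
            if (replacement || c1Control) {
                junk++;
            }
        }
        return junk;
    }

    // Decodes the input with every candidate and keeps the lowest-scoring one;
    // on a tie the earlier candidate wins.
    static Charset pickCleanest(byte[] input, List<Charset> candidates) {
        Charset best = null;
        int bestScore = Integer.MAX_VALUE;
        for (Charset candidate : candidates) {
            int score = junkScore(new String(input, candidate));
            if (score < bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "a\u2014b".getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(pickCleanest(utf8Bytes, candidates));
    }
}
```

An arbitrator of this shape needs the candidates from the other detectors before it can compare them, which is consistent with the requirement that junk-filter-encoding-detector be listed last.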