Configuring Encoding Detectors
Tika uses a chain of encoding detectors to determine the character encoding
of plain text and HTML content. The chain is controlled by
DefaultEncodingDetector, which loads detectors via the Java service-provider
interface (SPI) and runs them in registration order.
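As a rough illustration of running detectors in order, the sketch below walks a list and returns the first non-null answer. It is a simplification under stated assumptions: SketchDetector, ChainSketch and detectFirst are hypothetical names invented for this example, not Tika's EncodingDetector API, and the real DefaultEncodingDetector may collect and arbitrate between several answers rather than stopping at the first one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in for a detector; NOT Tika's EncodingDetector interface.
interface SketchDetector {
    // Returns the detected charset, or null when this detector has no opinion.
    Charset detect(byte[] prefix);
}

class ChainSketch {
    // Runs the detectors in list order and returns the first non-null answer.
    static Charset detectFirst(List<SketchDetector> chain, byte[] prefix, Charset fallback) {
        for (SketchDetector detector : chain) {
            Charset charset = detector.detect(prefix);
            if (charset != null) {
                return charset;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Toy detector 1: recognises only the UTF-8 byte-order mark.
        SketchDetector utf8Bom = p -> (p.length >= 3
                && (p[0] & 0xFF) == 0xEF && (p[1] & 0xFF) == 0xBB && (p[2] & 0xFF) == 0xBF)
                ? StandardCharsets.UTF_8 : null;
        // Toy detector 2: claims US-ASCII when every byte is below 0x80.
        SketchDetector ascii = p -> {
            for (byte b : p) {
                if (b < 0) {
                    return null;
                }
            }
            return StandardCharsets.US_ASCII;
        };
        byte[] sample = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectFirst(List.of(utf8Bom, ascii), sample, StandardCharsets.ISO_8859_1));
    }
}
```

The order of the list matters: a cheap, high-certainty check placed early can short-circuit the more expensive ones behind it.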
Default Detection Chain
The default chain when tika-charset-detectors-core is on the classpath:
| Step | Detector | Returns non-null when… |
|---|---|---|
| 1 | | A |
| 2 | | A UTF-8, UTF-16 LE/BE, or UTF-32 LE/BE byte-order mark is present. |
| 3 | | An HTML |
| 4 | | The built-in statistical model classifies the byte stream (~46 encodings, ~185 KB model bundled as a resource). |
| 5 (if present) | | State-machine structural prober (juniversalchardet fork). Automatically joins the chain when |
| 6 (if present) | | A |
universal-encoding-detector and charsoup-encoding-detector are
supplied by separate optional modules (tika-charset-detectors-universal and
tika-langdetect-charsoup respectively). Each is loaded automatically via
SPI when its module is on the classpath and requires no extra configuration.
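The byte-order mark step is simple enough to sketch directly. The following is a minimal illustration of BOM sniffing for the encodings named in the table, not the code Tika ships; BomSketch and detectBom are hypothetical names, and treating FF FE 00 00 as UTF-32 LE rather than UTF-16 LE followed by U+0000 is an assumption of this sketch.

```java
class BomSketch {
    // Returns the encoding name implied by a leading byte-order mark, or null if none is present.
    static String detectBom(byte[] data) {
        // UTF-32 is checked before UTF-16 because FF FE 00 00 begins with the UTF-16 LE mark FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data, taken as unsigned values, with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
        System.out.println(detectBom(utf8WithBom));
    }
}
```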
Design Rationale
The chain combines two complementary detection strategies:
- Statistical (ML): learns byte-bigram distributions from training data. Works well for documents with enough varied content (roughly 100 bytes or more).
- Structural (Universal): applies encoding-spec constraints (is this a valid lead+trail byte pair for Shift_JIS / EUC-JP / Big5 / GBK?). Works on as few as two bytes and is unaffected by content length.
Rules beat statistics at the extremes (very short or highly structured input);
statistics beat rules in the ambiguous middle where distributions are rich.
charsoup-encoding-detector arbitrates when they disagree.
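To make the two strategies concrete, here is a small sketch of the raw ingredients each one works with: a lead+trail validity check for Shift_JIS on the structural side, and a byte-bigram count on the statistical side. StrategySketch and its methods are hypothetical names for illustration only. The byte ranges are the commonly documented ones for the Windows-31J superset of Shift_JIS, and nothing here reflects how Tika's detectors are actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

class StrategySketch {
    // Structural view: is (lead, trail) a well-formed Shift_JIS double-byte pair?
    // Both arguments are unsigned byte values in the range 0..255.
    static boolean isShiftJisPair(int lead, int trail) {
        boolean leadOk = (lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xFC);
        boolean trailOk = (trail >= 0x40 && trail <= 0x7E) || (trail >= 0x80 && trail <= 0xFC);
        return leadOk && trailOk;
    }

    // Statistical view: count adjacent byte pairs, keyed as (first << 8) | second.
    static Map<Integer, Integer> bigramCounts(byte[] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < data.length; i++) {
            int key = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 0x82 0xA0 is hiragana "a" in Shift_JIS: two bytes are enough for the structural check.
        System.out.println(isShiftJisPair(0x82, 0xA0));
        // A four-byte input yields only three bigrams, which is thin evidence for a statistical model.
        System.out.println(bigramCounts(new byte[]{0x61, 0x62, 0x61, 0x62}));
    }
}
```

The structural check gives a yes/no answer from a single pair, whereas the bigram counts only become informative once the input is long enough to populate the histogram.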
Available Detectors
All detectors implement org.apache.tika.detect.EncodingDetector and can be
referenced by name in JSON configuration.
| Name | Module | Description |
|---|---|---|
| | | Reads |
| | | Byte-order mark detection (UTF-8/16/32). In the default chain. |
| | | WHATWG-spec HTML charset prescan. In the default chain. |
| | | Statistical multinomial logistic regression model (~46 encodings). In the default chain. |
| | | State-machine structural prober; wraps the |
| | | Older regex-based HTML meta-charset detector. Not in the default chain (use |
| | | Wraps ICU4J |
| | | Language-aware arbitrator ( |
Configuration Examples
Exclude a detector from the default chain
    {
      "encoding-detectors": [
        {
          "default-encoding-detector": {
            "exclude": ["bom-encoding-detector"]
          }
        }
      ]
    }
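Conceptually, the exclusion above amounts to filtering a list of detector names. The sketch below assumes that exclude simply drops the named entries and leaves the order of the remaining detectors unchanged; ExcludeSketch and applyExclude are hypothetical names for illustration, not Tika APIs, and the four-entry list is only sample input.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ExcludeSketch {
    // Returns the chain without the excluded names, preserving the original order.
    static List<String> applyExclude(List<String> chain, Set<String> exclude) {
        return chain.stream()
                .filter(name -> !exclude.contains(name))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample input only; the real default chain depends on what is on the classpath.
        List<String> chain = List.of(
                "http-header-encoding-detector",
                "bom-encoding-detector",
                "standard-html-encoding-detector",
                "ml-encoding-detector");
        System.out.println(applyExclude(chain, Set.of("bom-encoding-detector")));
    }
}
```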
Restrict to a lightweight chain (no Universal, no CharSoup)
Useful in resource-constrained environments when you only need the core statistical chain:
    {
      "encoding-detectors": [
        {"http-header-encoding-detector": {}},
        {"bom-encoding-detector": {}},
        {"standard-html-encoding-detector": {}},
        {"ml-encoding-detector": {}}
      ]
    }
Configure the HTML detector’s read limit
The default limit is 8192 bytes. Raise it if your HTML documents embed
large <script> blocks before the <meta charset> declaration.
    {
      "encoding-detectors": [
        {"http-header-encoding-detector": {}},
        {"bom-encoding-detector": {}},
        {
          "standard-html-encoding-detector": {
            "markLimit": 65536
          }
        },
        {"ml-encoding-detector": {}}
      ]
    }
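Why the limit matters can be shown with plain java.io mark/reset. The sketch below scans at most limit bytes for the text charset= and then rewinds the stream so a later consumer still sees it from the start. It is an illustration only: MarkLimitSketch and charsetSeenWithin are hypothetical names, the substring search is a crude stand-in for the real WHATWG prescan, and whether Tika's detector uses mark/reset in exactly this way is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class MarkLimitSketch {
    // Looks for "charset=" in at most `limit` bytes, then rewinds the stream.
    // Returns true only if the text was seen inside that window.
    static boolean charsetSeenWithin(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] window = in.readNBytes(limit); // reads fewer bytes if the stream ends early
        in.reset();
        // ISO-8859-1 maps every byte to one char, so the ASCII search below is safe.
        String text = new String(window, StandardCharsets.ISO_8859_1);
        return text.toLowerCase(Locale.ROOT).contains("charset=");
    }

    public static void main(String[] args) throws IOException {
        // A large script block pushes the declaration past the first 8192 bytes.
        String html = "<script>" + "x".repeat(10_000) + "</script><meta charset=\"utf-8\">";
        byte[] bytes = html.getBytes(StandardCharsets.US_ASCII);
        System.out.println(charsetSeenWithin(new ByteArrayInputStream(bytes), 8192));  // false
        System.out.println(charsetSeenWithin(
                new BufferedInputStream(new ByteArrayInputStream(bytes)), 65536));     // true
    }
}
```

A larger limit lets the scan reach later declarations at the cost of buffering more bytes per document.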