TesseractOCRParser Configuration
This page documents the configuration options for TesseractOCRParser in Tika 4.x.
Basic Configuration
{
  "parsers": [
    {
      "tesseract-ocr-parser": {
        "language": "eng",
        "timeoutSeconds": 120
      }
    }
  ]
}
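The nesting is easy to get wrong by hand: `parsers` is a list whose entries are single-key objects mapping a parser name to its options. As a minimal sketch, the same structure can be generated with Python's standard `json` module. The helper name `tesseract_parser_config` is hypothetical and not part of Tika.

```python
import json


def tesseract_parser_config(**options):
    """Wrap Tesseract options in the nesting used by the example above.

    Hypothetical helper for illustration only; not part of Tika.
    """
    return {"parsers": [{"tesseract-ocr-parser": dict(options)}]}


# Reproduce the basic configuration shown above.
config = tesseract_parser_config(language="eng", timeoutSeconds=120)
print(json.dumps(config, indent=2))
```

Generating the file this way keeps brackets and braces balanced as options are added.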
Full Configuration
The following example shows all available configuration options with their default values. Comments indicate the available options for enum fields.
{
  "parsers": [
    {
      "tesseract-ocr-parser": {
        "applyRotation": false,
        "colorspace": "gray",
        "density": 300,
        "depth": 4,
        "enableImagePreprocessing": false,
        "filter": "triangle",
        "imageMagickPath": "",
        "inlineContent": false,
        "language": "eng",
        "maxFileSizeToOcr": 2147483647,
        "minFileSizeToOcr": 0,
        // Additional Tesseract configuration parameters as key-value pairs
        "otherTesseractConfig": {
          "preserve_interword_spaces": "1",
          "textord_initialx_ile": "0.75",
          "textord_noise_hfract": "0.15625"
        },
        // Options: TXT, HOCR
        "outputType": "TXT",
        "pageSeparator": "",
        "pageSegMode": "1",
        "preserveInterwordSpacing": false,
        "resize": 200,
        "skipOcr": false,
        "tessdataPath": "",
        "tesseractPath": "",
        "timeoutSeconds": 120
      }
    }
  ]
}
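Before deploying a hand-edited configuration, it can help to sanity-check the options block. The sketch below checks only two things that can be read off the example above: that `outputType` is one of the listed options, and that the minimum file size does not exceed the maximum. The function name `check_tesseract_options` is hypothetical, the exact-match comparison on `outputType` is an assumption, and none of this reflects Tika's own validation.

```python
# Allowed values taken from the "Options: TXT, HOCR" comment in the example.
ALLOWED_OUTPUT_TYPES = {"TXT", "HOCR"}


def check_tesseract_options(options):
    """Return a list of problems found in a tesseract-ocr-parser options dict.

    Illustrative checks inferred from this page; not Tika's own validation.
    """
    problems = []

    # Fall back to the defaults shown in the full example when a key is absent.
    output_type = options.get("outputType", "TXT")
    if output_type not in ALLOWED_OUTPUT_TYPES:
        problems.append(f"unsupported outputType: {output_type!r}")

    min_size = options.get("minFileSizeToOcr", 0)
    max_size = options.get("maxFileSizeToOcr", 2147483647)
    if min_size > max_size:
        problems.append("minFileSizeToOcr exceeds maxFileSizeToOcr")

    return problems


print(check_tesseract_options({"outputType": "PDF", "minFileSizeToOcr": 10, "maxFileSizeToOcr": 5}))
```

An empty list means neither check failed; it does not guarantee that Tika will accept the configuration.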
Changes from 3.x
In Tika 3.x, the otherTesseractSettings parameter was a list of strings, each holding a space-delimited key-value pair:
<!-- 3.x XML format -->
<param name="otherTesseractSettings" type="list">
  <string>textord_initialx_ile 0.75</string>
  <string>textord_noise_hfract 0.15625</string>
</param>
In Tika 4.x, it is replaced by otherTesseractConfig, a proper map in which each setting is its own key-value entry:
// 4.x JSON format
"otherTesseractConfig": {
  "textord_initialx_ile": "0.75",
  "textord_noise_hfract": "0.15625"
}
The automatic converter handles this transformation.
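For configurations that are migrated by hand, the mapping from the 3.x list to the 4.x map is mechanical. The following is a rough Python sketch of that transformation, assuming each 3.x entry is a key followed by whitespace and a value. It is not Tika's converter, and the function name `settings_list_to_map` is hypothetical.

```python
def settings_list_to_map(settings):
    """Convert 3.x-style "key value" strings into a 4.x-style map.

    Illustrative only; this is not Tika's automatic converter.
    """
    config = {}
    for entry in settings:
        # Split on the first run of whitespace: key, then the rest as value.
        parts = entry.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'key value', got {entry!r}")
        key, value = parts
        config[key] = value.strip()
    return config


legacy = ["textord_initialx_ile 0.75", "textord_noise_hfract 0.15625"]
print(settings_list_to_map(legacy))
```

Values stay strings, matching the quoted values in the 4.x example above.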
See Migrating to 4.x for general migration guidance.