TesseractOCRParser Configuration

This page documents the configuration options for TesseractOCRParser in Tika 4.x.

Basic Configuration

{
  "parsers": [
    {
      "tesseract-ocr-parser": {
        "language": "eng",
        "timeoutSeconds": 120
      }
    }
  ]
}

Full Configuration

The following example shows all available configuration options with their default values. Comments indicate the available options for enum fields.

{
  "parsers": [
    {
      "tesseract-ocr-parser": {
        "applyRotation": false,
        "colorspace": "gray",
        "density": 300,
        "depth": 4,
        "enableImagePreprocessing": false,
        "filter": "triangle",
        "imageMagickPath": "",
        "inlineContent": false,
        "language": "eng",
        "maxFileSizeToOcr": 2147483647,
        "minFileSizeToOcr": 0,
        // Additional Tesseract configuration parameters as key-value pairs
        "otherTesseractConfig": {
          "preserve_interword_spaces": "1",
          "textord_initialx_ile": "0.75",
          "textord_noise_hfract": "0.15625"
        },
        // Options: TXT, HOCR
        "outputType": "TXT",
        "pageSeparator": "",
        "pageSegMode": "1",
        "preserveInterwordSpacing": false,
        "resize": 200,
        "skipOcr": false,
        "tessdataPath": "",
        "tesseractPath": "",
        "timeoutSeconds": 120
      }
    }
  ]
}
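
The example above contains // comments, which standard JSON does not allow. If you want to inspect or validate such a file with a strict JSON parser outside of Tika, the comment lines have to be removed first. The following Python sketch shows one way to do that. It is only an illustration: load_commented_json is a hypothetical helper, not part of Tika, and it assumes comments always occupy a whole line, as they do in the example above.

```python
import json


def load_commented_json(text):
    """Parse JSON after dropping lines that consist only of a // comment.

    Hypothetical helper for illustration; it does not handle comments
    that trail other content on the same line.
    """
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith("//")
    ]
    return json.loads("\n".join(kept))


example = """
{
  "parsers": [
    {
      "tesseract-ocr-parser": {
        // Options: TXT, HOCR
        "outputType": "TXT",
        "timeoutSeconds": 120
      }
    }
  ]
}
"""

config = load_commented_json(example)
print(config["parsers"][0]["tesseract-ocr-parser"]["outputType"])
```

Filtering whole lines keeps the sketch short and avoids touching string values that happen to contain two slashes, such as paths or URLs.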

Changes from 3.x

In Tika 3.x, otherTesseractSettings was a list of strings, each holding a key and a value separated by a space:

<!-- 3.x XML format -->
<param name="otherTesseractSettings" type="list">
  <string>textord_initialx_ile 0.75</string>
  <string>textord_noise_hfract 0.15625</string>
</param>

In Tika 4.x, otherTesseractConfig replaces it with a map from parameter name to value:

// 4.x JSON format
"otherTesseractConfig": {
  "textord_initialx_ile": "0.75",
  "textord_noise_hfract": "0.15625"
}

The automatic converter handles this transformation.
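
To make the mapping concrete, here is a minimal Python sketch of the same transformation. It is not the code of Tika's converter; settings_list_to_map is a hypothetical name, and the sketch assumes each 3.x entry is a key followed by whitespace and a value.

```python
import json


def settings_list_to_map(entries):
    """Turn 3.x-style "key value" strings into a 4.x-style map.

    Hypothetical helper for illustration only.
    """
    config = {}
    for entry in entries:
        # Split on the first run of whitespace: key, then the rest as value.
        parts = entry.strip().split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'key value', got: {entry!r}")
        key, value = parts
        config[key] = value
    return config


legacy = [
    "textord_initialx_ile 0.75",
    "textord_noise_hfract 0.15625",
]
converted = {"otherTesseractConfig": settings_list_to_map(legacy)}
print(json.dumps(converted, indent=2))
```

Values stay strings, matching the quoted values in the 4.x example.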

See Migrating to 4.x for general migration guidance.