Class EncodeOCRConfig

java.lang.Object
org.apache.tika.parser.ocrencode.EncodeOCRConfig
All Implemented Interfaces:
Serializable

public class EncodeOCRConfig extends Object implements Serializable
Configuration for EncodeOCRParser. This parser base64-encodes image bytes into the XHTML output instead of running OCR text extraction locally, so the size/count limits below govern which images are accepted for encoding, not text recognition.

The *Ocr field and setter names are retained to keep the tika-config/JSON parameter names stable; treat them as "OCR-encode".

This class is not thread-safe; callers that share an instance across threads must synchronize access externally.
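One way to read "synchronized externally" is sketched below. This is only an illustration: GuardedConfigSketch and MutableLimits are hypothetical stand-ins, not Tika classes, and the two fields simply mirror the documented defaults. The idea is that every read and write of the shared, mutable settings goes through the same lock.

```java
/** Hypothetical holder that guards a mutable config with no internal locking. */
class GuardedConfigSketch {

    /** Stand-in for a non-thread-safe config object (assumption, not the Tika class). */
    static class MutableLimits {
        long maxFileSize = 104857600L; // documented default maximum size in bytes
        int maxImages = 50;            // documented default image count
    }

    private final Object lock = new Object();
    private final MutableLimits limits = new MutableLimits();

    /** Updates both limits under the lock so readers never see a half-applied change. */
    void update(long maxFileSize, int maxImages) {
        synchronized (lock) {
            limits.maxFileSize = maxFileSize;
            limits.maxImages = maxImages;
        }
    }

    /** Returns {maxFileSize, maxImages} read under the same lock. */
    long[] snapshot() {
        synchronized (lock) {
            return new long[] {limits.maxFileSize, limits.maxImages};
        }
    }

    public static void main(String[] args) throws InterruptedException {
        GuardedConfigSketch guarded = new GuardedConfigSketch();
        Thread writer = new Thread(() -> guarded.update(1024L, 5));
        writer.start();
        writer.join();
        long[] values = guarded.snapshot();
        System.out.println(values[0] + " " + values[1]); // prints "1024 5"
    }
}
```

Giving each thread its own configuration instance avoids the need for a lock altogether, at the cost of keeping the copies consistent.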

  • Field Details

    • DEFAULT_MAX_FILE_SIZE_TO_OCR

      public static final long DEFAULT_MAX_FILE_SIZE_TO_OCR
  • Constructor Details

    • EncodeOCRConfig

      public EncodeOCRConfig()
  • Method Details

    • setInlineContent

      public void setInlineContent(boolean inlineContent)
    • isInlineContent

      public boolean isInlineContent()
    • getMinFileSizeToOcr

      public long getMinFileSizeToOcr()
    • setMinFileSizeToOcr

      public void setMinFileSizeToOcr(long minFileSizeToOcr)
      Set the minimum image size (in bytes) accepted for base64 encoding. Images smaller than this are skipped. Default is 0 (no lower bound).
    • getMaxFileSizeToOcr

      public long getMaxFileSizeToOcr()
    • setMaxFileSizeToOcr

      public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
      Set the maximum image size (in bytes) accepted for base64 encoding. Images larger than this are skipped. Default is 104857600 bytes (100 MiB).
    • isSkipOcr

      public boolean isSkipOcr()
    • setSkipOcr

      public void setSkipOcr(boolean skipOcr)
      If set to true, disables base64 encoding at runtime: the parser reports no supported types and parse() is a no-op. Use this to turn the parser off for a specific file without rewiring tika-config.
      Parameters:
      skipOcr - true to disable base64 encoding; false to leave the parser enabled
    • getMaxImagesToOcr

      public int getMaxImagesToOcr()
    • setMaxImagesToOcr

      public void setMaxImagesToOcr(int maxImagesToOcr)
      Sets the maximum number of images to base64-encode per parse (across the whole document, tracked via ParseContext). Images beyond this count are skipped. Default is 50.
      Parameters:
      maxImagesToOcr - maximum number of images to encode; must be >= 0
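
The following sketch shows how the documented limits might interact when deciding which images get encoded. It is an illustration only: EncodeLimitsSketch and encodeAccepted are hypothetical names, not part of Tika, and the real parser may apply the checks differently. Two assumptions are made here: an image whose size equals either bound is accepted, and only images that are actually encoded count toward the image limit.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Hypothetical stand-in that mirrors the documented settings and defaults. */
class EncodeLimitsSketch {
    public long minFileSizeToOcr = 0L;         // default: no lower bound
    public long maxFileSizeToOcr = 104857600L; // default maximum size in bytes
    public int maxImagesToOcr = 50;            // default per-document image count
    public boolean skipOcr = false;            // true disables encoding entirely

    /** Returns the base64 text of each image that passes the limits, in input order. */
    public List<String> encodeAccepted(List<byte[]> images) {
        List<String> encoded = new ArrayList<>();
        if (skipOcr) {
            return encoded; // encoding is switched off: nothing is produced
        }
        for (byte[] image : images) {
            if (encoded.size() >= maxImagesToOcr) {
                break; // image budget for this document is used up (assumed counting rule)
            }
            long size = image.length;
            if (size < minFileSizeToOcr || size > maxFileSizeToOcr) {
                continue; // outside the accepted size window
            }
            encoded.add(Base64.getEncoder().encodeToString(image));
        }
        return encoded;
    }

    public static void main(String[] args) {
        EncodeLimitsSketch limits = new EncodeLimitsSketch();
        limits.minFileSizeToOcr = 2;
        limits.maxFileSizeToOcr = 4;

        List<byte[]> images = new ArrayList<>();
        images.add(new byte[] {9});       // 1 byte: below the minimum, skipped
        images.add(new byte[] {1, 2, 3}); // 3 bytes: inside the window, encoded
        images.add(new byte[5]);          // 5 bytes: above the maximum, skipped

        System.out.println(limits.encodeAccepted(images)); // prints "[AQID]"
    }
}
```

Setting maxImagesToOcr to 0 in this sketch yields an empty result, which matches the documented constraint that the value must be >= 0.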