Class Tess4JConfig
java.lang.Object
org.apache.tika.parser.ocr.tess4j.Tess4JConfig
- All Implemented Interfaces:
Serializable
- Direct Known Subclasses:
Tess4JConfig.RuntimeConfig
Configuration for Tess4JParser.
This class is not thread-safe and must be synchronized externally.
-
Nested Class Summary
Nested Classes
Modifier and Type   Class   Description
static class   Tess4JConfig.RuntimeConfig   Runtime-only Tess4JConfig that prevents modification of paths and pool settings during parse-time configuration.
-
Constructor Summary
Constructors
Tess4JConfig()
-
Method Summary
Modifier and Type   Method   Description
int   getDpi()
long   getMaxFileSizeToOcr()
long   getMaxImagePixels()
long   getMinFileSizeToOcr()
int   getOcrEngineMode()
int   getPageSegMode()
int   getPoolSize()
int   getTimeoutSeconds()
boolean   isSkipOcr()
void   setDataPath(String dataPath)   Set the path to the tessdata directory.
void   setDpi(int dpi)   Set the DPI for image rendering.
void   setLanguage(String language)   Set tesseract language dictionary to be used.
void   setMaxFileSizeToOcr(long maxFileSizeToOcr)
void   setMaxImagePixels(long maxImagePixels)   Set the maximum total pixels (width × height) allowed for an image before OCR is skipped.
void   setMinFileSizeToOcr(long minFileSizeToOcr)
void   setNativeLibPath(String nativeLibPath)   Set the path to the directory containing native Tesseract/Leptonica shared libraries.
void   setOcrEngineMode(int ocrEngineMode)   Set OCR Engine Mode.
void   setPageSegMode(int pageSegMode)   Set tesseract page segmentation mode.
void   setPoolSize(int poolSize)   Set the number of Tesseract instances to keep in the pool.
void   setSkipOcr(boolean skipOcr)
void   setTimeoutSeconds(int timeoutSeconds)   Set maximum time (seconds) to wait for a pooled Tesseract instance.
-
Constructor Details
-
Tess4JConfig
public Tess4JConfig()
-
-
Method Details
-
getLanguage
-
setLanguage
Set the Tesseract language dictionary to use. Default is "eng". Multiple languages may be specified, separated by plus characters, e.g. "eng+fra".
-
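The plus-separated language format can be illustrated with a small standalone sketch. LanguageSpec and its methods are hypothetical names, not part of Tika or Tess4J; the only outside fact assumed is that Tesseract stores each language's model in a file named after the language code with a .traineddata suffix inside the tessdata directory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Tika or Tess4J. */
final class LanguageSpec {
    private LanguageSpec() {}

    /** Splits a spec such as "eng+fra" into its individual language codes. */
    static List<String> codes(String spec) {
        List<String> codes = new ArrayList<>();
        for (String part : spec.split("\\+")) {
            String code = part.trim();
            if (!code.isEmpty()) {
                codes.add(code);
            }
        }
        return codes;
    }

    /** Maps each language code to the model file name expected in tessdata. */
    static List<String> trainedDataFiles(String spec) {
        List<String> files = new ArrayList<>();
        for (String code : codes(spec)) {
            files.add(code + ".traineddata");
        }
        return files;
    }
}
```

A caller could use trainedDataFiles("eng+fra") to check that the directory passed to setDataPath contains both model files before starting a parse.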
getDataPath
-
setDataPath
Set the path to the tessdata directory.
- Throws:
TikaConfigException
-
getPageSegMode
public int getPageSegMode()
-
setPageSegMode
public void setPageSegMode(int pageSegMode)
Set tesseract page segmentation mode. Default is 1.
-
getOcrEngineMode
public int getOcrEngineMode()
-
setOcrEngineMode
public void setOcrEngineMode(int ocrEngineMode)
Set OCR Engine Mode. Default is 3.
-
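For orientation, Tesseract's command-line documentation numbers page segmentation modes 0 through 13 and engine modes 0 through 3. The default of 1 above corresponds to automatic page segmentation with orientation and script detection, and 3 selects the default engine based on what is available. The sketch below is a hypothetical range check. Whether Tess4JConfig itself validates these values is not stated on this page, so TesseractModes is an assumption for illustration only.

```java
/** Hypothetical range checks; not part of Tika or Tess4J. */
final class TesseractModes {
    static final int MIN_PSM = 0;   // orientation and script detection only
    static final int MAX_PSM = 13;  // raw line
    static final int MIN_OEM = 0;   // legacy engine only
    static final int MAX_OEM = 3;   // default, based on what is available

    private TesseractModes() {}

    /** Returns true if psm is within Tesseract's documented 0..13 range. */
    static boolean isValidPageSegMode(int psm) {
        return psm >= MIN_PSM && psm <= MAX_PSM;
    }

    /** Returns true if oem is within Tesseract's documented 0..3 range. */
    static boolean isValidOcrEngineMode(int oem) {
        return oem >= MIN_OEM && oem <= MAX_OEM;
    }
}
```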
getMaxFileSizeToOcr
public long getMaxFileSizeToOcr()
-
setMaxFileSizeToOcr
public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
-
getMinFileSizeToOcr
public long getMinFileSizeToOcr()
-
setMinFileSizeToOcr
public void setMinFileSizeToOcr(long minFileSizeToOcr)
-
getPoolSize
public int getPoolSize()
-
setPoolSize
public void setPoolSize(int poolSize)
Set the number of Tesseract instances to keep in the pool. Default is 2. Must be at least 1.
-
getTimeoutSeconds
public int getTimeoutSeconds()
-
setTimeoutSeconds
public void setTimeoutSeconds(int timeoutSeconds)
Set maximum time (seconds) to wait for a pooled Tesseract instance. Default is 120.
-
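How a pool size and a wait timeout interact can be sketched with a bounded blocking queue. This is a generic illustration under stated assumptions, not the parser's actual pooling code: InstancePool, borrow, and giveBack are hypothetical names, and the real implementation may report a timeout differently (for example by throwing).

```java
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; not part of Tika or Tess4J. */
final class InstancePool<T> {
    private final BlockingQueue<T> idle;

    InstancePool(int poolSize, Supplier<T> factory) {
        if (poolSize < 1) {
            throw new IllegalArgumentException("poolSize must be at least 1");
        }
        idle = new ArrayBlockingQueue<>(poolSize);
        // Create every instance up front so borrowers never construct one.
        for (int i = 0; i < poolSize; i++) {
            idle.add(factory.get());
        }
    }

    /** Waits up to the timeout for an idle instance; empty if none became free. */
    Optional<T> borrow(long timeout, TimeUnit unit) {
        try {
            return Optional.ofNullable(idle.poll(timeout, unit));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Optional.empty();
        }
    }

    /** Returns a previously borrowed instance to the pool. */
    void giveBack(T instance) {
        idle.offer(instance);
    }
}
```

With a pool size of 2, a third concurrent caller waits until one of the first two returns its instance or the timeout elapses, which is the trade-off the two settings control together.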
isSkipOcr
public boolean isSkipOcr()
-
setSkipOcr
public void setSkipOcr(boolean skipOcr)
-
getDpi
public int getDpi()
-
setDpi
public void setDpi(int dpi)
Set the DPI for image rendering. Default is 300.
-
getMaxImagePixels
public long getMaxImagePixels()
-
setMaxImagePixels
public void setMaxImagePixels(long maxImagePixels)
Set the maximum total pixels (width × height) allowed for an image before OCR is skipped. Default is 100,000,000 (100 megapixels). Set to -1 for no limit (not recommended).
-
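The width × height check can be sketched as follows. PixelLimit is a hypothetical helper, and two details are assumptions rather than documented behaviour: an image exactly at the limit is treated as allowed, and any negative limit (not only -1) is treated as "no limit".

```java
/** Hypothetical pixel-limit check; not part of Tika or Tess4J. */
final class PixelLimit {
    /** Default taken from the setMaxImagePixels description above. */
    static final long DEFAULT_MAX_IMAGE_PIXELS = 100_000_000L;

    private PixelLimit() {}

    /** Returns true if the image has more pixels than the limit allows. */
    static boolean exceeds(int width, int height, long maxImagePixels) {
        if (maxImagePixels < 0) {
            return false; // assumption: a negative value such as -1 disables the limit
        }
        // Widen to long before multiplying; int arithmetic overflows on large images.
        long pixels = (long) width * (long) height;
        return pixels > maxImagePixels;
    }
}
```

For scale, a US Letter page (8.5 × 11 inches) rendered at 300 DPI is 2550 × 3300 = 8,415,000 pixels, well under the default limit.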
getNativeLibPath
-
setNativeLibPath
Set the path to the directory containing native Tesseract/Leptonica shared libraries. On macOS with Homebrew this is typically /opt/homebrew/lib.
- Throws:
TikaConfigException
-