Package org.apache.tika.parser.ocr
Class TesseractOCRConfig
- java.lang.Object
  - org.apache.tika.parser.ocr.TesseractOCRConfig
 
- 
- All Implemented Interfaces:
- Serializable
 
public class TesseractOCRConfig extends Object implements Serializable

Configuration for TesseractOCRParser. This class is not thread safe and must be synchronized externally.

This class remembers every field that has been set through a set* method for the lifetime of the object. On cloneAndUpdate(TesseractOCRConfig), it applies all of the fields that have ever been set on the "update" config. For example, suppose you change the language from "eng" to "fra" on an update config and then, for a later parse, change the depth to 5 on the same update object. If you expect the language to revert to "eng", you will be wrong: it is still "fra". Create a new update config for each parse unless you are only changing the same field(s) with every parse.
- See Also:
- Serialized Form
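The "remembered fields" pitfall described above can be illustrated with a small self-contained sketch. The SketchConfig class below is hypothetical and is not Tika's implementation; it only mimics the documented behaviour, with two fields, where each setter records that its field was set and cloneAndUpdate applies every field ever set on the update object.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in that mimics the documented behaviour; NOT Tika's actual class.
final class SketchConfig {
    private String language = "eng";
    private int depth = 4;
    // Names of the fields that have ever been set on this instance.
    private final Set<String> userSet = new HashSet<>();

    void setLanguage(String language) {
        this.language = language;
        userSet.add("language");
    }

    void setDepth(int depth) {
        this.depth = depth;
        userSet.add("depth");
    }

    String getLanguage() { return language; }

    int getDepth() { return depth; }

    // Returns a copy of this config with every field ever set on 'updates' applied.
    SketchConfig cloneAndUpdate(SketchConfig updates) {
        SketchConfig merged = new SketchConfig();
        merged.language = this.language;
        merged.depth = this.depth;
        if (updates.userSet.contains("language")) {
            merged.language = updates.language;
        }
        if (updates.userSet.contains("depth")) {
            merged.depth = updates.depth;
        }
        return merged;
    }

    public static void main(String[] args) {
        SketchConfig base = new SketchConfig();
        SketchConfig update = new SketchConfig();

        update.setLanguage("fra");
        System.out.println(base.cloneAndUpdate(update).getLanguage()); // fra

        // Reusing the same update object: the language is still remembered.
        update.setDepth(5);
        SketchConfig second = base.cloneAndUpdate(update);
        System.out.println(second.getLanguage() + " " + second.getDepth()); // fra 5
    }
}
```

In this sketch the second parse still runs with "fra" because the update object never forgets that its language was set, which is why a fresh update config per parse is the safer habit.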
 
- 
- 
Nested Class Summary
- static class TesseractOCRConfig.OUTPUT_TYPE
 - 
Constructor Summary
- TesseractOCRConfig()
 - 
Method Summary
- void addOtherTesseractConfig(String key, String value): Add a key-value pair to pass to Tesseract using its -c command line option.
- TesseractOCRConfig cloneAndUpdate(TesseractOCRConfig updates)
- String getColorspace()
- int getDensity()
- int getDepth()
- String getFilter()
- static void getLangs(String language, Set<String> validLangs, Set<String> invalidLangs): This takes a language string, parses it and then bins individual langs into valid or invalid based on regexes against the language codes.
- String getLanguage()
- long getMaxFileSizeToOcr()
- long getMinFileSizeToOcr()
- Map<String,String> getOtherTesseractConfig()
- TesseractOCRConfig.OUTPUT_TYPE getOutputType()
- String getPageSegMode()
- String getPageSeparator()
- int getResize()
- int getTimeoutSeconds()
- boolean isApplyRotation()
- boolean isEnableImagePreprocessing()
- boolean isPreserveInterwordSpacing()
- boolean isSkipOcr()
- void setApplyRotation(boolean applyRotation): Sets whether or not a rotation value should be calculated and passed to ImageMagick.
- void setColorspace(String colorspace)
- void setDensity(int density)
- void setDepth(int depth)
- void setEnableImagePreprocessing(boolean enableImagePreprocessing): Set the value to true if processing is to be enabled.
- void setFilter(String filter)
- void setLanguage(String languageString): Set tesseract language dictionary to be used.
- void setMaxFileSizeToOcr(long maxFileSizeToOcr): Set maximum file size to submit file to ocr.
- void setMinFileSizeToOcr(long minFileSizeToOcr): Set minimum file size to submit file to ocr.
- void setOutputType(String outputType)
- void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType): Set output type from ocr process.
- void setPageSegMode(String pageSegMode): Set tesseract page segmentation mode.
- void setPageSeparator(String pageSeparator): The page separator to use in plain text output.
- void setPreserveInterwordSpacing(boolean preserveInterwordSpacing): Whether or not to maintain interword spacing.
- void setResize(int resize)
- void setSkipOcr(boolean skipOcr): If you want to turn off OCR at run time for a specific file, set this to true.
- void setTimeoutSeconds(int timeoutSeconds): Set maximum time (seconds) to wait for the ocring process to terminate.
- void setTrustedPageSeparator(String pageSeparator): Same as setPageSeparator(String) but does not perform any checks on the string.
 
- 
- 
- 
Method Detail
getLangs
public static void getLangs(String language, Set<String> validLangs, Set<String> invalidLangs)
Takes a language string, parses it, and bins the individual languages into valid or invalid based on regexes against the language codes.
- Parameters:
- language - the language string to parse
- validLangs - receives the languages that pass the checks
- invalidLangs - receives the languages that fail the checks
 
 - 
getLanguage
public String getLanguage()
- See Also:
- setLanguage(String language)
 
 - 
setLanguage
public void setLanguage(String languageString)
Set the Tesseract language dictionary to be used. Default is "eng". Languages are either:
- Nominally an ISO-639-2 code, but compound codes separated by underscores are allowed: e.g., chi_tra_vert, aze_cyrl
- A file path in the script directory. The name starts with an upper-case letter, and some names contain underscores and further upper-case letters: e.g., script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal
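The two language forms listed above, together with the valid/invalid binning that getLangs performs, can be sketched as follows. This is a minimal illustration and not Tika's code: the two regular expressions are assumptions modelled on the examples given here, and splitting on '+' assumes that several languages are joined the way the tesseract command line accepts them (for example eng+fra).

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of language binning; the patterns are assumptions, not Tika's.
final class LangBinningSketch {
    // Three lower-case letters, optionally followed by underscore-separated suffixes.
    private static final Pattern CODE = Pattern.compile("[a-z]{3}(_[a-z]+)*");
    // A name in the script directory that starts with an upper-case letter.
    private static final Pattern SCRIPT = Pattern.compile("script/[A-Z][A-Za-z_]*");

    static void binLangs(String language, Set<String> validLangs, Set<String> invalidLangs) {
        // Assumption: multiple languages are separated by '+'.
        for (String lang : language.split("\\+")) {
            if (CODE.matcher(lang).matches() || SCRIPT.matcher(lang).matches()) {
                validLangs.add(lang);
            } else {
                invalidLangs.add(lang);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> valid = new LinkedHashSet<>();
        Set<String> invalid = new LinkedHashSet<>();
        binLangs("eng+chi_tra_vert+script/HanS_vert+../evil", valid, invalid);
        System.out.println(valid);   // [eng, chi_tra_vert, script/HanS_vert]
        System.out.println(invalid); // [../evil]
    }
}
```

Collecting the rejects in a separate set, rather than failing on the first one, lets a caller report every bad language in a single message.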
 
 - 
getPageSegMode
public String getPageSegMode()
- See Also:
- setPageSegMode(String pageSegMode)
 
 - 
setPageSegMode
public void setPageSegMode(String pageSegMode)
Set the Tesseract page segmentation mode. Default is 1 (automatic page segmentation with OSD, Orientation and Script Detection).
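Because the page segmentation mode is passed as a String, a caller may want to check it before handing it over. The sketch below is an assumption-laden illustration: it accepts the modes 0 through 13 that Tesseract 4 and later document, and the class and method names are hypothetical. Whether and how setPageSegMode itself validates its argument is not stated on this page.

```java
// Hypothetical pre-check for a page segmentation mode given as a String.
final class PageSegModeSketch {
    // Assumption: valid modes are the integers 0..13, written without leading zeros.
    static boolean isValidPageSegMode(String pageSegMode) {
        return pageSegMode != null && pageSegMode.matches("[0-9]|1[0-3]");
    }

    public static void main(String[] args) {
        System.out.println(isValidPageSegMode("1"));    // true (the documented default)
        System.out.println(isValidPageSegMode("13"));   // true
        System.out.println(isValidPageSegMode("14"));   // false
        System.out.println(isValidPageSegMode("auto")); // false
    }
}
```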
 - 
getPageSeparator
public String getPageSeparator()
- See Also:
- setPageSeparator(String pageSeparator)
 
 - 
setPageSeparator
public void setPageSeparator(String pageSeparator)
The page separator to use in plain text output. This corresponds to Tesseract's page_separator config option. The default here is the empty string (i.e., no page separators). This is also the default in Tesseract 3.x, but Tesseract 4.0 defaults to the form feed control character, so this class overrides Tesseract 4.0's default.
- Parameters:
- pageSeparator - the page separator string
 
 - 
setTrustedPageSeparator
public void setTrustedPageSeparator(String pageSeparator)
Same as setPageSeparator(String) but does not perform any checks on the string.
- Parameters:
- pageSeparator - the page separator string, used without checks
 
 - 
isPreserveInterwordSpacing
public boolean isPreserveInterwordSpacing()
- Returns:
- whether or not to maintain interword spacing.
 
 - 
setPreserveInterwordSpacing
public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
Whether or not to maintain interword spacing. Default is false.
- Parameters:
- preserveInterwordSpacing - true to maintain interword spacing
 
 - 
getMinFileSizeToOcr
public long getMinFileSizeToOcr()
 - 
setMinFileSizeToOcr
public void setMinFileSizeToOcr(long minFileSizeToOcr)
Set the minimum file size for a file to be submitted to OCR. Default is 0.
 - 
getMaxFileSizeToOcr
public long getMaxFileSizeToOcr()
 - 
setMaxFileSizeToOcr
public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
Set the maximum file size for a file to be submitted to OCR. Default is Integer.MAX_VALUE.
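The minimum and maximum file sizes act together as a gate in front of OCR. A minimal sketch of such a gate follows. It assumes that sizes are measured in bytes and that both bounds are inclusive, neither of which is stated here, and the class and method names are hypothetical.

```java
// Hypothetical size gate; inclusive bounds and byte units are assumptions.
final class OcrSizeGateSketch {
    static boolean withinOcrLimits(long fileSize, long minFileSizeToOcr, long maxFileSizeToOcr) {
        return fileSize >= minFileSizeToOcr && fileSize <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        long min = 0L;                // documented default minimum
        long max = Integer.MAX_VALUE; // documented default maximum, widened to long

        System.out.println(withinOcrLimits(1_024L, min, max));         // true
        System.out.println(withinOcrLimits(3_000_000_000L, min, max)); // false
    }
}
```

Note that although the setters take a long, the default maximum is Integer.MAX_VALUE (2,147,483,647), so under the byte assumption a file of roughly 2 GiB or more is not submitted unless the maximum is raised.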
 - 
getTimeoutSeconds
public int getTimeoutSeconds()
- Returns:
- timeout value for Tesseract
- See Also:
- setTimeoutSeconds(int timeout)
 
 - 
setTimeoutSeconds
public void setTimeoutSeconds(int timeoutSeconds)
Set the maximum time (in seconds) to wait for the OCR process to terminate. Default value is 120 seconds.
 - 
getOutputType
public TesseractOCRConfig.OUTPUT_TYPE getOutputType()
- See Also:
- setOutputType(OUTPUT_TYPE outputType)
 
 - 
setOutputType
public void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)
Set the output type from the OCR process. The default is TesseractOCRConfig.OUTPUT_TYPE.TXT ("txt"); "hocr" is also supported.
 - 
setOutputType
public void setOutputType(String outputType)
 - 
isEnableImagePreprocessing
public boolean isEnableImagePreprocessing()
- Returns:
- whether image preprocessing is enabled
- See Also:
- setEnableImagePreprocessing(boolean)
 
 - 
setEnableImagePreprocessing
public void setEnableImagePreprocessing(boolean enableImagePreprocessing)
Set the value to true if image preprocessing is to be enabled. Default value is false.
 - 
getDensity
public int getDensity()
- Returns:
- the density
 
 - 
setDensity
public void setDensity(int density)
- Parameters:
- density - the density to set. Valid range of values is 150-1200. Default value is 300.
 
 - 
getDepth
public int getDepth()
- Returns:
- the depth
 
 - 
setDepth
public void setDepth(int depth)
- Parameters:
- depth - the depth to set. Valid values are 2, 4, 8, 16, 32, 64, 256, 4096. Default value is 4.
 
 - 
getColorspace
public String getColorspace()
- Returns:
- the colorspace
 
 - 
setColorspace
public void setColorspace(String colorspace)
- Parameters:
- colorspace - the colorspace to set. Default value is gray.
 
 - 
getFilter
public String getFilter()
- Returns:
- the filter
 
 - 
setFilter
public void setFilter(String filter)
- Parameters:
- filter - the filter to set. Valid values are point, hermite, cubic, box, gaussian, catrom, triangle, quadratic and mitchell. Default value is triangle.
 
 - 
isSkipOcr
public boolean isSkipOcr()
 - 
setSkipOcr
public void setSkipOcr(boolean skipOcr)
If you want to turn off OCR at run time for a specific file, set this to true.
- Parameters:
- skipOcr - true to skip OCR
 
 - 
getResize
public int getResize()
- Returns:
- the resize
 
 - 
setResize
public void setResize(int resize)
- Parameters:
- resize - the resize to set. Valid range of values is 100-900. Default value is 900.
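The valid values documented for setDensity, setDepth, setFilter, and setResize can be checked up front before a config is built. The sketch below is a hypothetical helper, not part of Tika. It treats both ranges as inclusive and matches filter names exactly in lower case, which are assumptions; how the real setters react to out-of-range values is not stated on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-checks for the image preprocessing settings documented above.
final class PreprocessingLimitsSketch {
    private static final Set<Integer> DEPTHS =
            new HashSet<>(Arrays.asList(2, 4, 8, 16, 32, 64, 256, 4096));
    private static final Set<String> FILTERS = new HashSet<>(Arrays.asList(
            "point", "hermite", "cubic", "box", "gaussian",
            "catrom", "triangle", "quadratic", "mitchell"));

    // Documented range 150-1200, assumed inclusive.
    static boolean isValidDensity(int density) {
        return density >= 150 && density <= 1200;
    }

    static boolean isValidDepth(int depth) {
        return DEPTHS.contains(depth);
    }

    // Assumption: names are compared exactly, in lower case.
    static boolean isValidFilter(String filter) {
        return filter != null && FILTERS.contains(filter);
    }

    // Documented range 100-900, assumed inclusive.
    static boolean isValidResize(int resize) {
        return resize >= 100 && resize <= 900;
    }

    public static void main(String[] args) {
        // The documented defaults all pass their own checks.
        System.out.println(isValidDensity(300));       // true
        System.out.println(isValidDepth(4));           // true
        System.out.println(isValidFilter("triangle")); // true
        System.out.println(isValidResize(900));        // true
        // Depth is a fixed set rather than a range, so 128 is rejected.
        System.out.println(isValidDepth(128));         // false
    }
}
```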
 
 - 
isApplyRotation
public boolean isApplyRotation()
- Returns:
- whether or not a rotation value should be calculated and passed to ImageMagick before performing OCR.
 
 - 
setApplyRotation
public void setApplyRotation(boolean applyRotation)
Sets whether or not a rotation value should be calculated and passed to ImageMagick.
- Parameters:
- applyRotation - true to calculate and apply rotation, false to skip. Default is false.
 
 - 
getOtherTesseractConfig
public Map<String,String> getOtherTesseractConfig()
- See Also:
- addOtherTesseractConfig(String, String)
 
 - 
addOtherTesseractConfig
public void addOtherTesseractConfig(String key, String value)
Add a key-value pair to pass to Tesseract using its -c command line option. To see the possible options, run tesseract --print-parameters.
You may also add these parameters in TesseractOCRConfig.properties; any key-value pair in the properties file where the key contains an underscore is passed directly to Tesseract.
- Parameters:
- key - the Tesseract parameter name
- value - the Tesseract parameter value
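To make the pass-through rule concrete, the sketch below filters a Properties object down to the keys that contain an underscore and expands each pair into Tesseract's -c key=value form. It is an illustration under stated assumptions, not Tika's implementation: the class and method names are hypothetical, and the sorted map is used only to give a stable argument order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch of the underscore rule and the -c expansion.
final class OtherTesseractConfigSketch {
    // Keeps only the properties whose key contains an underscore.
    static Map<String, String> passThrough(Properties props) {
        Map<String, String> other = new TreeMap<>(); // sorted for a stable order
        for (String key : props.stringPropertyNames()) {
            if (key.contains("_")) {
                other.put(key, props.getProperty(key));
            }
        }
        return other;
    }

    // Expands each pair into the two arguments "-c" and "key=value".
    static List<String> toCommandLineArgs(Map<String, String> other) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : other.entrySet()) {
            args.add("-c");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("language", "eng"); // no underscore, so not passed through
        props.setProperty("preserve_interword_spaces", "1");
        props.setProperty("tessedit_char_whitelist", "0123456789");

        // [-c, preserve_interword_spaces=1, -c, tessedit_char_whitelist=0123456789]
        System.out.println(toCommandLineArgs(passThrough(props)));
    }
}
```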
 
 - 
cloneAndUpdate
public TesseractOCRConfig cloneAndUpdate(TesseractOCRConfig updates) throws TikaException
- Throws:
- TikaException
 
 
- 
 
-