Package org.apache.tika.parser.ocr
Class TesseractOCRConfig
- java.lang.Object
-
- org.apache.tika.parser.ocr.TesseractOCRConfig
-
- All Implemented Interfaces:
Serializable
public class TesseractOCRConfig extends Object implements Serializable
Configuration for TesseractOCRParser. This class is not thread safe and must be synchronized externally.
This class remembers every field set through a set* method for as long as the object lives, and cloneAndUpdate(TesseractOCRConfig) applies all fields that have ever been set on the "update" config. So, for example, if you change the language from "eng" to "fra" and then, for another parse, change the depth to 5 on the same update object, the language will not revert to "eng"; it stays "fra". Create a new update config for each parse unless you're only changing the same field(s) with every parse.
- See Also:
- Serialized Form
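The cloneAndUpdate pitfall described in the class summary can be illustrated with a small stand-in class. This is a minimal sketch, not Tika's implementation: the class name StickyConfigSketch and the userSet bookkeeping are hypothetical and only model the documented behavior that fields set on an update object are never forgotten.

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in for illustration only; this is NOT Tika's TesseractOCRConfig.
class StickyConfigSketch {
    private String language = "eng"; // documented default language
    private int depth = 4;           // documented default depth
    // Names of fields that were explicitly set; never cleared (hypothetical bookkeeping).
    private final Set<String> userSet = new HashSet<>();

    void setLanguage(String language) {
        this.language = language;
        userSet.add("language");
    }

    void setDepth(int depth) {
        this.depth = depth;
        userSet.add("depth");
    }

    String getLanguage() {
        return language;
    }

    int getDepth() {
        return depth;
    }

    // Copy this config, then overwrite every field that was ever set on "updates".
    StickyConfigSketch cloneAndUpdate(StickyConfigSketch updates) {
        StickyConfigSketch copy = new StickyConfigSketch();
        copy.language = this.language;
        copy.depth = this.depth;
        if (updates.userSet.contains("language")) {
            copy.language = updates.language;
        }
        if (updates.userSet.contains("depth")) {
            copy.depth = updates.depth;
        }
        return copy;
    }

    public static void main(String[] args) {
        StickyConfigSketch base = new StickyConfigSketch();
        StickyConfigSketch updates = new StickyConfigSketch();

        updates.setLanguage("fra"); // first parse: change the language
        StickyConfigSketch first = base.cloneAndUpdate(updates);

        updates.setDepth(5); // second parse: same update object, language is still set
        StickyConfigSketch second = base.cloneAndUpdate(updates);

        System.out.println(first.getLanguage() + " " + first.getDepth());
        System.out.println(second.getLanguage() + " " + second.getDepth());
    }
}
```

In this sketch the second result still carries "fra" because the update object was reused. Creating a fresh update object for the second parse would leave the language at the base value.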
-
-
Nested Class Summary
Modifier and Type, Class, Description:
- static class TesseractOCRConfig.OUTPUT_TYPE
-
Constructor Summary
Constructor, Description:
- TesseractOCRConfig()
-
Method Summary
Modifier and Type, Method, Description:
- void addOtherTesseractConfig(String key, String value): Add a key-value pair to pass to Tesseract using its -c command line option.
- TesseractOCRConfig cloneAndUpdate(TesseractOCRConfig updates)
- String getColorspace()
- int getDensity()
- int getDepth()
- String getFilter()
- static void getLangs(String language, Set<String> validLangs, Set<String> invalidLangs): This takes a language string, parses it, and then bins the individual languages into valid or invalid based on regexes against the language codes.
- String getLanguage()
- long getMaxFileSizeToOcr()
- long getMinFileSizeToOcr()
- Map<String,String> getOtherTesseractConfig()
- TesseractOCRConfig.OUTPUT_TYPE getOutputType()
- String getPageSegMode()
- String getPageSeparator()
- int getResize()
- int getTimeoutSeconds()
- boolean isApplyRotation()
- boolean isEnableImagePreprocessing()
- boolean isInlineContent()
- boolean isPreserveInterwordSpacing()
- boolean isSkipOcr()
- void setApplyRotation(boolean applyRotation): Sets whether or not a rotation value should be calculated and passed to ImageMagick.
- void setColorspace(String colorspace)
- void setDensity(int density)
- void setDepth(int depth)
- void setEnableImagePreprocessing(boolean enableImagePreprocessing): Set the value to true if processing is to be enabled.
- void setFilter(String filter)
- void setInlineContent(boolean inlineContent)
- void setLanguage(String languageString): Set tesseract language dictionary to be used.
- void setMaxFileSizeToOcr(long maxFileSizeToOcr): Set maximum file size to submit file to ocr.
- void setMinFileSizeToOcr(long minFileSizeToOcr): Set minimum file size to submit file to ocr.
- void setOutputType(String outputType)
- void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType): Set output type from ocr process.
- void setPageSegMode(String pageSegMode): Set tesseract page segmentation mode.
- void setPageSeparator(String pageSeparator): The page separator to use in plain text output.
- void setPreserveInterwordSpacing(boolean preserveInterwordSpacing): Whether or not to maintain interword spacing.
- void setResize(int resize)
- void setSkipOcr(boolean skipOcr): If you want to turn off OCR at run time for a specific file, set this to true.
- void setTimeoutSeconds(int timeoutSeconds): Set maximum time (seconds) to wait for the ocring process to terminate.
- void setTrustedPageSeparator(String pageSeparator): Same as setPageSeparator(String) but does not perform any checks on the string.
-
-
-
Method Detail
-
getLangs
public static void getLangs(String language, Set<String> validLangs, Set<String> invalidLangs)
This takes a language string, parses it, and then bins the individual languages into valid or invalid based on regexes against the language codes.
- Parameters:
language
validLangs
invalidLangs
-
getLanguage
public String getLanguage()
- See Also:
setLanguage(String language)
-
setLanguage
public void setLanguage(String languageString)
Set tesseract language dictionary to be used. Default is "eng". Languages are either:
- Nominally an ISO-639-2 code, but compound codes are allowed, separated by underscores: e.g., chi_tra_vert, aze_cyrl
- A file path in the script directory. The name starts with an upper-case letter. Some names have underscores and other upper-case letters: e.g., script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal
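The two language forms listed above, and the valid/invalid binning that getLangs describes, can be sketched with plain regular expressions. This is a minimal sketch under stated assumptions: the class LanguageBinSketch and both patterns are hypothetical (the regexes Tika actually uses may differ), and it assumes several languages are joined with '+' as on the Tesseract command line.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

class LanguageBinSketch {
    // Illustrative patterns only; the regexes Tika actually uses may differ.
    // Form 1: three-letter code plus up to two underscore suffixes, e.g. chi_tra_vert.
    private static final Pattern CODE =
            Pattern.compile("[a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2}");
    // Form 2: a path in the script directory, e.g. script/HanS_vert.
    private static final Pattern SCRIPT =
            Pattern.compile("script/[A-Z][a-zA-Z_]+");

    // Assumption: languages are combined with '+', as in "tesseract -l eng+fra".
    static void bin(String language, Set<String> valid, Set<String> invalid) {
        for (String lang : language.split("\\+")) {
            String trimmed = lang.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty segments
            }
            if (CODE.matcher(trimmed).matches() || SCRIPT.matcher(trimmed).matches()) {
                valid.add(trimmed);
            } else {
                invalid.add(trimmed);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> valid = new LinkedHashSet<>();
        Set<String> invalid = new LinkedHashSet<>();
        bin("eng+chi_tra_vert+script/HanS_vert+fr", valid, invalid);
        System.out.println("valid:   " + valid);
        System.out.println("invalid: " + invalid);
    }
}
```

Here the two-letter code "fr" lands in the invalid set because it matches neither pattern, while the compound code and the script path are accepted.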
-
getPageSegMode
public String getPageSegMode()
- See Also:
setPageSegMode(String pageSegMode)
-
setPageSegMode
public void setPageSegMode(String pageSegMode)
Set tesseract page segmentation mode. Default is 1 = Automatic page segmentation with OSD (Orientation and Script Detection)
-
getPageSeparator
public String getPageSeparator()
- See Also:
setPageSeparator(String pageSeparator)
-
setPageSeparator
public void setPageSeparator(String pageSeparator)
The page separator to use in plain text output. This corresponds to Tesseract's page_separator config option. The default here is the empty string (i.e. no page separators). Note that this is also the default in Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character. We are overriding Tesseract 4.0's default here.
- Parameters:
pageSeparator
-
setTrustedPageSeparator
public void setTrustedPageSeparator(String pageSeparator)
Same as setPageSeparator(String) but does not perform any checks on the string.
- Parameters:
pageSeparator
-
isPreserveInterwordSpacing
public boolean isPreserveInterwordSpacing()
- Returns:
- whether or not to maintain interword spacing.
-
setPreserveInterwordSpacing
public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
Whether or not to maintain interword spacing. Default is false.
- Parameters:
preserveInterwordSpacing
-
getMinFileSizeToOcr
public long getMinFileSizeToOcr()
-
setMinFileSizeToOcr
public void setMinFileSizeToOcr(long minFileSizeToOcr)
Set minimum file size to submit file to ocr. Default is 0.
-
getMaxFileSizeToOcr
public long getMaxFileSizeToOcr()
-
setMaxFileSizeToOcr
public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
Set maximum file size to submit file to ocr. Default is Integer.MAX_VALUE.
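The minimum and maximum file size settings together define a window outside of which a file is not submitted to OCR. A minimal sketch of such a gate follows. The class and method names are hypothetical, and it assumes that sizes are measured in bytes and that both bounds are inclusive, neither of which is stated above.

```java
class OcrSizeGateSketch {
    // Documented defaults: minimum 0, maximum Integer.MAX_VALUE.
    static final long DEFAULT_MIN = 0;
    static final long DEFAULT_MAX = Integer.MAX_VALUE;

    // Assumption: sizes are in bytes and both bounds are inclusive.
    static boolean withinOcrSizeLimits(long size, long min, long max) {
        return size >= min && size <= max;
    }

    public static void main(String[] args) {
        long tenKb = 10L * 1024;
        long threeGb = 3L * 1024 * 1024 * 1024;
        // With the defaults, a 10 KB file passes but a 3 GB file does not.
        System.out.println(withinOcrSizeLimits(tenKb, DEFAULT_MIN, DEFAULT_MAX));
        System.out.println(withinOcrSizeLimits(threeGb, DEFAULT_MIN, DEFAULT_MAX));
    }
}
```

Note that although the setters take a long, the documented default maximum is Integer.MAX_VALUE, so files larger than about 2 GiB fall outside the window unless the maximum is raised.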
-
getTimeoutSeconds
public int getTimeoutSeconds()
- Returns:
- timeout value for Tesseract
- See Also:
setTimeoutSeconds(int timeout)
-
setTimeoutSeconds
public void setTimeoutSeconds(int timeoutSeconds)
Set the maximum time (in seconds) to wait for the OCR process to terminate. Default value is 120 seconds.
-
getOutputType
public TesseractOCRConfig.OUTPUT_TYPE getOutputType()
- See Also:
setOutputType(OUTPUT_TYPE outputType)
-
setOutputType
public void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)
Set the output type of the OCR process. The output can be "txt" or "hocr". Default value is TesseractOCRConfig.OUTPUT_TYPE.TXT.
-
setOutputType
public void setOutputType(String outputType)
-
isEnableImagePreprocessing
public boolean isEnableImagePreprocessing()
- Returns:
- image processing is enabled or not
- See Also:
setEnableImagePreprocessing(boolean)
-
setEnableImagePreprocessing
public void setEnableImagePreprocessing(boolean enableImagePreprocessing)
Set the value to true if processing is to be enabled. Default value is false.
-
getDensity
public int getDensity()
- Returns:
- the density
-
setDensity
public void setDensity(int density)
- Parameters:
density
- the density to set. Valid range of values is 150-1200. Default value is 300.
-
getDepth
public int getDepth()
- Returns:
- the depth
-
setDepth
public void setDepth(int depth)
- Parameters:
depth
- the depth to set. Valid values are 2, 4, 8, 16, 32, 64, 256, 4096. Default value is 4.
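The documented constraints on density (150-1200) and depth (a fixed set of values) can be checked before calling the setters. The following is a minimal sketch with hypothetical names; it assumes the density range is inclusive at both ends and does not claim to reproduce how the class itself reacts to an out-of-range value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ImageParamCheckSketch {
    // Valid depth values as listed in the setDepth documentation.
    private static final Set<Integer> VALID_DEPTHS =
            new HashSet<>(Arrays.asList(2, 4, 8, 16, 32, 64, 256, 4096));

    // Assumption: the 150-1200 range is inclusive at both ends.
    static boolean isValidDensity(int density) {
        return density >= 150 && density <= 1200;
    }

    static boolean isValidDepth(int depth) {
        return VALID_DEPTHS.contains(depth);
    }

    public static void main(String[] args) {
        System.out.println(isValidDensity(300)); // documented default density
        System.out.println(isValidDepth(4));     // documented default depth
        System.out.println(isValidDepth(128));   // not in the documented list
    }
}
```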
-
getColorspace
public String getColorspace()
- Returns:
- the colorspace
-
setColorspace
public void setColorspace(String colorspace)
- Parameters:
colorspace
- the colorspace to set. Default value is gray.
-
getFilter
public String getFilter()
- Returns:
- the filter
-
setFilter
public void setFilter(String filter)
- Parameters:
filter
- the filter to set. Valid values are point, hermite, cubic, box, gaussian, catrom, triangle, quadratic and mitchell. Default value is triangle.
-
isSkipOcr
public boolean isSkipOcr()
-
setSkipOcr
public void setSkipOcr(boolean skipOcr)
If you want to turn off OCR at run time for a specific file, set this to true.
- Parameters:
skipOcr
-
getResize
public int getResize()
- Returns:
- the resize
-
setResize
public void setResize(int resize)
- Parameters:
resize
- the resize to set. Valid range of values is 100-900. Default value is 900.
-
isApplyRotation
public boolean isApplyRotation()
- Returns:
- Whether or not a rotation value should be calculated and passed to ImageMagick before performing OCR.
-
setInlineContent
public void setInlineContent(boolean inlineContent)
-
isInlineContent
public boolean isInlineContent()
-
setApplyRotation
public void setApplyRotation(boolean applyRotation)
Sets whether or not a rotation value should be calculated and passed to ImageMagick.
- Parameters:
applyRotation
- true to calculate and apply rotation, false to skip. Default is false.
-
getOtherTesseractConfig
public Map<String,String> getOtherTesseractConfig()
- See Also:
addOtherTesseractConfig(String, String)
-
addOtherTesseractConfig
public void addOtherTesseractConfig(String key, String value)
Add a key-value pair to pass to Tesseract using its -c command line option. To see the possible options, run tesseract --print-parameters.
You may also add these parameters in TesseractOCRConfig.properties; any key-value pair in the properties file where the key contains an underscore is passed directly to Tesseract.
- Parameters:
key
value
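The underscore rule and the -c option described for addOtherTesseractConfig can be sketched as follows. This is an illustration under stated assumptions, not Tika's code: the class and method names are hypothetical, and the way the arguments are assembled is only a guess at how key-value pairs might be rendered as "-c key=value" on a Tesseract command line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class OtherConfigArgsSketch {
    // Keep only entries whose key contains an underscore, as the text describes.
    static Map<String, String> tesseractEntries(Properties props) {
        Map<String, String> result = new TreeMap<>(); // sorted for stable output
        for (String key : props.stringPropertyNames()) {
            if (key.contains("_")) {
                result.put(key, props.getProperty(key));
            }
        }
        return result;
    }

    // Render each pair as two arguments: "-c" followed by "key=value".
    static List<String> toArgs(Map<String, String> other) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : other.entrySet()) {
            args.add("-c");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("language", "eng"); // no underscore: not passed through
        props.setProperty("preserve_interword_spaces", "1");
        props.setProperty("tessedit_char_whitelist", "0123456789");
        System.out.println(toArgs(tesseractEntries(props)));
    }
}
```

In this sketch the "language" entry is left out of the -c arguments because its key has no underscore, while the two Tesseract parameters are passed through.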
-
cloneAndUpdate
public TesseractOCRConfig cloneAndUpdate(TesseractOCRConfig updates) throws TikaException
- Throws:
TikaException
-
-