public class TesseractOCRConfig extends Object implements Serializable
This class remembers every field that has been set through a set* method,
and on cloneAndUpdate(TesseractOCRConfig) it applies every field that was
ever set on the "update" config.
So, for example, if you change the language from "eng" to "fra" on an update
config and then, on another parse, set the depth to 5 on the same update object
while expecting the language to revert to "eng", you'll be wrong: the update
object still carries "fra".
Create a new update config for each parse unless you're only changing the
same field(s) with every parse.
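The pitfall described above can be illustrated with a minimal sketch. StickyConfigSketch is a hypothetical stand-in with only two fields, not the real TesseractOCRConfig, and its defaults ("eng" and a depth of 4) are assumed for illustration. It only mimics the documented behavior that setters are remembered and cloneAndUpdate applies everything ever set on the update object.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for TesseractOCRConfig, reduced to two fields.
// Assumption: setters record which fields were set, and cloneAndUpdate
// applies every recorded field from the "update" object.
final class StickyConfigSketch {
    private String language = "eng"; // assumed default for this sketch
    private int depth = 4;           // assumed default for this sketch
    private final Set<String> userSet = new HashSet<>();

    void setLanguage(String language) {
        this.language = language;
        userSet.add("language");
    }

    void setDepth(int depth) {
        this.depth = depth;
        userSet.add("depth");
    }

    String getLanguage() {
        return language;
    }

    int getDepth() {
        return depth;
    }

    // Returns a copy of this config with the update's recorded fields applied.
    StickyConfigSketch cloneAndUpdate(StickyConfigSketch updates) {
        StickyConfigSketch copy = new StickyConfigSketch();
        copy.language = this.language;
        copy.depth = this.depth;
        if (updates.userSet.contains("language")) {
            copy.language = updates.language;
        }
        if (updates.userSet.contains("depth")) {
            copy.depth = updates.depth;
        }
        return copy;
    }

    // Reuses one update object across two parses and returns the second result.
    static StickyConfigSketch reuseUpdateAcrossParses() {
        StickyConfigSketch defaults = new StickyConfigSketch();
        StickyConfigSketch update = new StickyConfigSketch();
        update.setLanguage("fra");
        defaults.cloneAndUpdate(update);        // first parse: language is "fra"
        update.setDepth(5);                     // second parse: only depth was intended
        return defaults.cloneAndUpdate(update); // language is still "fra"
    }
}
```

In this sketch, the config returned by reuseUpdateAcrossParses() has both the new depth and the earlier language, which is why a fresh update object per parse is the safer pattern.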
Modifier and Type | Class and Description |
---|---|
static class | TesseractOCRConfig.OUTPUT_TYPE |
Constructor and Description |
---|
TesseractOCRConfig() |
Modifier and Type | Method and Description |
---|---|
void | addOtherTesseractConfig(String key, String value) - Add a key-value pair to pass to Tesseract using its -c command line option. |
TesseractOCRConfig | cloneAndUpdate(TesseractOCRConfig updates) |
String | getColorspace() |
int | getDensity() |
int | getDepth() |
String | getFilter() |
static void | getLangs(String language, Set<String> validLangs, Set<String> invalidLangs) - Takes a language string, parses it, and bins the individual languages into valid or invalid based on regexes against the language codes. |
String | getLanguage() |
long | getMaxFileSizeToOcr() |
long | getMinFileSizeToOcr() |
Map<String,String> | getOtherTesseractConfig() |
TesseractOCRConfig.OUTPUT_TYPE | getOutputType() |
String | getPageSegMode() |
String | getPageSeparator() |
int | getResize() |
int | getTimeoutSeconds() |
boolean | isApplyRotation() |
boolean | isEnableImagePreprocessing() |
boolean | isPreserveInterwordSpacing() |
boolean | isSkipOcr() |
void | setApplyRotation(boolean applyRotation) - Sets whether or not a rotation value should be calculated and passed to ImageMagick. |
void | setColorspace(String colorspace) |
void | setDensity(int density) |
void | setDepth(int depth) |
void | setEnableImagePreprocessing(boolean enableImagePreprocessing) - Set the value to true if processing is to be enabled. |
void | setFilter(String filter) |
void | setLanguage(String languageString) - Set the Tesseract language dictionary to be used. |
void | setMaxFileSizeToOcr(long maxFileSizeToOcr) - Set the maximum size of a file that will be submitted to OCR. |
void | setMinFileSizeToOcr(long minFileSizeToOcr) - Set the minimum size of a file that will be submitted to OCR. |
void | setOutputType(String outputType) |
void | setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType) - Set the output type of the OCR process. |
void | setPageSegMode(String pageSegMode) - Set the Tesseract page segmentation mode. |
void | setPageSeparator(String pageSeparator) - The page separator to use in plain text output. |
void | setPreserveInterwordSpacing(boolean preserveInterwordSpacing) - Whether or not to maintain interword spacing. |
void | setResize(int resize) |
void | setSkipOcr(boolean skipOcr) - If you want to turn off OCR at run time for a specific file, set this to true. |
void | setTimeoutSeconds(int timeoutSeconds) - Set the maximum time (in seconds) to wait for the OCR process to terminate. |
void | setTrustedPageSeparator(String pageSeparator) - Same as setPageSeparator(String) but does not perform any checks on the string. |
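The getLangs entry in the summary above describes binning language codes with regexes. The following is a rough sketch of that idea only. LangBinningSketch, binLangs, and the LANG_CODE pattern are hypothetical names and assumptions made for illustration; the regexes that TesseractOCRConfig actually applies may differ. The sketch relies on the Tesseract convention of joining several languages with a plus sign, as in "eng+fra".

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class LangBinningSketch {
    // Assumed pattern: a three-letter code with an optional suffix such as "chi_sim".
    private static final Pattern LANG_CODE =
            Pattern.compile("[a-zA-Z]{3}(_[a-zA-Z]{3,4})?");

    // Splits the language string on '+' and sorts each code into one of the two sets.
    static void binLangs(String language, Set<String> validLangs, Set<String> invalidLangs) {
        if (language == null || language.trim().isEmpty()) {
            return;
        }
        // Remove embedded whitespace, then split on the '+' separator.
        for (String lang : language.replaceAll("\\s", "").split("\\+")) {
            if (lang.isEmpty()) {
                continue; // skip empty segments from a leading or doubled '+'
            }
            if (LANG_CODE.matcher(lang).matches()) {
                validLangs.add(lang);
            } else {
                invalidLangs.add(lang);
            }
        }
    }
}
```

Passing caller-owned sets, as the real method signature does, lets the caller decide how to report the invalid codes instead of failing on the first one.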
public static void getLangs(String language, Set<String> validLangs, Set<String> invalidLangs)
Parameters: language, validLangs, invalidLangs

public String getLanguage()
See also: setLanguage(String language)

public void setLanguage(String languageString)

public String getPageSegMode()
See also: setPageSegMode(String pageSegMode)

public void setPageSegMode(String pageSegMode)

public String getPageSeparator()
See also: setPageSeparator(String pageSeparator)

public void setPageSeparator(String pageSeparator)
Parameters: pageSeparator

public void setTrustedPageSeparator(String pageSeparator)
Same as setPageSeparator(String) but does not perform any checks on the string.
Parameters: pageSeparator

public boolean isPreserveInterwordSpacing()

public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
Whether or not to maintain interword spacing. Default is false.
Parameters: preserveInterwordSpacing

public long getMinFileSizeToOcr()

public void setMinFileSizeToOcr(long minFileSizeToOcr)

public long getMaxFileSizeToOcr()

public void setMaxFileSizeToOcr(long maxFileSizeToOcr)

public int getTimeoutSeconds()
See also: setTimeoutSeconds(int timeout)

public void setTimeoutSeconds(int timeoutSeconds)

public TesseractOCRConfig.OUTPUT_TYPE getOutputType()
See also: setOutputType(OUTPUT_TYPE outputType)

public void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)
Default is TesseractOCRConfig.OUTPUT_TYPE.TXT.

public void setOutputType(String outputType)

public boolean isEnableImagePreprocessing()
See also: setEnableImagePreprocessing(boolean)

public void setEnableImagePreprocessing(boolean enableImagePreprocessing)

public int getDensity()

public void setDensity(int density)
Parameters: density - the density to set. Valid range of values is 150-1200. Default value is 300.

public int getDepth()

public void setDepth(int depth)
Parameters: depth - the depth to set. Valid values are 2, 4, 8, 16, 32, 64, 256, 4096. Default value is 4.

public String getColorspace()

public void setColorspace(String colorspace)
Parameters: colorspace - the colorspace to set. Default value is gray.

public String getFilter()

public void setFilter(String filter)
Parameters: filter - the filter to set. Valid values are point, hermite, cubic, box, gaussian, catrom, triangle, quadratic and mitchell. Default value is triangle.

public boolean isSkipOcr()

public void setSkipOcr(boolean skipOcr)
If you want to turn off OCR at run time for a specific file, set this to true.
Parameters: skipOcr

public int getResize()

public void setResize(int resize)
Parameters: resize - the resize to set. Valid range of values is 100-900. Default value is 900.

public boolean isApplyRotation()

public void setApplyRotation(boolean applyRotation)
Parameters: applyRotation - true to calculate and apply rotation, false to skip. Default is false.

public Map<String,String> getOtherTesseractConfig()
See also: addOtherTesseractConfig(String, String)
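The valid values documented for setDensity, setDepth, setFilter, and setResize can be checked before they are handed to a config. The sketch below is a hypothetical pre-check, not Tika code. PreprocessingRangeSketch and its method names are invented for illustration, and it assumes the documented ranges are inclusive and that filter names are matched exactly in lower case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical validator mirroring the documented image preprocessing limits.
final class PreprocessingRangeSketch {
    private static final Set<Integer> DEPTHS =
            new HashSet<>(Arrays.asList(2, 4, 8, 16, 32, 64, 256, 4096));
    private static final Set<String> FILTERS = new HashSet<>(Arrays.asList(
            "point", "hermite", "cubic", "box", "gaussian",
            "catrom", "triangle", "quadratic", "mitchell"));

    // Density: documented range 150-1200 (assumed inclusive).
    static boolean isValidDensity(int density) {
        return density >= 150 && density <= 1200;
    }

    // Depth: must be one of the documented discrete values.
    static boolean isValidDepth(int depth) {
        return DEPTHS.contains(depth);
    }

    // Filter: must be one of the documented names (exact match assumed).
    static boolean isValidFilter(String filter) {
        return FILTERS.contains(filter);
    }

    // Resize: documented range 100-900 (assumed inclusive).
    static boolean isValidResize(int resize) {
        return resize >= 100 && resize <= 900;
    }
}
```

How the real setters react to an out-of-range value is not stated on this page, so a check like this is best treated as a guard in calling code.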
public void addOtherTesseractConfig(String key, String value)
Add a key-value pair to pass to Tesseract using its -c command line option. You may also add these parameters in TesseractOCRConfig.properties; any key-value pair in the properties file where the key contains an underscore is passed directly to Tesseract.
Parameters: key, value
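As a rough illustration of the pass-through rule for addOtherTesseractConfig, the sketch below selects the entries whose key contains an underscore and renders each as a -c key=value argument pair. OtherConfigArgsSketch and its two methods are hypothetical helpers written for this example; how Tika itself assembles the Tesseract command line is not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tika.
final class OtherConfigArgsSketch {
    // Keeps only the entries whose key contains an underscore.
    static Map<String, String> selectPassThrough(Map<String, String> properties) {
        Map<String, String> selected = new TreeMap<>(); // sorted for a stable order
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            if (entry.getKey().contains("_")) {
                selected.put(entry.getKey(), entry.getValue());
            }
        }
        return selected;
    }

    // Renders each pair as two arguments: "-c" followed by "key=value".
    static List<String> toCommandLineArgs(Map<String, String> otherConfig) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : otherConfig.entrySet()) {
            args.add("-c");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }
}
```

For example, a properties map holding language=eng and preserve_interword_spaces=1 would, under these assumptions, contribute only the second entry to the Tesseract arguments.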
public TesseractOCRConfig cloneAndUpdate(TesseractOCRConfig updates) throws TikaException
Throws: TikaException
Copyright © 2007–2023 The Apache Software Foundation. All rights reserved.