public class TesseractOCRParser extends AbstractExternalProcessParser implements Initializable
To configure this parser, create a TesseractOCRConfig object and pass it through a ParseContext. Tesseract-ocr must be installed and on the system path, or the path to its root folder must be provided:
TesseractOCRConfig config = new TesseractOCRConfig();
// Needed if tesseract is not on the system path
config.setTesseractPath(tesseractFolder);
parseContext.set(TesseractOCRConfig.class, config);
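The requirement above (the executable is either on the system path or in a folder you name) can be checked before any parsing is attempted. The following is a minimal sketch using only the Java standard library. `TesseractLocator` and its `find` method are hypothetical helpers, not part of Tika, and the rule that an explicit folder takes precedence over the system path is an assumption made for illustration.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper, not part of Tika: locates an executable by name. */
final class TesseractLocator {

    /**
     * Looks for progName in explicitFolder when one is given; otherwise walks
     * the directories of pathEnv, a PATH-style string.
     */
    static Optional<Path> find(String progName, String explicitFolder, String pathEnv) {
        if (explicitFolder != null && !explicitFolder.isEmpty()) {
            // Assumption: an explicit folder is authoritative, so PATH is not consulted.
            return existing(explicitFolder, progName);
        }
        if (pathEnv == null) {
            return Optional.empty();
        }
        for (String dir : pathEnv.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Optional<Path> hit = existing(dir, progName);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }

    /** Returns dir/progName if it is a regular file, otherwise empty. */
    private static Optional<Path> existing(String dir, String progName) {
        try {
            Path candidate = Paths.get(dir, progName);
            return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
        } catch (InvalidPathException e) {
            // Malformed PATH entries are skipped rather than treated as errors.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        String prog = windows ? "tesseract.exe" : "tesseract";
        System.out.println(find(prog, null, System.getenv("PATH"))
                .map(Path::toString)
                .orElse("tesseract not found on the system path"));
    }
}
```

If the lookup comes back empty, that is the situation in which the folder would have to be supplied explicitly, as in the snippet above.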
Modifier and Type | Field and Description |
---|---|
static Property | IMAGE_MAGICK |
static Property | IMAGE_ROTATION |
static Property | PSM0_ORIENTATION |
static Property | PSM0_ORIENTATION_CONFIDENCE |
static Property | PSM0_PAGE_NUMBER |
static Property | PSM0_ROTATE |
static Property | PSM0_SCRIPT |
static Property | PSM0_SCRIPT_CONFIDENCE |
static String | TESS_META |
Constructor and Description |
---|
TesseractOCRParser() |
Modifier and Type | Method and Description |
---|---|
void | checkInitialization(InitializableProblemHandler problemHandler) |
String | getColorspace() |
TesseractOCRConfig | getDefaultConfig() |
int | getDensity() |
int | getDepth() |
String | getFilter() |
String | getImageMagickPath() |
static String | getImageMagickProg() |
Set<String> | getLangs() |
String | getLanguage() |
long | getMaxFileSizeToOcr() |
long | getMinFileSizeToOcr() |
List<String> | getOtherTesseractSettings() |
String | getOutputType() |
String | getPageSegMode() |
int | getResize() |
Set<MediaType> | getSupportedTypes(ParseContext context): Returns the set of media types supported by this parser when used with the given parse context. |
String | getTessdataPath() |
String | getTesseractPath() |
static String | getTesseractProg() |
int | getTimeout() |
boolean | hasTesseract() |
protected boolean | hasWarned() |
void | initialize(Map<String,Param> params) |
boolean | isApplyRotation() |
boolean | isEnableImagePreprocessing() |
boolean | isPreloadLangs() |
boolean | isPreserveInterwordSpacing() |
boolean | isSkipOCR() |
void | parse(Image image, ContentHandler handler, Metadata metadata, ParseContext context) |
void | parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext parseContext): Parses a document stream into a sequence of XHTML SAX events. |
void | setApplyRotation(boolean applyRotation) |
void | setColorspace(String colorspace) |
void | setDensity(int density) |
void | setDepth(int depth) |
void | setEnableImagePreprocessing(boolean enableImagePreprocessing) |
void | setFilter(String filter) |
void | setImageMagickPath(String imageMagickPath): Set the path to the ImageMagick executable directory, needed if it is not on system path. |
void | setLanguage(String language) |
void | setMaxFileSizeToOcr(long maxFileSizeToOcr) |
void | setMinFileSizeToOcr(long minFileSizeToOcr) |
void | setOtherTesseractSettings(List<String> settings) |
void | setOutputType(String outputType) |
void | setPageSegMode(String pageSegMode) |
void | setPreloadLangs(boolean preloadLangs): If set to true and if tesseract is found, this will load the langs that result from --list-langs. |
void | setPreserveInterwordSpacing(boolean preserveInterwordSpacing) |
void | setResize(int resize) |
void | setSkipOCR(boolean skipOCR) |
void | setTessdataPath(String tessdataPath): Set the path to the 'tessdata' folder, which contains language files and config files. |
void | setTesseractPath(String tesseractPath): Set the path to the Tesseract executable's directory, needed if it is not on system path. |
void | setTimeout(int timeout): Set default timeout in seconds. |
protected void | warn() |
Inherited methods: register, release, parse
public static final String TESS_META
public static final Property IMAGE_ROTATION
public static final Property IMAGE_MAGICK
public static final Property PSM0_PAGE_NUMBER
public static final Property PSM0_ORIENTATION
public static final Property PSM0_ROTATE
public static final Property PSM0_ORIENTATION_CONFIDENCE
public static final Property PSM0_SCRIPT
public static final Property PSM0_SCRIPT_CONFIDENCE
public static String getImageMagickProg()
public static String getTesseractProg()
public Set<MediaType> getSupportedTypes(ParseContext context)
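Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
Specified by: getSupportedTypes in interface Parser
Parameters: context - parse context
public boolean hasTesseract() throws TikaConfigException
Throws: TikaConfigException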
public void parse(Image image, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Throws: IOException, SAXException, TikaException
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext parseContext) throws IOException, SAXException, TikaException
Description copied from interface: Parser
Parses a document stream into a sequence of XHTML SAX events.
The given document stream is consumed but not closed by this method. The caller remains responsible for closing the stream.
Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
Specified by: parse in interface Parser
Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
parseContext - parse context
Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
public void initialize(Map<String,Param> params) throws TikaConfigException
Specified by: initialize in interface Initializable
Parameters: params - params to use for initialization
Throws: TikaConfigException
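The stream contract stated for parse(InputStream, ...) above (the stream is consumed but not closed, and closing it is the caller's job) is the usual fit for try-with-resources. The sketch below illustrates that contract with standard-library classes only. `StreamOwnershipDemo`, `TrackingStream` and `consume` are hypothetical names; `consume` merely stands in for a parser's parse method and performs no OCR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical demo of the "consumed but not closed" stream contract. */
final class StreamOwnershipDemo {

    /** In-memory stream that remembers whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a parse method: reads the stream to the end and never closes it. */
    static int consume(InputStream stream) throws IOException {
        int total = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = stream.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
        // The caller owns the stream, so the caller's try-with-resources closes it.
        try (TrackingStream s = stream) {
            int bytes = consume(s);
            System.out.println("consumed " + bytes + " bytes, closed = " + s.closed);
        }
        System.out.println("after try-with-resources, closed = " + stream.closed);
    }
}
```

Running `main` prints `closed = false` inside the block and `closed = true` after it, which is the division of responsibility the contract describes.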
public void checkInitialization(InitializableProblemHandler problemHandler) throws TikaConfigException
Specified by: checkInitialization in interface Initializable
Parameters: problemHandler - if there is a problem and no custom initializableProblemHandler has been configured via Initializable parameters, this is called to respond.
Throws: TikaConfigException
protected boolean hasWarned()
protected void warn()
public String getTesseractPath()
@Field public void setTesseractPath(String tesseractPath)
Set the path to the Tesseract executable's directory, needed if it is not on system path.
Note that if you set this value, it is highly recommended that you also set the path to (and including) the 'tessdata' folder using setTessdataPath(java.lang.String).
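Because the tessdata path must point at (and include) the 'tessdata' folder itself, it can help to confirm that a candidate folder really contains language files before handing it to the parser. The sketch below lists the language codes found in a folder, relying on Tesseract's convention of naming language files `<lang>.traineddata`. `TessdataInspector` is a hypothetical helper, not part of Tika, and the directory in the usage note is only an example location.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of Tika: inspects a candidate tessdata folder. */
final class TessdataInspector {

    static final String SUFFIX = ".traineddata";

    /** Returns the sorted language codes of all *.traineddata files directly in tessdataDir. */
    static Set<String> languages(Path tessdataDir) throws IOException {
        Set<String> langs = new TreeSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tessdataDir, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Strip the suffix so "eng.traineddata" becomes "eng".
                langs.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        return langs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TessdataInspector <path-to-tessdata>");
            return;
        }
        System.out.println(languages(Paths.get(args[0])));
    }
}
```

An empty result for a folder such as `/usr/share/tessdata` (an example location only) suggests the path points at the wrong directory, for instance the parent of 'tessdata' rather than the folder itself.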
public String getTessdataPath()
@Field public void setTessdataPath(String tessdataPath)
public String getImageMagickPath()
@Field public void setImageMagickPath(String imageMagickPath)
Parameters: imageMagickPath - path to the ImageMagick executable directory.
@Field public void setOtherTesseractSettings(List<String> settings) throws TikaConfigException
Throws: TikaConfigException
@Field public void setSkipOCR(boolean skipOCR)
public boolean isSkipOCR()
public String getLanguage()
public String getPageSegMode()
@Field public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
public long getMaxFileSizeToOcr()
@Field public void setMinFileSizeToOcr(long minFileSizeToOcr)
public long getMinFileSizeToOcr()
@Field public void setTimeout(int timeout)
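Set default timeout in seconds. This can be overridden per parse with a TikaTaskTimeout sent in via the ParseContext at parse time.
Parameters: timeout - default timeout in seconds
public int getTimeout()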
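The timeout note above mentions a TikaTaskTimeout sent in via the ParseContext at parse time. Assuming such a per-parse value takes precedence over the parser's default, the resolution logic might look like the sketch below. Everything in it is a stand-in: `MiniContext` imitates a type-keyed parse context, `TimeoutOverride` imitates the per-parse value, and both use seconds purely for simplicity, since this page does not state the unit of the real class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of default-versus-override timeout resolution. */
final class TimeoutResolution {

    /** Stand-in for a per-parse timeout value; the unit (seconds) is an assumption. */
    static final class TimeoutOverride {
        final int seconds;

        TimeoutOverride(int seconds) {
            this.seconds = seconds;
        }
    }

    /** Minimal type-keyed container, imitating how a parse context carries objects. */
    static final class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null for a null value, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Uses the override from the context when present, otherwise the default. */
    static int effectiveTimeoutSeconds(int defaultSeconds, MiniContext context) {
        TimeoutOverride override = context.get(TimeoutOverride.class);
        return override != null ? override.seconds : defaultSeconds;
    }

    public static void main(String[] args) {
        MiniContext context = new MiniContext();
        // 120 and 30 are arbitrary illustration values, not library defaults.
        System.out.println(effectiveTimeoutSeconds(120, context)); // no override set
        context.set(TimeoutOverride.class, new TimeoutOverride(30));
        System.out.println(effectiveTimeoutSeconds(120, context)); // override present
    }
}
```

Keying the container by class mirrors the `parseContext.set(SomeType.class, value)` idiom used elsewhere on this page, which is why the override needs no separate name or key.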
public String getOutputType()
@Field public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
public boolean isPreserveInterwordSpacing()
@Field public void setEnableImagePreprocessing(boolean enableImagePreprocessing)
public boolean isEnableImagePreprocessing()
@Field public void setDensity(int density)
public int getDensity()
@Field public void setDepth(int depth)
public int getDepth()
public String getColorspace()
public String getFilter()
@Field public void setResize(int resize)
public int getResize()
@Field public void setApplyRotation(boolean applyRotation)
public boolean isApplyRotation()
@Field public void setPreloadLangs(boolean preloadLangs)
If set to true and if tesseract is found, this will load the langs that result from --list-langs. At parse time, the parser will verify that tesseract has the requested lang available.
If set to false (the default) and tesseract is found, a request for a language that tesseract has no data for causes a TikaException to be thrown with tesseract's native exception message, which is a bit less readable.
Parameters: preloadLangs - whether to load the available langs up front
public boolean isPreloadLangs()
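The preloading behaviour described above amounts to two steps: collect the languages reported by `--list-langs`, then check each requested language against that set. The sketch below illustrates the idea on plain strings. `LangCheck` is a hypothetical helper, not Tika's implementation. It assumes the listing starts with a header line beginning with "List of" followed by one language code per line, and that combined requests use Tesseract's `+` separator (for example `eng+fra`). The sample listing in `main` is illustrative text, not captured output.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Tika: validates requested OCR languages. */
final class LangCheck {

    /** Parses text in the assumed shape of a --list-langs listing. */
    static Set<String> parseListLangs(String output) {
        Set<String> langs = new LinkedHashSet<>();
        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            // Assumption: the header line starts with "List of" and is not a language code.
            if (trimmed.isEmpty() || trimmed.startsWith("List of")) {
                continue;
            }
            langs.add(trimmed);
        }
        return langs;
    }

    /** Returns the parts of a request such as "eng+fra" that are not in the available set. */
    static List<String> missing(String requested, Set<String> available) {
        List<String> result = new ArrayList<>();
        for (String lang : requested.split("\\+")) {
            if (!available.contains(lang)) {
                result.add(lang);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative listing only; real output may differ in wording.
        String sample = "List of available languages (2):\neng\nfra\n";
        Set<String> available = parseListLangs(sample);
        System.out.println("available: " + available);
        System.out.println("missing for eng+deu: " + missing("eng+deu", available));
    }
}
```

Checking up front like this is what allows a clearer error than the native tesseract message mentioned above, at the cost of one extra tesseract invocation to obtain the listing.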
public TesseractOCRConfig getDefaultConfig()
Copyright © 2007–2022 The Apache Software Foundation. All rights reserved.