Class Tess4JParser
java.lang.Object
org.apache.tika.parser.ocr.tess4j.Tess4JParser
- All Implemented Interfaces:
Serializable, Initializable, SelfConfiguring, Parser
OCR parser using Tess4J,
which provides a Java JNA wrapper around the native Tesseract library.
Unlike the command-line TesseractOCRParser, this parser calls Tesseract
in-process via JNA, eliminating the per-file process-spawn overhead.
Because the native Tesseract handle is not thread-safe, this parser
maintains a configurable pool of Tesseract instances. The pool size
is controlled by Tess4JConfig.setPoolSize(int).
Configuration key: "tess4j-parser"
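The pooling approach described above can be illustrated with a minimal sketch. This is not Tika's actual implementation: the HandlePool class and its withHandle and idleCount methods are hypothetical names introduced here, and a StringBuilder stands in for a non-thread-safe native handle. The idea shown is that a fixed number of instances are created up front, each caller borrows one exclusively, and the instance is returned to the pool when the work finishes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool of non-thread-safe handles (illustration only). */
final class HandlePool<T> {
    private final BlockingQueue<T> idle;

    HandlePool(int poolSize, Supplier<T> factory) {
        if (poolSize < 1) {
            throw new IllegalArgumentException("poolSize must be >= 1");
        }
        idle = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            idle.add(factory.get()); // create every handle up front
        }
    }

    /** Borrows one handle, runs the work on it, and always returns it to the pool. */
    <R> R withHandle(long timeout, TimeUnit unit, Function<T, R> work)
            throws InterruptedException, TimeoutException {
        T handle = idle.poll(timeout, unit); // waits until a handle is free
        if (handle == null) {
            throw new TimeoutException("no idle handle within timeout");
        }
        try {
            return work.apply(handle); // only this thread touches the handle
        } finally {
            idle.add(handle); // cannot overflow: a slot was freed by poll()
        }
    }

    int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) throws Exception {
        // StringBuilder is an assumed stand-in for a non-thread-safe OCR handle.
        HandlePool<StringBuilder> pool = new HandlePool<>(2, StringBuilder::new);
        Integer length = pool.withHandle(5, TimeUnit.SECONDS,
                sb -> sb.append("page text").length());
        System.out.println("result length: " + length
                + ", idle handles: " + pool.idleCount());
    }
}
```

With a design like this, the pool size bounds how many OCR operations can run concurrently, so a larger pool trades memory for throughput.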
- Since:
- Apache Tika 4.0
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
String getDataPath()
int getDpi()
String getLanguage()
long getMaxFileSizeToOcr()
long getMaxImagePixels()
long getMinFileSizeToOcr()
String getNativeLibPath()
int getOcrEngineMode()
int getPageSegMode()
int getPoolSize()
Set<MediaType> getSupportedTypes(ParseContext context) - Returns the set of media types supported by this parser when used with the given parse context.
int getTimeoutSeconds()
void initialize - Called after all properties have been set to allow for validation and initialization that depends on multiple properties.
boolean isInitialized() - Returns whether the parser has been successfully initialized (i.e., the Tess4J native library is available).
boolean isSkipOcr()
void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext) - Parses a document stream into a sequence of XHTML SAX events.
void setDataPath(String dataPath)
void setDpi(int dpi)
void setLanguage(String language)
void setMaxFileSizeToOcr(long maxFileSizeToOcr)
void setMaxImagePixels(long maxImagePixels)
void setMinFileSizeToOcr(long minFileSizeToOcr)
void setNativeLibPath(String nativeLibPath)
void setOcrEngineMode(int ocrEngineMode)
void setPageSegMode(int pageSegMode)
void setPoolSize(int poolSize)
void setSkipOcr(boolean skipOcr)
void setTimeoutSeconds(int timeoutSeconds)
-
Constructor Details
-
Tess4JParser
- Throws:
TikaConfigException
-
Tess4JParser
- Throws:
TikaConfigException
-
Tess4JParser
- Throws:
TikaConfigException
-
-
Method Details
-
getSupportedTypes
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Specified by:
getSupportedTypes in interface Parser
- Parameters:
context - parse context
- Returns:
immutable set of media types
-
parse
public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext) throws IOException, SAXException, TikaException
Description copied from interface: Parser
Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.
The given document stream is consumed but not closed by this method. The responsibility to close the stream remains with the caller.
Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
- Specified by:
parse in interface Parser
- Parameters:
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
parseContext - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
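The stream contract above (consumed but not closed, with the caller responsible for closing) can be sketched with plain JDK streams. Everything in this example is a stand-in introduced for illustration: consume plays the role of a parse-style method, and TrackingStream is a hypothetical helper that only records whether close() was called. No Tika classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamOwnershipDemo {

    /** Hypothetical stream that remembers whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a parse-style method: reads to the end, never closes. */
    static long consume(InputStream in) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total; // the stream is left open for the caller to close
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3, 4});
        // The caller owns the stream, so the caller wraps it in try-with-resources.
        try (InputStream in = stream) {
            long bytes = consume(in);
            System.out.println("bytes read: " + bytes
                    + ", closed by callee: " + stream.closed);
        }
        System.out.println("closed by caller: " + stream.closed);
    }
}
```

Applied to this parser, the same pattern would mean opening the TikaInputStream in the caller's try-with-resources block and passing it to parse inside that block.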
-
initialize
Description copied from interface: Initializable
Called after all properties have been set to allow for validation and initialization that depends on multiple properties.
- Specified by:
initialize in interface Initializable
- Throws:
TikaConfigException - if there is a problem with the configuration
-
getLanguage
-
setLanguage
-
getDataPath
-
setDataPath
- Throws:
TikaConfigException
-
getPageSegMode
public int getPageSegMode() -
setPageSegMode
public void setPageSegMode(int pageSegMode) -
getOcrEngineMode
public int getOcrEngineMode() -
setOcrEngineMode
public void setOcrEngineMode(int ocrEngineMode) -
getMaxFileSizeToOcr
public long getMaxFileSizeToOcr() -
setMaxFileSizeToOcr
public void setMaxFileSizeToOcr(long maxFileSizeToOcr) -
getMinFileSizeToOcr
public long getMinFileSizeToOcr() -
setMinFileSizeToOcr
public void setMinFileSizeToOcr(long minFileSizeToOcr) -
getPoolSize
public int getPoolSize() -
setPoolSize
public void setPoolSize(int poolSize) -
getTimeoutSeconds
public int getTimeoutSeconds() -
setTimeoutSeconds
public void setTimeoutSeconds(int timeoutSeconds) -
isSkipOcr
public boolean isSkipOcr() -
setSkipOcr
public void setSkipOcr(boolean skipOcr) -
getDpi
public int getDpi() -
setDpi
public void setDpi(int dpi) -
getNativeLibPath
-
setNativeLibPath
- Throws:
TikaConfigException
-
getMaxImagePixels
public long getMaxImagePixels() -
setMaxImagePixels
public void setMaxImagePixels(long maxImagePixels) -
isInitialized
public boolean isInitialized()
Returns whether the parser has been successfully initialized (i.e., the Tess4J native library is available).
-