Class Tess4JParser

java.lang.Object
org.apache.tika.parser.ocr.tess4j.Tess4JParser
All Implemented Interfaces:
Serializable, Initializable, SelfConfiguring, Parser

public class Tess4JParser extends Object implements Parser, Initializable
OCR parser using Tess4J, which provides a Java JNA wrapper around the native Tesseract library.

Unlike the command-line TesseractOCRParser, this parser calls Tesseract in-process via JNA, eliminating the per-file process-spawn overhead.

Because the native Tesseract handle is not thread-safe, this parser maintains a configurable pool of Tesseract instances. The pool size is controlled by Tess4JConfig.setPoolSize(int).
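
The pooling idea described above can be sketched with a bounded blocking queue. This is only an illustration using the standard library, not the parser's actual implementation: `HandlePool`, `withHandle`, and `idleCount` are hypothetical names, and the generic handle type `T` stands in for whatever native Tesseract wrapper the parser really holds.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool of non-thread-safe handles; not part of Tika or Tess4J. */
final class HandlePool<T> {
    private final BlockingQueue<T> idle;

    HandlePool(int poolSize, Supplier<T> factory) {
        if (poolSize < 1) {
            throw new IllegalArgumentException("poolSize must be >= 1");
        }
        idle = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows a handle, runs the task with exclusive access, and always returns the handle. */
    <R> R withHandle(long timeout, TimeUnit unit, Function<T, R> task) {
        T handle;
        try {
            // Blocks until a handle is idle or the timeout elapses.
            handle = idle.poll(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a handle", e);
        }
        if (handle == null) {
            throw new IllegalStateException("no idle handle within timeout");
        }
        try {
            return task.apply(handle);
        } finally {
            // Give the handle back even if the task threw.
            idle.add(handle);
        }
    }

    /** Number of handles currently available for borrowing. */
    int idleCount() {
        return idle.size();
    }
}
```

With such a design, at most `poolSize` tasks run concurrently and each handle is used by one thread at a time, which is the property a non-thread-safe native handle requires.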

Configuration key: "tess4j-parser"

Since:
Apache Tika 4.0
  • Constructor Details

  • Method Details

    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Description copied from interface: Parser
      Returns the set of media types supported by this parser when used with the given parse context.
      Specified by:
      getSupportedTypes in interface Parser
      Parameters:
      context - parse context
      Returns:
      immutable set of media types
    • parse

      public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext) throws IOException, SAXException, TikaException
      Description copied from interface: Parser
      Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.

      The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.

      Information about the parsing context can be passed in the parseContext parameter. See the parser implementations for the kinds of context information they expect.

      Specified by:
      parse in interface Parser
      Parameters:
      tis - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      parseContext - parse context
      Throws:
      IOException - if the document stream could not be read
      SAXException - if the SAX events could not be processed
      TikaException - if the document could not be parsed
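
      The stream-ownership rule above (the stream is consumed but not closed) can be illustrated with a small standard-library sketch. It does not call Tika: `StreamOwnershipSketch`, `consume`, and `parseAndClose` are hypothetical names, with `consume` standing in for a parse call that reads the stream to its end and leaves it open.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration of the "consumed but not closed" contract; not Tika code. */
final class StreamOwnershipSketch {

    /** Stand-in for a parse call: reads the stream to the end but never closes it. */
    static long consume(InputStream in) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        return total;
    }

    /** The caller owns the stream, so the caller closes it, here via try-with-resources. */
    static long parseAndClose(InputStream source) throws IOException {
        try (InputStream in = source) {
            return consume(in);
        }
    }
}
```

      Wrapping the call in try-with-resources on the caller's side ensures the stream is released even when the parse step throws.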
    • initialize

      public void initialize() throws TikaConfigException
      Description copied from interface: Initializable
      Called after all properties have been set to allow for validation and initialization that depends on multiple properties.
      Specified by:
      initialize in interface Initializable
      Throws:
      TikaConfigException - if there is a problem with the configuration
    • getLanguage

      public String getLanguage()
    • setLanguage

      public void setLanguage(String language)
    • getDataPath

      public String getDataPath()
    • setDataPath

      public void setDataPath(String dataPath) throws TikaConfigException
      Throws:
      TikaConfigException
    • getPageSegMode

      public int getPageSegMode()
    • setPageSegMode

      public void setPageSegMode(int pageSegMode)
    • getOcrEngineMode

      public int getOcrEngineMode()
    • setOcrEngineMode

      public void setOcrEngineMode(int ocrEngineMode)
    • getMaxFileSizeToOcr

      public long getMaxFileSizeToOcr()
    • setMaxFileSizeToOcr

      public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
    • getMinFileSizeToOcr

      public long getMinFileSizeToOcr()
    • setMinFileSizeToOcr

      public void setMinFileSizeToOcr(long minFileSizeToOcr)
    • getPoolSize

      public int getPoolSize()
    • setPoolSize

      public void setPoolSize(int poolSize)
    • getTimeoutSeconds

      public int getTimeoutSeconds()
    • setTimeoutSeconds

      public void setTimeoutSeconds(int timeoutSeconds)
    • isSkipOcr

      public boolean isSkipOcr()
    • setSkipOcr

      public void setSkipOcr(boolean skipOcr)
    • getDpi

      public int getDpi()
    • setDpi

      public void setDpi(int dpi)
    • getNativeLibPath

      public String getNativeLibPath()
    • setNativeLibPath

      public void setNativeLibPath(String nativeLibPath) throws TikaConfigException
      Throws:
      TikaConfigException
    • getMaxImagePixels

      public long getMaxImagePixels()
    • setMaxImagePixels

      public void setMaxImagePixels(long maxImagePixels)
    • isInitialized

      public boolean isInitialized()
      Returns whether the parser has been successfully initialized, i.e. whether the Tess4J native library is available.