Package org.apache.tika.parser.ocrencode
Class EncodeOCRConfig
java.lang.Object
org.apache.tika.parser.ocrencode.EncodeOCRConfig
- All Implemented Interfaces:
Serializable
Configuration for EncodeOCRParser. This parser base64-encodes image
bytes into the XHTML output instead of running OCR text extraction locally,
so the size and count limits below govern which images are accepted for
encoding, not text recognition.
The *Ocr field and setter names are retained to keep the
tika-config/JSON parameter names stable; read "Ocr" in those names as "OCR-encode".
This class is not thread-safe; callers that share an instance across threads must synchronize access externally.
- See Also:
-
Field Summary
Fields
Modifier and Type    Field
static final long    DEFAULT_MAX_FILE_SIZE_TO_OCR
-
Constructor Summary
Constructors
Constructor
EncodeOCRConfig()
-
Method Summary
Modifier and Type    Method    Description
long       getMaxFileSizeToOcr()
int        getMaxImagesToOcr()
long       getMinFileSizeToOcr()
boolean    isInlineContent()
boolean    isSkipOcr()
void       setInlineContent(boolean inlineContent)
void       setMaxFileSizeToOcr(long maxFileSizeToOcr)    Sets the maximum image size (in bytes) accepted for base64 encoding.
void       setMaxImagesToOcr(int maxImagesToOcr)    Sets the maximum number of images to base64-encode per parse (across the whole document, tracked via ParseContext).
void       setMinFileSizeToOcr(long minFileSizeToOcr)    Sets the minimum image size (in bytes) accepted for base64 encoding.
void       setSkipOcr(boolean skipOcr)    If set to true, disables base64 encoding at runtime: the parser reports no supported types and parse() is a no-op.
-
Field Details
-
DEFAULT_MAX_FILE_SIZE_TO_OCR
public static final long DEFAULT_MAX_FILE_SIZE_TO_OCR
- See Also:
-
-
Constructor Details
-
EncodeOCRConfig
public EncodeOCRConfig()
-
-
Method Details
-
setInlineContent
public void setInlineContent(boolean inlineContent) -
isInlineContent
public boolean isInlineContent() -
getMinFileSizeToOcr
public long getMinFileSizeToOcr() -
setMinFileSizeToOcr
public void setMinFileSizeToOcr(long minFileSizeToOcr)
Sets the minimum image size (in bytes) accepted for base64 encoding. Images smaller than this are skipped. Default is 0 (no lower bound).
-
getMaxFileSizeToOcr
public long getMaxFileSizeToOcr() -
setMaxFileSizeToOcr
public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
Sets the maximum image size (in bytes) accepted for base64 encoding. Images larger than this are skipped. Default is 104857600 bytes (100 MiB).
-
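The two size limits together define a window of accepted image sizes. The following is a minimal sketch of that check, not the parser's actual code: SizeWindowSketch and accepts are hypothetical names, and treating both bounds as inclusive is an assumption based on the wording "smaller than" and "larger than" above.

```java
// Stand-in for illustration only; this is NOT Tika's EncodeOCRConfig.
final class SizeWindowSketch {
    private final long minBytes; // mirrors minFileSizeToOcr (documented default: 0)
    private final long maxBytes; // mirrors maxFileSizeToOcr (documented default: 104857600)

    SizeWindowSketch(long minBytes, long maxBytes) {
        this.minBytes = minBytes;
        this.maxBytes = maxBytes;
    }

    /**
     * Returns true when the image is neither smaller than the minimum
     * nor larger than the maximum. Inclusive bounds are an assumption.
     */
    boolean accepts(long imageBytes) {
        return imageBytes >= minBytes && imageBytes <= maxBytes;
    }
}
```

With the documented defaults, an image of exactly 104857600 bytes would pass this check and one of 104857601 bytes would be skipped.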
isSkipOcr
public boolean isSkipOcr() -
setSkipOcr
public void setSkipOcr(boolean skipOcr)
If set to true, disables base64 encoding at runtime: the parser reports no supported types and parse() is a no-op. Use this to turn the parser off for a specific file without rewiring tika-config.
- Parameters:
skipOcr - if true, base64 encoding is disabled
-
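The effect of the skip flag described for setSkipOcr can be pictured with a small stand-in. This is a sketch under assumptions, not the EncodeOCRParser implementation: SkipFlagSketch, its media-type list, and the Optional return value are all hypothetical and exist only to show "no supported types" and "parse is a no-op".

```java
import java.util.Base64;
import java.util.Collections;
import java.util.Optional;
import java.util.Set;

// Stand-in for illustration only; NOT Tika's EncodeOCRParser.
final class SkipFlagSketch {
    // Hypothetical list of encodable types, chosen for the example.
    private static final Set<String> ENCODABLE = Set.of("image/png", "image/jpeg");

    private boolean skip; // mirrors skipOcr

    void setSkip(boolean skip) {
        this.skip = skip;
    }

    /** With the flag set, the parser advertises nothing, so it is never selected. */
    Set<String> supportedTypes() {
        return skip ? Collections.emptySet() : ENCODABLE;
    }

    /** Returns the base64 text, or empty when the type is not (or no longer) supported. */
    Optional<String> parse(String mediaType, byte[] imageBytes) {
        if (!supportedTypes().contains(mediaType)) {
            return Optional.empty(); // no-op path
        }
        return Optional.of(Base64.getEncoder().encodeToString(imageBytes));
    }
}
```

Because the flag is consulted on every call in this sketch, flipping it affects the next parse without any change to the configured parser chain.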
getMaxImagesToOcr
public int getMaxImagesToOcr() -
setMaxImagesToOcr
public void setMaxImagesToOcr(int maxImagesToOcr)
Sets the maximum number of images to base64-encode per parse (across the whole document, tracked via ParseContext). Further images beyond this count are skipped. Default is 50.
- Parameters:
maxImagesToOcr - maximum number of images to encode; must be >= 0
-
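The per-document image limit for setMaxImagesToOcr implies a counter that outlives any single embedded-image parse. One way this could work is sketched below, assuming a counter object stored in a per-parse context keyed by class. The Context class here is a simplified stand-in, not Tika's ParseContext, and Counter and tryClaim are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for illustration only; NOT Tika's ParseContext or EncodeOCRParser.
final class ImageBudgetSketch {

    /** Mutable count shared by every embedded-image parse within one document. */
    static final class Counter {
        int used;
    }

    /** Simplified per-parse context that stores one object per class key. */
    static final class Context {
        private final Map<Class<?>, Object> slots = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            slots.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(slots.get(key));
        }
    }

    /** Claims one slot of the budget; returns false once maxImages have been claimed. */
    static boolean tryClaim(Context ctx, int maxImages) {
        Counter counter = ctx.get(Counter.class);
        if (counter == null) {
            // First image of this document: create the shared counter.
            counter = new Counter();
            ctx.set(Counter.class, counter);
        }
        if (counter.used >= maxImages) {
            return false; // budget exhausted, skip this image
        }
        counter.used++;
        return true;
    }
}
```

Keeping the counter in the context rather than in the configuration object means each top-level parse starts with a fresh budget, which matches the "per parse" wording above.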