Class OptimaizeLangDetector
java.lang.Object
org.apache.tika.language.detect.LanguageDetector
org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
Implementation of the LanguageDetector API that uses the Optimaize language-detector library (https://github.com/optimaize/language-detector).
-
Field Summary
Modifier and Type, Field
static final int DEFAULT_MAX_CHARS_FOR_DETECTION
static final int DEFAULT_MAX_CHARS_FOR_SHORT_DETECTION
Fields inherited from class org.apache.tika.language.detect.LanguageDetector
mixedLanguages, shortText
-
Constructor Summary
OptimaizeLangDetector()
OptimaizeLangDetector(int maxCharsForDetection)
-
Method Summary
Modifier and Type, Method, Description
void addText(char[] cbuf, int off, int len): Add statistics about this text for the current document.
detectAll: Detect languages based on previously submitted text (via addText calls).
boolean hasEnoughText(): Tell the caller whether more text is required for the current document before the language can be reliably detected.
boolean hasModel: Provide information about whether a model exists for a specific language.
loadModels(): Load (or re-load) all available language models.
loadModels(Set<String> languages): Load (or re-load) the models specified in languages.
void reset(): Reset statistics about the current document being processed.
setPriors: Set the a-priori probabilities for these languages.
Methods inherited from class org.apache.tika.language.detect.LanguageDetector
addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, isMixedLanguages, isShortText, setMixedLanguages, setShortText
-
Field Details
-
DEFAULT_MAX_CHARS_FOR_DETECTION
public static final int DEFAULT_MAX_CHARS_FOR_DETECTION
-
DEFAULT_MAX_CHARS_FOR_SHORT_DETECTION
public static final int DEFAULT_MAX_CHARS_FOR_SHORT_DETECTION
-
-
Constructor Details
-
OptimaizeLangDetector
public OptimaizeLangDetector()
-
OptimaizeLangDetector
public OptimaizeLangDetector(int maxCharsForDetection)
-
-
Method Details
-
loadModels
Description copied from class: LanguageDetector
Load (or re-load) all available language models. This must be called after any settings that would impact the models being loaded (e.g. mixed language/short text), but before any of the document processing routines (below) are called. Note that it only needs to be called once.
- Specified by: loadModels in class LanguageDetector
- Returns: this
-
loadModels
Description copied from class: LanguageDetector
Load (or re-load) the models specified in languages. These use the ISO 639-1 names, with an optional "-<country code>" suffix for a more specific variant (e.g. "zh-CN" for Chinese in China).
- Specified by: loadModels in class LanguageDetector
- Parameters: languages - list of target languages.
- Returns: this
- Throws:
IOException
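The language names described above have the same shape as simple BCP 47 tags, so a caller can split a tag into its parts before building the set passed to loadModels. The following is a minimal sketch that uses only java.util.Locale from the JDK; the LanguageTags class and its method names are hypothetical and are not part of Tika or Optimaize.

```java
import java.util.Locale;

class LanguageTags {
    /** Returns the language part of a tag such as "zh-CN" (here "zh"). */
    static String languageOf(String tag) {
        return Locale.forLanguageTag(tag).getLanguage();
    }

    /** Returns the country part of a tag, or "" when the tag has none. */
    static String countryOf(String tag) {
        return Locale.forLanguageTag(tag).getCountry();
    }

    public static void main(String[] args) {
        // A tag with a country suffix and a bare ISO 639-1 name.
        System.out.println(languageOf("zh-CN") + " / " + countryOf("zh-CN"));
        System.out.println(languageOf("en") + " / " + countryOf("en"));
    }
}
```

Whether a given tag is accepted still depends on the models that are available; hasModel is the documented way to check that.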
-
hasModel
Description copied from class: LanguageDetector
Provide information about whether a model exists for a specific language.
- Specified by: hasModel in class LanguageDetector
- Parameters: language - ISO 639-1 name for language
- Returns: true if a model for this language exists.
-
setPriors
Description copied from class: LanguageDetector
Set the a-priori probabilities for these languages. The provided map uses the language as the key and, as the value, the probability (0.0 < probability < 1.0) of text being in that language. Note that if the probabilities don't sum to 1.0, these values will be normalized.
If hasModel() returns false for any of the languages, an IllegalArgumentException is thrown.
Use of these probabilities is detector-specific, and thus might not impact the results at all. As such, these should be viewed as a hint.
- Specified by: setPriors in class LanguageDetector
- Parameters: languageProbabilities - Map from language to probability
- Returns: this
- Throws:
IOException
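To make the normalization note concrete, the sketch below rescales a map of weights so that the values sum to 1.0. It illustrates the arithmetic only and is not the Tika or Optimaize implementation; the PriorsSketch class, the normalize method, and the choice of Float as the value type are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PriorsSketch {
    /** Returns a copy of the map with every value divided by the total. */
    static Map<String, Float> normalize(Map<String, Float> priors) {
        float sum = 0f;
        for (float p : priors.values()) {
            if (p <= 0f) {
                throw new IllegalArgumentException("probability must be positive");
            }
            sum += p;
        }
        Map<String, Float> out = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : priors.entrySet()) {
            out.put(e.getKey(), e.getValue() / sum);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> priors = new LinkedHashMap<>();
        priors.put("en", 3f); // weights that do not sum to 1.0
        priors.put("fr", 1f);
        System.out.println(normalize(priors)); // en gets 3/4, fr gets 1/4
    }
}
```

Because the documentation describes priors as a hint, a caller should not rely on the normalized values changing the detection result.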
-
reset
public void reset()
Description copied from class: LanguageDetector
Reset statistics about the current document being processed.
- Specified by: reset in class LanguageDetector
-
addText
public void addText(char[] cbuf, int off, int len)
Description copied from class: LanguageDetector
Add statistics about this text for the current document. Note that we assume an implicit word break exists before/after each of these runs of text.
- Specified by: addText in class LanguageDetector
- Parameters:
cbuf - Character buffer
off - Offset into cbuf to first character in the run of text
len - Number of characters in the run of text.
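The off/len convention and the implicit word break can be illustrated without a detector. In the sketch below, RunCollector is a hypothetical stand-in that has the same addText parameter shape as documented above; it only records each run and joins the runs with a space, which is one way to picture the word break between calls. It does not reproduce what OptimaizeLangDetector does internally.

```java
import java.util.ArrayList;
import java.util.List;

class RunCollector {
    private final List<String> runs = new ArrayList<>();

    /** Records the len characters of cbuf that start at index off. */
    public void addText(char[] cbuf, int off, int len) {
        runs.add(new String(cbuf, off, len));
    }

    /** Joins the recorded runs with a space, modelling the implicit word break. */
    public String text() {
        return String.join(" ", runs);
    }

    public static void main(String[] args) {
        char[] padded = "xxBonjourxx".toCharArray();
        RunCollector collector = new RunCollector();
        collector.addText(padded, 2, 7);                     // only "Bonjour"
        collector.addText("le monde".toCharArray(), 0, 8);   // the whole buffer
        System.out.println(collector.text());
    }
}
```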
-
detectAll
Detect languages based on previously submitted text (via addText calls).
- Specified by: detectAll in class LanguageDetector
- Returns: the detected list of languages
- Throws:
IllegalStateException - if no models have been loaded with loadModels() or loadModels(java.util.Set)
-
hasEnoughText
public boolean hasEnoughText()
Description copied from class: LanguageDetector
Tell the caller whether more text is required for the current document before the language can be reliably detected.
Implementations can override this to do early termination of stats collection, which can improve performance with longer documents.
Note that detect() can be called even when this returns false.
- Overrides: hasEnoughText in class LanguageDetector
- Returns: true if we have enough text for reliable detection.
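The early-termination pattern described above might look like the following on the caller's side: read a document in chunks and stop as soon as hasEnoughText() reports true. This is a sketch under stated assumptions. The TextSink interface is a hypothetical stand-in for the two detector methods involved, so that the example compiles without Tika on the classpath, and limitSink is a toy implementation with an arbitrary threshold.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

class EarlyStopSketch {
    /** Hypothetical stand-in for the two detector methods used here. */
    interface TextSink {
        void addText(char[] cbuf, int off, int len);
        boolean hasEnoughText();
    }

    /** Feeds the reader to the sink in chunks; returns the number of chars consumed. */
    static int feed(Reader in, TextSink sink, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] buf = new char[chunkSize];
        int total = 0;
        int n;
        try {
            // Stop at end of input or as soon as the sink has enough text.
            while (!sink.hasEnoughText() && (n = in.read(buf, 0, buf.length)) != -1) {
                sink.addText(buf, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Toy sink that reports "enough" once it has seen at least limit characters. */
    static TextSink limitSink(int limit) {
        return new TextSink() {
            private int seen = 0;
            public void addText(char[] cbuf, int off, int len) { seen += len; }
            public boolean hasEnoughText() { return seen >= limit; }
        };
    }

    public static void main(String[] args) {
        char[] data = new char[1000];
        Arrays.fill(data, 'a');
        int used = feed(new StringReader(new String(data)), limitSink(100), 64);
        System.out.println("consumed " + used + " of " + data.length + " chars");
    }
}
```

With a real detector in place of the toy sink, the remaining input is simply never read, which is where the performance gain for longer documents comes from.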
-