Package org.apache.tika.langdetect.tika
Class TikaLanguageDetector
- java.lang.Object
-
- org.apache.tika.language.detect.LanguageDetector
-
- org.apache.tika.langdetect.tika.TikaLanguageDetector
-
public class TikaLanguageDetector extends LanguageDetector
This is Tika's original, homegrown legacy language detector. As currently implemented, it computes the vector distance between the trigrams of the input string and those of each language model. Because it works only on trigrams, it is not suitable for short texts.
There are better performing language detectors. This module is still here in the hopes that we'll get around to improving it, because it is elegant and could be fairly trivially improved.
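The trigram comparison described above can be illustrated with a small, self-contained sketch. This is not the code of this class: the class name TrigramSketch, its method names, and the choice of cosine distance are assumptions made for illustration, since this page does not say which distance measure the detector uses.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Tika.
final class TrigramSketch {

    // Count every overlapping three-character window in the text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine distance between two trigram count vectors (an assumed measure):
    // close to 0.0 for identical profiles, 1.0 when no trigram is shared.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += v * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // an empty profile carries no evidence
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }
}
```

In this sketch a two-character input such as "hi" produces an empty profile, so its distance to every model is the same. That is one way to see why a trigram-only approach struggles with short texts.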
-
Field Summary
-
Fields inherited from class org.apache.tika.language.detect.LanguageDetector
mixedLanguages, shortText
-
Constructor Summary
Constructors
- TikaLanguageDetector()
-
Method Summary
Modifier and Type, Method, Description
- void addText(char[] cbuf, int off, int len): Add statistics about this text for the current document.
- List<LanguageResult> detectAll(): Detect languages based on previously submitted text (via addText calls).
- boolean hasModel(String language): Provide information about whether a model exists for a specific language.
- LanguageDetector loadModels(): Load (or re-load) all available language models.
- LanguageDetector loadModels(Set<String> languages): Load (or re-load) the models specified in languages.
- void reset(): Reset statistics about the current document being processed.
- LanguageDetector setPriors(Map<String,Float> languageProbabilities): Not supported.
Methods inherited from class org.apache.tika.language.detect.LanguageDetector
addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, hasEnoughText, isMixedLanguages, isShortText, setMixedLanguages, setShortText
-
Method Detail
-
loadModels
public LanguageDetector loadModels() throws IOException
Description copied from class: LanguageDetector
Load (or re-load) all available language models. This must be called after any settings that would impact the models being loaded (e.g. mixed language/short text), but before any of the document processing routines (below) are called. Note that it only needs to be called once.
- Specified by: loadModels in class LanguageDetector
- Returns: this
- Throws: IOException
-
loadModels
public LanguageDetector loadModels(Set<String> languages) throws IOException
Description copied from class: LanguageDetector
Load (or re-load) the models specified in languages. These use the ISO 639-1 names, with an optional "-" and country code suffix for a more specific designation (e.g. "zh-CN" for Chinese in China).
- Specified by: loadModels in class LanguageDetector
- Parameters: languages - list of target languages.
- Returns: this
- Throws: IOException
-
hasModel
public boolean hasModel(String language)
Description copied from class: LanguageDetector
Provide information about whether a model exists for a specific language.
- Specified by: hasModel in class LanguageDetector
- Parameters: language - ISO 639-1 name for language
- Returns: true if a model for this language exists.
-
setPriors
public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
Not supported.
- Specified by: setPriors in class LanguageDetector
- Parameters: languageProbabilities - Map from language to probability
- Returns:
- Throws: IOException
-
reset
public void reset()
Description copied from class: LanguageDetector
Reset statistics about the current document being processed.
- Specified by: reset in class LanguageDetector
-
addText
public void addText(char[] cbuf, int off, int len)
Description copied from class: LanguageDetector
Add statistics about this text for the current document. Note that we assume an implicit word break exists before/after each of these runs of text.
- Specified by: addText in class LanguageDetector
- Parameters:
  cbuf - Character buffer
  off - Offset into cbuf to first character in the run of text
  len - Number of characters in the run of text.
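The (cbuf, off, len) contract and the implicit word break around each run can be pictured with the following stand-in. RunBuffer and contents() are hypothetical names, and modelling the word break as a space is an assumption made for illustration. It shows the calling convention only, not how this class stores its statistics.

```java
// Hypothetical stand-in that mirrors the (cbuf, off, len) calling convention.
final class RunBuffer {
    private final StringBuilder text = new StringBuilder();

    // Accept one run of text; only cbuf[off] .. cbuf[off + len - 1] is read.
    void addText(char[] cbuf, int off, int len) {
        if (off < 0 || len < 0 || len > cbuf.length - off) {
            throw new IndexOutOfBoundsException("off=" + off + ", len=" + len);
        }
        // Model the implicit word break before and after each run as a space,
        // so that characters from adjacent runs never join into one word.
        text.append(' ').append(cbuf, off, len).append(' ');
    }

    // Everything collected so far, with the modelled word breaks included.
    String contents() {
        return text.toString();
    }
}
```

With this model, submitting "hello" and then "world" as separate runs keeps them apart, so no character sequence such as "low" spans the boundary between the two runs.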
-
detectAll
public List<LanguageResult> detectAll()
Description copied from class: LanguageDetector
Detect languages based on previously submitted text (via addText calls).
- Specified by: detectAll in class LanguageDetector
- Returns: list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest. There will always be at least one result, which might have a confidence of NONE.
-