Package org.apache.tika.language.detect
Class LanguageDetector
- java.lang.Object
- org.apache.tika.language.detect.LanguageDetector
 
- Direct Known Subclasses:
- Lingo24LangDetector, OpenNLPDetector, OptimaizeLangDetector, TextLangDetector, TikaLanguageDetector
 
 public abstract class LanguageDetector extends Object 

Field Summary
- protected boolean mixedLanguages
- protected boolean shortText

Constructor Summary
- LanguageDetector()

Method Summary
- abstract void addText(char[] cbuf, int off, int len): Add statistics about this text for the current document.
- void addText(CharSequence text): Add text to the statistics being accumulated for the current document.
- LanguageResult detect()
- LanguageResult detect(CharSequence text)
- abstract List<LanguageResult> detectAll(): Detect languages based on previously submitted text (via addText calls).
- List<LanguageResult> detectAll(String text): Utility wrapper that detects the language of a given chunk of text.
- static LanguageDetector getDefaultLanguageDetector()
- static List<LanguageDetector> getLanguageDetectors()
- static List<LanguageDetector> getLanguageDetectors(ServiceLoader loader)
- boolean hasEnoughText(): Tell the caller whether more text is required for the current document before the language can be reliably detected.
- abstract boolean hasModel(String language): Provide information about whether a model exists for a specific language.
- boolean isMixedLanguages()
- boolean isShortText()
- abstract LanguageDetector loadModels(): Load (or re-load) all available language models.
- abstract LanguageDetector loadModels(Set<String> languages): Load (or re-load) the models specified in languages.
- abstract void reset(): Reset statistics about the current document being processed.
- LanguageDetector setMixedLanguages(boolean mixedLanguages)
- abstract LanguageDetector setPriors(Map<String,Float> languageProbabilities): Set the a-priori probabilities for these languages.
- LanguageDetector setShortText(boolean shortText)

Method Detail
 - 
getDefaultLanguageDetector
public static LanguageDetector getDefaultLanguageDetector()
 - 
getLanguageDetectors
public static List<LanguageDetector> getLanguageDetectors()
 - 
getLanguageDetectors
public static List<LanguageDetector> getLanguageDetectors(ServiceLoader loader)
 - 
isMixedLanguages
public boolean isMixedLanguages()
 - 
setMixedLanguages
public LanguageDetector setMixedLanguages(boolean mixedLanguages)
 - 
isShortText
public boolean isShortText()
 - 
setShortText
public LanguageDetector setShortText(boolean shortText)
 - 
loadModels
public abstract LanguageDetector loadModels() throws IOException
Load (or re-load) all available language models. This must be called after any settings that would impact the models being loaded (e.g. mixed language/short text), but before any of the document processing routines (below) are called. Note that it only needs to be called once.
- Returns:
- this
- Throws:
- IOException
 
 - 
loadModels
public abstract LanguageDetector loadModels(Set<String> languages) throws IOException
Load (or re-load) the models specified in languages. These use the ISO 639-1 names, with an optional "-<country code>" suffix for a more specific variant (e.g. "zh-CN" for Chinese in China).
- Parameters:
- languages - set of target languages.
- Returns:
- this
- Throws:
- IOException
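
The tag format described for loadModels(Set) (an ISO 639-1 language code with an optional country suffix) can be parsed by the JDK's java.util.Locale. The following is a minimal sketch, not part of Tika, of how such a tag splits into its parts. The class LanguageTagDemo and the method describe are hypothetical names used only for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: splits a tag such as "zh-CN"
// into its ISO 639-1 language code and optional country code.
final class LanguageTagDemo {

    // Returns "language" or "language/COUNTRY" for the given tag.
    static String describe(String tag) {
        Locale locale = Locale.forLanguageTag(tag);
        String language = locale.getLanguage();
        String country = locale.getCountry(); // empty when no country is given
        return country.isEmpty() ? language : language + "/" + country;
    }

    public static void main(String[] args) {
        System.out.println(describe("zh-CN")); // prints zh/CN
        System.out.println(describe("en"));    // prints en
    }
}
```

Whether a given detector actually ships a model for a country-specific tag is detector-specific; hasModel(String) is the way to check.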
 
 - 
hasModel
public abstract boolean hasModel(String language)
Provide information about whether a model exists for a specific language.
- Parameters:
- language - ISO 639-1 name for language
- Returns:
- true if a model for this language exists.
 
 - 
setPriors
public abstract LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
Set the a-priori probabilities for these languages. The provided map uses the language as the key, and the probability (0.0 < probability < 1.0) of text being in that language as the value. Note that if the probabilities don't sum to 1.0, these values will be normalized.
If hasModel() returns false for any of the languages, an IllegalArgumentException is thrown.
Use of these probabilities is detector-specific, and thus might not impact the results at all. As such, these should be viewed as a hint.
- Parameters:
- languageProbabilities - Map from language to probability
- Returns:
- this
- Throws:
- IOException
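
The validation and normalization rules described for setPriors can be illustrated with a small standalone sketch. This is not Tika's implementation: the class PriorsSketch, the method normalize, and the knownModels parameter (standing in for the languages for which hasModel() returns true) are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Tika's implementation: validates a-priori
// language probabilities and rescales them so that they sum to 1.0.
final class PriorsSketch {

    // knownModels stands in for the languages for which hasModel() is true.
    static Map<String, Float> normalize(Map<String, Float> priors, Set<String> knownModels) {
        float sum = 0f;
        for (Map.Entry<String, Float> e : priors.entrySet()) {
            if (!knownModels.contains(e.getKey())) {
                throw new IllegalArgumentException("No model for language: " + e.getKey());
            }
            sum += e.getValue();
        }
        if (sum <= 0f) {
            // Assumption: an empty or all-zero map cannot be normalized.
            throw new IllegalArgumentException("Probabilities must sum to a positive value");
        }
        Map<String, Float> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : priors.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / sum);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Float> priors = new LinkedHashMap<>();
        priors.put("en", 0.6f);
        priors.put("de", 0.2f);
        Set<String> models = new HashSet<>(Arrays.asList("en", "de", "fr"));
        // 0.6 + 0.2 = 0.8, so the values are rescaled to roughly 0.75 and 0.25.
        System.out.println(normalize(priors, models));
    }
}
```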
 
 - 
reset
public abstract void reset()
Reset statistics about the current document being processed.
 - 
addText
public abstract void addText(char[] cbuf, int off, int len)
Add statistics about this text for the current document. Note that we assume an implicit word break exists before/after each of these runs of text.
- Parameters:
- cbuf - Character buffer
- off - Offset into cbuf to first character in the run of text
- len - Number of characters in the run of text.
 
 - 
addText
public void addText(CharSequence text)
Add text to the statistics being accumulated for the current document. Note that this is a default, unoptimized implementation for adding a string.
- Parameters:
- text - Characters to add to current statistics.
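
One straightforward way to provide an unoptimized CharSequence overload is to copy the characters into an array and delegate to the char[] overload. The sketch below illustrates that pattern under stated assumptions; TextSinkSketch and CountingSink are hypothetical stand-ins, and this is not presented as Tika's actual code.

```java
// Hypothetical stand-in for the two addText overloads; not Tika's code.
abstract class TextSinkSketch {

    // Subclasses accumulate statistics for one run of characters.
    abstract void addText(char[] cbuf, int off, int len);

    // Unoptimized default: copy the sequence into an array and delegate.
    void addText(CharSequence text) {
        char[] chars = text.toString().toCharArray();
        addText(chars, 0, chars.length);
    }
}

// Counts characters so that the delegation can be observed.
final class CountingSink extends TextSinkSketch {
    int charsSeen = 0;

    @Override
    void addText(char[] cbuf, int off, int len) {
        charsSeen += len;
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        sink.addText("Hello, world");                        // 12 characters
        sink.addText(new char[] {'a', 'b', 'c', 'd'}, 1, 2); // 2 characters
        System.out.println(sink.charsSeen);                  // prints 14
    }
}
```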
 
 - 
hasEnoughText
public boolean hasEnoughText()
Tell the caller whether more text is required for the current document before the language can be reliably detected.
Implementations can override this to do early termination of stats collection, which can improve performance with longer documents.
Note that detect() can be called even when this returns false.
- Returns:
- true if we have enough text for reliable detection.
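
The early-termination idea behind hasEnoughText() can be sketched as a feeding loop that stops once the collector reports it has seen enough. Everything in this example is an illustrative assumption: the class EarlyStopSketch, the 20-character threshold, and the chunksConsumed counter are not taken from Tika.

```java
import java.util.Arrays;

// Hypothetical sketch of early termination; the threshold and all names
// here are illustrative assumptions, not values taken from Tika.
final class EarlyStopSketch {
    private static final int ENOUGH_CHARS = 20; // assumed threshold
    private int charsSeen = 0;
    int chunksConsumed = 0;

    void addText(CharSequence text) {
        charsSeen += text.length();
        chunksConsumed++;
    }

    boolean hasEnoughText() {
        return charsSeen >= ENOUGH_CHARS;
    }

    // Feeds chunks until the collector reports that it has enough text.
    void feed(Iterable<String> chunks) {
        for (String chunk : chunks) {
            if (hasEnoughText()) {
                break; // skip the remainder of a long document
            }
            addText(chunk);
        }
    }

    public static void main(String[] args) {
        EarlyStopSketch collector = new EarlyStopSketch();
        collector.feed(Arrays.asList(
                "The quick brown", " fox jumps", " over the", " lazy dog"));
        // 15 + 10 characters reach the threshold, so two chunks are consumed.
        System.out.println(collector.chunksConsumed); // prints 2
    }
}
```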
 
 - 
detectAll
public abstract List<LanguageResult> detectAll()
Detect languages based on previously submitted text (via addText calls).
- Returns:
- list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest. There will always be at least one result, which might have a confidence of NONE.
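
The return contract of detectAll() (at least medium confidence, highest first, never empty) can be made concrete with a small standalone sketch. Confidence and Result below are local stand-ins rather than Tika's LanguageResult type, and the empty-language placeholder used for the NONE case is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the ordering contract; Confidence and Result are
// local stand-ins, not Tika's LanguageResult type.
final class DetectAllSketch {

    enum Confidence { HIGH, MEDIUM, LOW, NONE } // declared from highest to lowest

    static final class Result {
        final String language;
        final Confidence confidence;

        Result(String language, Confidence confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    // Keeps candidates of at least MEDIUM confidence, highest first; if none
    // qualify, returns a single placeholder result with confidence NONE.
    static List<Result> select(List<Result> candidates) {
        List<Result> kept = new ArrayList<>();
        for (Result r : candidates) {
            if (r.confidence.compareTo(Confidence.MEDIUM) <= 0) {
                kept.add(r);
            }
        }
        kept.sort(Comparator.comparing((Result r) -> r.confidence));
        if (kept.isEmpty()) {
            kept.add(new Result("", Confidence.NONE)); // assumed placeholder
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Result> candidates = new ArrayList<>();
        candidates.add(new Result("de", Confidence.MEDIUM));
        candidates.add(new Result("xx", Confidence.LOW));
        candidates.add(new Result("en", Confidence.HIGH));
        for (Result r : select(candidates)) {
            System.out.println(r.language + " " + r.confidence);
        }
    }
}
```

Callers that only need the top candidate can take the first element of such a list.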
 
 - 
detect
public LanguageResult detect()
 - 
detectAll
public List<LanguageResult> detectAll(String text)
Utility wrapper that detects the language of a given chunk of text.
- Parameters:
- text - String to add to current statistics.
- Returns:
- list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest.
 
 - 
detect
public LanguageResult detect(CharSequence text)
 