public class TikaLanguageDetector extends LanguageDetector
Because it works only on trigrams, it is not suitable for short texts.
There are better-performing language detectors. This module is still here in the hope that we'll get around to improving it, because it is elegant and could be improved fairly easily.
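To make the short-text limitation concrete, the following is a minimal sketch of character-trigram counting. It is not Tika's implementation: the class `TrigramSketch` and its `profile` method are hypothetical names invented for illustration, and a real detector may normalize or pad the text before counting.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy character-trigram counter (hypothetical; not Tika's code). */
final class TrigramSketch {

    /** Counts every overlapping 3-character window in the text. */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A 5-character word has only three windows: "hel", "ell", "llo".
        System.out.println(profile("hello"));
        // Text shorter than 3 characters yields no trigrams at all.
        System.out.println(profile("ab").isEmpty());
    }
}
```

A text of n characters contains at most n - 2 trigrams, so a short phrase gives a trigram-based model very little evidence to compare against its language profiles.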
Fields inherited from class LanguageDetector: mixedLanguages, shortText
Constructor and Description |
---|
TikaLanguageDetector() |
Modifier and Type | Method and Description |
---|---|
void | addText(char[] cbuf, int off, int len) Add statistics about this text for the current document. |
List<LanguageResult> | detectAll() Detect languages based on previously submitted text (via addText calls). |
boolean | hasModel(String language) Provide information about whether a model exists for a specific language. |
LanguageDetector | loadModels() Load (or re-load) all available language models. |
LanguageDetector | loadModels(Set<String> languages) Load (or re-load) the models for the specified languages. |
void | reset() Reset statistics about the current document being processed. |
LanguageDetector | setPriors(Map<String,Float> languageProbabilities) Not supported. |
Methods inherited from class LanguageDetector: addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, hasEnoughText, isMixedLanguages, isShortText, setMixedLanguages, setShortText
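The `addText(char[] cbuf, int off, int len)` signature follows the usual Java buffer convention, so text can be streamed in chunks without building one large string. The sketch below shows that calling pattern only. To stay runnable without Tika on the classpath it uses a hypothetical `TextSink` interface as a stand-in for the detector; `AddTextSketch` and `feed` are likewise invented names, not part of the library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class AddTextSketch {

    /** Hypothetical stand-in mirroring the detector's addText signature. */
    interface TextSink {
        void addText(char[] cbuf, int off, int len);
    }

    /** Streams a Reader into the sink in chunks; returns the number of chars fed. */
    static int feed(Reader reader, TextSink sink, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] buf = new char[chunkSize];
        int total = 0;
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                // Only the first n chars of buf are valid on this pass.
                sink.addText(buf, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        StringBuilder seen = new StringBuilder();
        TextSink sink = (cbuf, off, len) -> seen.append(cbuf, off, len);
        int fed = feed(new StringReader("Bonjour tout le monde"), sink, 8);
        System.out.println(fed + " chars fed: " + seen);
    }
}
```

With the real class, the same loop would presumably pass each chunk to the detector's `addText`, call `detectAll()` once the document is exhausted, and call `reset()` before starting the next document.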
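public LanguageDetector loadModels() throws IOException
Description copied from class: LanguageDetector
Load (or re-load) all available language models.
Specified by: loadModels in class LanguageDetector
Throws: IOException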
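public LanguageDetector loadModels(Set<String> languages) throws IOException
Description copied from class: LanguageDetector
Load (or re-load) the models for the specified languages.
Specified by: loadModels in class LanguageDetector
Parameters: languages - list of target languages.
Throws: IOException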
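public boolean hasModel(String language)
Description copied from class: LanguageDetector
Provide information about whether a model exists for a specific language.
Specified by: hasModel in class LanguageDetector
Parameters: language - ISO 639-1 name for language
public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
Not supported.
Specified by: setPriors in class LanguageDetector
Parameters: languageProbabilities - Map from language to probability
Throws: IOException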
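Since `hasModel` expects an ISO 639-1 name, a caller may want to validate a code before asking whether a model exists. The sketch below uses only the JDK's `Locale.getISOLanguages()`; the class `Iso639Sketch` and the method `isIso6391` are hypothetical helpers, not part of Tika. Note the assumptions: the JDK list can include legacy two-letter codes, and a valid code says nothing about whether a model is actually available.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: checks a string against the JDK's two-letter ISO 639 codes. */
final class Iso639Sketch {

    // Two-letter language codes known to the running JDK.
    private static final Set<String> CODES =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    /** Returns true if the code is a two-letter ISO 639 language code. */
    static boolean isIso6391(String code) {
        return code != null && CODES.contains(code);
    }

    public static void main(String[] args) {
        System.out.println(isIso6391("en"));   // two-letter code
        System.out.println(isIso6391("eng"));  // three-letter ISO 639-2 form
    }
}
```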
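public void reset()
Description copied from class: LanguageDetector
Reset statistics about the current document being processed.
Specified by: reset in class LanguageDetector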
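public void addText(char[] cbuf, int off, int len)
Description copied from class: LanguageDetector
Add statistics about this text for the current document.
Specified by: addText in class LanguageDetector
Parameters:
cbuf - Character buffer
off - Offset into cbuf to first character in the run of text
len - Number of characters in the run of text.
public List<LanguageResult> detectAll()
Description copied from class: LanguageDetector
Detect languages based on previously submitted text (via addText calls).
Specified by: detectAll in class LanguageDetector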
Copyright © 2007–2022 The Apache Software Foundation. All rights reserved.