Class OpenNLPDetector
- java.lang.Object
  - org.apache.tika.language.detect.LanguageDetector
    - org.apache.tika.langdetect.opennlp.OpenNLPDetector
public class OpenNLPDetector extends LanguageDetector
This is based on OpenNLP's language detector. However, we've built our own ProbingLanguageDetector and our own language models.
To build our model, we followed OpenNLP's lead by using the Leipzig corpus as gathered and preprocessed (big-data corpus). We removed azj, plt, sun and zsm because our models could not distinguish them well enough from related languages. We removed cmn in favor of the finer-grained zho-trad and zho-simp. We then added the following languages from cc-100: ben-rom (Bengali Romanized), ful, gla, gug, hau, hin-rom, ibo, lin, mya-zaw, nso, orm, quz, roh, srd, ssw, tam-rom, tel-rom, tsn, urd-rom, wol, yor.
We ran our own train/devtest/test code because OpenNLP's required more sentences/data than were available for some languages.
Please open an issue on our JIRA if we made mistakes or misunderstood something in our design choices, or if you need other languages added.
Citations for the cc-100 corpus:
Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), p. 8440-8451, July 2020.
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
Field Summary
Fields inherited from class org.apache.tika.language.detect.LanguageDetector
mixedLanguages, shortText
Constructor Summary
Constructor, Description:
- OpenNLPDetector()
Method Summary
Modifier and Type, Method, Description:
- void addText(char[] cbuf, int off, int len): This will buffer up to the length set by setMaxLength(int) and then ignore the rest of the text.
- List<LanguageResult> detectAll(): Detect languages based on previously submitted text (via addText calls).
- String[] getSupportedLanguages()
- boolean hasModel(String language): Provide information about whether a model exists for a specific language.
- LanguageDetector loadModels(): No-op.
- LanguageDetector loadModels(Set<String> languages): NOT SUPPORTED.
- void reset(): Reset statistics about the current document being processed.
- void setMaxLength(int maxLength)
- LanguageDetector setPriors(Map<String,Float> languageProbabilities): NOT YET SUPPORTED.
Methods inherited from class org.apache.tika.language.detect.LanguageDetector
addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, hasEnoughText, isMixedLanguages, isShortText, setMixedLanguages, setShortText
Method Detail
loadModels
public LanguageDetector loadModels() throws IOException
No-op. Models are loaded statically.
- Specified by: loadModels in class LanguageDetector
- Throws: IOException
-
loadModels
public LanguageDetector loadModels(Set<String> languages) throws IOException
NOT SUPPORTED. Throws UnsupportedOperationException.
- Specified by: loadModels in class LanguageDetector
- Parameters: languages - list of target languages.
- Throws: IOException
-
hasModel
public boolean hasModel(String language)
Description copied from class: LanguageDetector
Provide information about whether a model exists for a specific language.
- Specified by: hasModel in class LanguageDetector
- Parameters: language - ISO 639-1 name for language
- Returns: true if a model for this language exists.
-
setPriors
public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
NOT YET SUPPORTED. Throws UnsupportedOperationException.
- Specified by: setPriors in class LanguageDetector
- Parameters: languageProbabilities - Map from language to probability
- Throws: IOException
-
reset
public void reset()
Description copied from class: LanguageDetector
Reset statistics about the current document being processed.
- Specified by: reset in class LanguageDetector
-
addText
public void addText(char[] cbuf, int off, int len)
This will buffer text up to the length set by setMaxLength(int) and then ignore the rest of the text.
- Specified by: addText in class LanguageDetector
- Parameters:
  cbuf - Character buffer
  off - Offset into cbuf to first character in the run of text
  len - Number of characters in the run of text.
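The buffering behavior described for addText can be pictured with a small stand-alone sketch. It uses only the JDK and is not the Tika implementation: the class name BoundedTextBuffer and its helper methods are hypothetical, chosen only to show a buffer that accepts text up to a maximum length and silently drops anything beyond it.

```java
/**
 * Hypothetical illustration of "buffer up to a maximum length, then ignore
 * the rest". This is NOT the OpenNLPDetector source; names are assumptions.
 */
final class BoundedTextBuffer {
    private final StringBuilder buffer = new StringBuilder();
    private int maxLength;

    BoundedTextBuffer(int maxLength) {
        this.maxLength = maxLength;
    }

    void setMaxLength(int maxLength) {
        this.maxLength = maxLength;
    }

    /** Appends only as many characters as still fit; the remainder is dropped. */
    void addText(char[] cbuf, int off, int len) {
        int remaining = maxLength - buffer.length();
        if (remaining <= 0) {
            return; // buffer already full: ignore this run of text entirely
        }
        buffer.append(cbuf, off, Math.min(len, remaining));
    }

    /** Clears the buffered text so a new document can be processed. */
    void reset() {
        buffer.setLength(0);
    }

    int length() {
        return buffer.length();
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        BoundedTextBuffer b = new BoundedTextBuffer(10);
        char[] text = "The quick brown fox".toCharArray();
        b.addText(text, 0, text.length); // only the first 10 characters are kept
        b.addText(text, 0, text.length); // ignored: the buffer is already full
        System.out.println(b.length());
    }
}
```

In this sketch, a caller that streams a long document pays no extra cost after the limit is reached, which is presumably the motivation for capping the buffer.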
-
detectAll
public List<LanguageResult> detectAll()
Description copied from class: LanguageDetector
Detect languages based on previously submitted text (via addText calls).
- Specified by: detectAll in class LanguageDetector
- Returns: list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest. There will always be at least one result, which might have a confidence of NONE.
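The return contract stated above (keep results with at least medium confidence, order them from highest to lowest, never return an empty list) can be sketched with JDK-only code. Everything here is a hypothetical stand-in rather than Tika's own types: the Confidence enum, the Result class and the select method are assumptions made for illustration, and the empty-language placeholder used as the fallback is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the documented detectAll() contract; not Tika code. */
final class DetectAllContractSketch {
    /** Stand-in confidence levels, declared from strongest to weakest. */
    enum Confidence { HIGH, MEDIUM, LOW, NONE }

    /** Stand-in for a detection result. */
    static final class Result {
        final String language;
        final Confidence confidence;

        Result(String language, Confidence confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    /** Filters to HIGH/MEDIUM, sorts strongest first, guarantees one element. */
    static List<Result> select(List<Result> candidates) {
        List<Result> kept = new ArrayList<>();
        for (Result r : candidates) {
            if (r.confidence == Confidence.HIGH || r.confidence == Confidence.MEDIUM) {
                kept.add(r);
            }
        }
        // Enum order is HIGH < MEDIUM, so natural ordering puts HIGH first.
        kept.sort((a, b) -> a.confidence.compareTo(b.confidence));
        if (kept.isEmpty()) {
            // Assumed placeholder so callers can always read the first element.
            kept.add(new Result("", Confidence.NONE));
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Result> in = new ArrayList<>();
        in.add(new Result("fra", Confidence.MEDIUM));
        in.add(new Result("deu", Confidence.LOW));
        in.add(new Result("eng", Confidence.HIGH));
        for (Result r : select(in)) {
            System.out.println(r.language + " " + r.confidence);
        }
    }
}
```

A practical consequence of this contract is that callers can read the first element of the returned list without an emptiness check, but should still inspect its confidence before trusting it.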
-
setMaxLength
public void setMaxLength(int maxLength)
-
getSupportedLanguages
public String[] getSupportedLanguages()