public class OpenNLPDetector extends LanguageDetector
This is based on OpenNLP's language detector. However, we've built our own ProbingLanguageDetector and our own language models.
To build our model, we followed OpenNLP's lead by using the Leipzig corpus as gathered and preprocessed (big-data corpus). We removed azj, plt, sun and zsm because our models could not distinguish them well enough from related languages. We removed cmn in favor of the finer-grained zho-trad and zho-simp. We then added the following languages from cc-100: ben-rom (Bengali Romanized), ful, gla, gug, hau, hin-rom, ibo, lin, mya-zaw, nso, orm, quz, roh, srd, ssw, tam-rom, tel-rom, tsn, urd-rom, wol, yor.
We ran our own train/devtest/test code because OpenNLP's code required more sentences/data than were available for some languages.
Please open an issue on our JIRA if we made mistakes and/or had misunderstandings in our design choices or if you need to have other languages added.
Citations for the cc-100 corpus:
Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), p. 8440-8451, July 2020.
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
Fields inherited from class LanguageDetector: mixedLanguages, shortText
Constructor and Description |
---|
OpenNLPDetector() |
Modifier and Type | Method and Description |
---|---|
void | addText(char[] cbuf, int off, int len): This will buffer up to setMaxLength(int) and then ignore the rest of the text. |
List<LanguageResult> | detectAll(): Detect languages based on previously submitted text (via addText calls). |
String[] | getSupportedLanguages() |
boolean | hasModel(String language): Provide information about whether a model exists for a specific language. |
LanguageDetector | loadModels(): No-op. |
LanguageDetector | loadModels(Set<String> languages): NOT SUPPORTED. |
void | reset(): Reset statistics about the current document being processed. |
void | setMaxLength(int maxLength) |
LanguageDetector | setPriors(Map<String,Float> languageProbabilities): NOT YET SUPPORTED. |
Methods inherited from class LanguageDetector: addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, hasEnoughText, isMixedLanguages, isShortText, setMixedLanguages, setShortText
public LanguageDetector loadModels() throws IOException
No-op.
Specified by: loadModels in class LanguageDetector
Throws: IOException
public LanguageDetector loadModels(Set<String> languages) throws IOException
NOT SUPPORTED. Throws UnsupportedOperationException.
Specified by: loadModels in class LanguageDetector
Parameters:
languages - list of target languages.
Throws: IOException
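public boolean hasModel(String language)
Description copied from class: LanguageDetector
Provide information about whether a model exists for a specific language.
Specified by: hasModel in class LanguageDetector
Parameters:
language - ISO 639-1 name for language
public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
NOT YET SUPPORTED. Throws UnsupportedOperationException.
Specified by: setPriors in class LanguageDetector
Parameters:
languageProbabilities - Map from language to probability
Throws: IOException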
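public void reset()
Description copied from class: LanguageDetector
Reset statistics about the current document being processed.
Specified by: reset in class LanguageDetector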
public void addText(char[] cbuf, int off, int len)
This will buffer up to setMaxLength(int) and then ignore the rest of the text.
Specified by: addText in class LanguageDetector
Parameters:
cbuf - Character buffer
off - Offset into cbuf to first character in the run of text
len - Number of characters in the run of text.
public List<LanguageResult> detectAll()
Description copied from class: LanguageDetector
Detect languages based on previously submitted text (via addText calls).
Specified by: detectAll in class LanguageDetector
public void setMaxLength(int maxLength)
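The interaction between setMaxLength(int) and addText(char[], int, int) described above (buffer up to the maximum length, then ignore the rest of the text) can be illustrated with a minimal, self-contained sketch. The class BoundedTextBuffer and its text() accessor are hypothetical names introduced only for this illustration; this is not the OpenNLPDetector source, and the real detector may handle edge cases differently.

```java
// Hypothetical stand-in that mimics the documented buffering behaviour.
// It is NOT part of Tika; names and details are assumptions for illustration.
final class BoundedTextBuffer {
    private final StringBuilder buffer = new StringBuilder();
    private int maxLength;

    BoundedTextBuffer(int maxLength) {
        setMaxLength(maxLength);
    }

    // Same idea as setMaxLength(int): cap on the number of buffered characters.
    void setMaxLength(int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        this.maxLength = maxLength;
    }

    // Same parameter shape as addText(char[] cbuf, int off, int len).
    void addText(char[] cbuf, int off, int len) {
        int remaining = maxLength - buffer.length();
        if (remaining <= 0) {
            return; // cap already reached: silently ignore further text
        }
        // Keep only as many characters as still fit under the cap.
        buffer.append(cbuf, off, Math.min(len, remaining));
    }

    // Same idea as reset(): forget the current document.
    void reset() {
        buffer.setLength(0);
    }

    // Hypothetical accessor so the buffered text can be inspected.
    String text() {
        return buffer.toString();
    }
}
```

With a cap of 5, adding the characters of "hello world" leaves only "hello" in the buffer, and later addText calls have no effect until reset() is called. In practice this suggests setting the maximum length before submitting text, since anything past the cap is never seen by detection.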
public String[] getSupportedLanguages()
Copyright © 2007–2022 The Apache Software Foundation. All rights reserved.