java.lang.Object

org.apache.tika.language.detect.LanguageDetector

org.apache.tika.langdetect.mitll.TextLangDetector

public class TextLangDetector extends LanguageDetector

Language detection using MIT Lincoln Lab’s Text.jl library, accessed through the TextREST.jl server: https://github.com/trevorlewis/TextREST.jl

Start the TextREST.jl server before using this detector.

Field Summary

Fields inherited from class org.apache.tika.language.detect.LanguageDetector
mixedLanguages, shortText
Constructor Summary

| Constructor | Description |
| --- | --- |
| TextLangDetector() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | addText(char[] cbuf, int off, int len) | Add statistics about this text for the current document. |
| protected static boolean | canRun() | |
| List<LanguageResult> | detectAll() | Detect languages based on previously submitted text (via addText calls). |
| boolean | hasModel(String language) | Provide information about whether a model exists for a specific language. |
| LanguageDetector | loadModels() | Load (or re-load) all available language models. |
| LanguageDetector | loadModels(Set<String> set) | Load (or re-load) the models specified in set. |
| void | reset() | Reset statistics about the current document being processed. |
| LanguageDetector | setPriors(Map<String,Float> languageProbabilities) | Set the a-priori probabilities for these languages. |

Methods inherited from class org.apache.tika.language.detect.LanguageDetector
addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, hasEnoughText, isMixedLanguages, isShortText, setMixedLanguages, setShortText

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TextLangDetector
  
  public TextLangDetector()
Method Details
- canRun
  
  protected static boolean canRun()
- loadModels
  
  public LanguageDetector loadModels() throws IOException
  
  Description copied from class: LanguageDetector
  
  Load (or re-load) all available language models. This must be called after any settings that would impact the models being loaded (e.g. mixed language/short text), but before any of the document processing routines (below) are called. Note that it only needs to be called once.
  
  Specified by:
  
  loadModels in class LanguageDetector
  
  Returns:
  
  this
  
  Throws:
  
  IOException
- loadModels
  
  public LanguageDetector loadModels(Set<String> set) throws IOException
  
  Description copied from class: LanguageDetector
  
  Load (or re-load) the models specified in set. These use the ISO 639-1 names, with an optional "-" followed by a country code for a more specific variant (e.g. "zh-CN" for Chinese in China).
  
  Specified by:
  
  loadModels in class LanguageDetector
  
  Parameters:
  
  set - set of target languages.
  
  Returns:
  
  this
  
  Throws:
  
  IOException
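  
  The language names described above ("zh" or "zh-CN") can be split into a language part and an optional country part. The following is a minimal sketch in plain Java; the class LanguageTagSketch and its split method are hypothetical helpers for illustration and are not part of Tika.
  
  ```java
  import java.util.Locale;

  // Hypothetical helper, not part of Tika: splits names such as "zh-CN".
  final class LanguageTagSketch {

      /** Returns {language, country}; country is "" when the name has no "-" suffix. */
      public static String[] split(String tag) {
          int dash = tag.indexOf('-');
          if (dash < 0) {
              return new String[] { tag.toLowerCase(Locale.ROOT), "" };
          }
          String language = tag.substring(0, dash).toLowerCase(Locale.ROOT);
          String country = tag.substring(dash + 1).toUpperCase(Locale.ROOT);
          return new String[] { language, country };
      }

      public static void main(String[] args) {
          String[] parts = split("zh-CN");
          System.out.println(parts[0] + " / " + parts[1]); // prints "zh / CN"
      }
  }
  ```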
- hasModel
  
  public boolean hasModel(String language)
  
  Description copied from class: LanguageDetector
  
  Provide information about whether a model exists for a specific language.
  
  Specified by:
  
  hasModel in class LanguageDetector
  
  Parameters:
  
  language - ISO 639-1 name for language
  
  Returns:
  
  true if a model for this language exists.
- setPriors
  
  public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
  
  Description copied from class: LanguageDetector
  
  Set the a-priori probabilities for these languages. The provided map uses the language as the key, and the probability (0.0 < probability < 1.0) of text being in that language as the value. Note that if the probabilities don't sum to 1.0, these values will be normalized.
  If hasModel() returns false for any of the languages, an IllegalArgumentException is thrown.
  Use of these probabilities is detector-specific, and thus might not impact the results at all. As such, these should be viewed as a hint.
  
  Specified by:
  
  setPriors in class LanguageDetector
  
  Parameters:
  
  languageProbabilities - Map from language to probability
  
  Returns:
  
  this
  
  Throws:
  
  IOException
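  
  The normalization mentioned above can be illustrated by dividing each prior by the sum of all priors. This is a sketch under assumptions: PriorsSketch and normalize are hypothetical names, the range check mirrors the documented range, and the way TextLangDetector actually treats priors may differ (the description above says they are only a hint).
  
  ```java
  import java.util.LinkedHashMap;
  import java.util.Map;

  // Hypothetical helper, not part of Tika: rescales priors so they sum to 1.0.
  final class PriorsSketch {

      public static Map<String, Float> normalize(Map<String, Float> priors) {
          float sum = 0f;
          for (Map.Entry<String, Float> e : priors.entrySet()) {
              float p = e.getValue();
              // Assumption: enforce the documented range 0.0 < probability < 1.0.
              if (!(p > 0f && p < 1f)) {
                  throw new IllegalArgumentException(
                          "Prior out of range for " + e.getKey() + ": " + p);
              }
              sum += p;
          }
          Map<String, Float> normalized = new LinkedHashMap<>();
          for (Map.Entry<String, Float> e : priors.entrySet()) {
              normalized.put(e.getKey(), e.getValue() / sum);
          }
          return normalized;
      }

      public static void main(String[] args) {
          Map<String, Float> priors = new LinkedHashMap<>();
          priors.put("en", 0.6f);
          priors.put("fr", 0.2f);
          // The inputs sum to 0.8, so each value is divided by 0.8.
          System.out.println(normalize(priors));
      }
  }
  ```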
- reset
  
  public void reset()
  
  Description copied from class: LanguageDetector
  
  Reset statistics about the current document being processed.
  
  Specified by:
  
  reset in class LanguageDetector
- addText
  
  public void addText(char[] cbuf, int off, int len)
  
  Description copied from class: LanguageDetector
  
  Add statistics about this text for the current document. Note that we assume an implicit word break exists before/after each of these runs of text.
  
  Specified by:
  
  addText in class LanguageDetector
  
  Parameters:
  
  cbuf - Character buffer
  
  off - Offset into cbuf to first character in the run of text
  
  len - Number of characters in the run of text.
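  
  The cbuf/off/len parameters and the implicit word break can be illustrated with a small buffer that joins successive runs with a space. This is a sketch only: TextBufferSketch is a hypothetical class, and how TextLangDetector stores submitted text internally is not stated here.
  
  ```java
  // Hypothetical helper, not part of Tika: accumulates runs of text.
  final class TextBufferSketch {
      private final StringBuilder text = new StringBuilder();

      /** Appends len characters of cbuf starting at off as one run of text. */
      public void addText(char[] cbuf, int off, int len) {
          if (text.length() > 0) {
              text.append(' '); // models the implicit word break between runs
          }
          text.append(cbuf, off, len);
      }

      /** Discards everything collected for the current document. */
      public void reset() {
          text.setLength(0);
      }

      public String contents() {
          return text.toString();
      }

      public static void main(String[] args) {
          char[] buf = "xxhello worldyy".toCharArray();
          TextBufferSketch sketch = new TextBufferSketch();
          sketch.addText(buf, 2, 5); // the run "hello"
          sketch.addText(buf, 8, 5); // the run "world"
          System.out.println(sketch.contents()); // prints "hello world"
      }
  }
  ```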
- detectAll
  
  public List<LanguageResult> detectAll()
  
  Description copied from class: LanguageDetector
  
  Detect languages based on previously submitted text (via addText calls).
  
  Specified by:
  
  detectAll in class LanguageDetector
  
  Returns:
  
  list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest. There will always be at least one result, which might have a confidence of NONE.
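  
  The return contract above (at least medium confidence, sorted from highest to lowest, never empty) can be sketched as a filter and sort over candidate results. The Confidence enum, the Result class, and the rank method below are hypothetical stand-ins written for this example; they are not Tika's LanguageResult types, and the numeric score field is an assumption.
  
  ```java
  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;

  // Hypothetical types, not part of Tika: illustrate the ordering contract only.
  final class DetectAllSketch {
      enum Confidence { NONE, LOW, MEDIUM, HIGH }

      static final class Result {
          final String language;
          final Confidence confidence;
          final float score;

          Result(String language, Confidence confidence, float score) {
              this.language = language;
              this.confidence = confidence;
              this.score = score;
          }
      }

      /** Keeps MEDIUM and HIGH results, highest score first; never returns an empty list. */
      public static List<Result> rank(List<Result> candidates) {
          List<Result> kept = new ArrayList<>();
          for (Result r : candidates) {
              if (r.confidence.compareTo(Confidence.MEDIUM) >= 0) {
                  kept.add(r);
              }
          }
          kept.sort(Comparator.comparingDouble((Result r) -> r.score).reversed());
          if (kept.isEmpty()) {
              // Assumption: an empty language string marks "no language detected".
              kept.add(new Result("", Confidence.NONE, 0f));
          }
          return kept;
      }

      public static void main(String[] args) {
          List<Result> candidates = new ArrayList<>();
          candidates.add(new Result("fr", Confidence.MEDIUM, 0.6f));
          candidates.add(new Result("de", Confidence.LOW, 0.2f));
          candidates.add(new Result("en", Confidence.HIGH, 0.9f));
          for (Result r : rank(candidates)) {
              System.out.println(r.language + " " + r.confidence);
          }
      }
  }
  ```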
