java.lang.Object

org.apache.tika.language.detect.LanguageDetector

Direct Known Subclasses: Lingo24LangDetector, OpenNLPDetector, OptimaizeLangDetector, TextLangDetector, TikaLanguageDetector

public abstract class LanguageDetector extends Object

Field Summary

Fields

Modifier and Type

Field

Description

protected boolean

mixedLanguages

protected boolean

shortText
Constructor Summary

Constructors

Constructor

Description

LanguageDetector()
Method Summary

Modifier and Type

Method

Description

abstract void

addText(char[] cbuf, int off, int len)

Add statistics about this text for the current document.

void

addText(CharSequence text)

Add to the statistics being accumulated for the current document.

LanguageResult

detect()

LanguageResult

detect(CharSequence text)

abstract List<LanguageResult>

detectAll()

Detect languages based on previously submitted text (via addText calls).

List<LanguageResult>

detectAll(String text)

Utility wrapper that detects the language of a given chunk of text.

static LanguageDetector

getDefaultLanguageDetector()

static List<LanguageDetector>

getLanguageDetectors()

static List<LanguageDetector>

getLanguageDetectors(ServiceLoader loader)

boolean

hasEnoughText()

Tell the caller whether more text is required for the current document before the language can be reliably detected.

abstract boolean

hasModel(String language)

Provide information about whether a model exists for a specific language.

boolean

isMixedLanguages()

boolean

isShortText()

abstract LanguageDetector

loadModels()

Load (or re-load) all available language models.

abstract LanguageDetector

loadModels(Set<String> languages)

Load (or re-load) the models specified in languages.

abstract void

reset()

Reset statistics about the current document being processed.

LanguageDetector

setMixedLanguages(boolean mixedLanguages)

abstract LanguageDetector

setPriors(Map<String,Float> languageProbabilities)

Set the a-priori probabilities for these languages.

LanguageDetector

setShortText(boolean shortText)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- mixedLanguages
  
  protected boolean mixedLanguages
- shortText
  
  protected boolean shortText
Constructor Details
- LanguageDetector
  
  public LanguageDetector()
Method Details
- getDefaultLanguageDetector
  
  public static LanguageDetector getDefaultLanguageDetector()
- getLanguageDetectors
  
  public static List<LanguageDetector> getLanguageDetectors()
- getLanguageDetectors
  
  public static List<LanguageDetector> getLanguageDetectors(ServiceLoader loader)
- isMixedLanguages
  
  public boolean isMixedLanguages()
- setMixedLanguages
  
  public LanguageDetector setMixedLanguages(boolean mixedLanguages)
- isShortText
  
  public boolean isShortText()
- setShortText
  
  public LanguageDetector setShortText(boolean shortText)
- loadModels
  
  public abstract LanguageDetector loadModels() throws IOException
  
  Load (or re-load) all available language models. This must be called after any settings that would impact the models being loaded (e.g. mixed language/short text), but before any of the document processing routines (below) are called. Note that it only needs to be called once.
  
  Returns:
  
  this
  
  Throws:
  
  IOException
- loadModels
  
  public abstract LanguageDetector loadModels(Set<String> languages) throws IOException
  
  Load (or re-load) the models specified in languages. These use the ISO 639-1 names, with an optional "-" suffix for a more specific variant (e.g. "zh-CN" for Chinese in China).
  
  Parameters:
  
  languages - set of target languages.
  
  Returns:
  
  this
  
  Throws:
  
  IOException
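As an illustration of the tag format described above, the following sketch splits a tag such as "zh-CN" into its language and region parts using the standard java.util.Locale class. The class name LanguageTagSketch and its helper methods are hypothetical and not part of Tika; how a given detector actually interprets the tags it receives is implementation-specific.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: decomposes tags like "zh-CN".
final class LanguageTagSketch {

    // Returns the primary language subtag, e.g. "zh" for "zh-CN".
    static String languageOf(String tag) {
        return Locale.forLanguageTag(tag).getLanguage();
    }

    // Returns the region subtag, e.g. "CN" for "zh-CN", or "" if there is none.
    static String regionOf(String tag) {
        return Locale.forLanguageTag(tag).getCountry();
    }

    public static void main(String[] args) {
        System.out.println(languageOf("zh-CN") + " / " + regionOf("zh-CN"));
    }
}
```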
- hasModel
  
  public abstract boolean hasModel(String language)
  
  Provide information about whether a model exists for a specific language.
  
  Parameters:
  
  language - ISO 639-1 name for language
  
  Returns:
  
  true if a model for this language exists.
- setPriors
  
  public abstract LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
  
  Set the a-priori probabilities for these languages. The provided map uses the language as the key and, as the value, the probability (0.0 < probability < 1.0) of text being in that language. Note that if the probabilities don't sum to 1.0, these values will be normalized.
  If hasModel() returns false for any of the languages, an IllegalArgumentException is thrown.
  Use of these probabilities is detector-specific, and thus might not impact the results at all. As such, these should be viewed as a hint.
  
  Parameters:
  
  languageProbabilities - Map from language to probability
  
  Returns:
  
  this
  
  Throws:
  
  IOException
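The normalization mentioned above can be pictured as dividing each value by the sum of all values. The sketch below is one plausible way to do that; PriorsSketch and normalize are hypothetical names, and the actual behavior of any particular detector may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika code: rescales priors so they sum to 1.0.
final class PriorsSketch {

    static Map<String, Float> normalize(Map<String, Float> priors) {
        float sum = 0f;
        for (float p : priors.values()) {
            if (p <= 0f) {
                throw new IllegalArgumentException("probability must be positive: " + p);
            }
            sum += p;
        }
        // Preserve the caller's iteration order in the result.
        Map<String, Float> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : priors.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / sum);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Float> priors = new LinkedHashMap<>();
        priors.put("en", 2f);
        priors.put("de", 2f);
        System.out.println(normalize(priors));
    }
}
```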
- reset
  
  public abstract void reset()
  
  Reset statistics about the current document being processed.
- addText
  
  public abstract void addText(char[] cbuf, int off, int len)
  
  Add statistics about this text for the current document. Note that we assume an implicit word break exists before/after each of these runs of text.
  
  Parameters:
  
  cbuf - Character buffer
  
  off - Offset into cbuf to first character in the run of text
  
  len - Number of characters in the run of text.
- addText
  
  public void addText(CharSequence text)
  
  Add to the statistics being accumulated for the current document. Note that this is a default, unoptimized implementation for adding a string.
  
  Parameters:
  
  text - Characters to add to current statistics.
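A default, unoptimized CharSequence overload of this kind is typically a thin wrapper around the char[] variant. The sketch below shows that pattern with a hypothetical stand-in class (Accumulator) that only counts characters; it is an assumption about the general shape, not a copy of the Tika implementation.

```java
// Hypothetical sketch, not Tika code: a convenience overload delegating to the
// abstract char[] method.
final class AddTextSketch {

    abstract static class Accumulator {
        // Subclasses implement the buffer-based variant.
        abstract void addText(char[] cbuf, int off, int len);

        // Simple default: copy the characters, then delegate.
        void addText(CharSequence text) {
            char[] chars = text.toString().toCharArray();
            addText(chars, 0, chars.length);
        }
    }

    // Minimal subclass that only counts the characters it receives.
    static final class CharCounter extends Accumulator {
        int count;

        @Override
        void addText(char[] cbuf, int off, int len) {
            count += len;
        }
    }

    public static void main(String[] args) {
        CharCounter counter = new CharCounter();
        counter.addText("hello");
        System.out.println(counter.count);
    }
}
```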
- hasEnoughText
  
  public boolean hasEnoughText()
  
  Tell the caller whether more text is required for the current document before the language can be reliably detected.
  Implementations can override this to do early termination of stats collection, which can improve performance with longer documents.
  Note that detect() can be called even when this returns false.
  
  Returns:
  
  true if we have enough text for reliable detection.
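The early-termination idea can be sketched as a loop that stops feeding text once the accumulator reports it has enough. ThresholdAccumulator below is a hypothetical stand-in with a fixed character threshold; real detectors decide "enough" in their own way.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Tika code: stop feeding chunks once enough text is seen.
final class EarlyStopSketch {

    static final class ThresholdAccumulator {
        private final int threshold;
        private int seen;

        ThresholdAccumulator(int threshold) {
            this.threshold = threshold;
        }

        void addText(CharSequence text) {
            seen += text.length();
        }

        boolean hasEnoughText() {
            return seen >= threshold;
        }

        void reset() {
            seen = 0;
        }
    }

    // Feeds chunks in order and returns how many were consumed before stopping.
    static int feed(ThresholdAccumulator acc, List<String> chunks) {
        int used = 0;
        for (String chunk : chunks) {
            if (acc.hasEnoughText()) {
                break;
            }
            acc.addText(chunk);
            used++;
        }
        return used;
    }

    public static void main(String[] args) {
        ThresholdAccumulator acc = new ThresholdAccumulator(10);
        System.out.println(feed(acc, Arrays.asList("hello", "world", "again")));
    }
}
```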
- detectAll
  
  public abstract List<LanguageResult> detectAll()
  
  Detect languages based on previously submitted text (via addText calls).
  
  Returns:
  
  list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest. There will always be at least one result, which might have a confidence of NONE.
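The return contract above (medium confidence or better, sorted from highest to lowest, never empty) can be sketched as follows. The Confidence enum, Result class, and rank/top methods are hypothetical stand-ins, not Tika types. Taking the first element for a single-result convenience method is likewise an assumption, since this page does not describe how detect() is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Tika code: filter and rank candidate results.
final class TopResultSketch {

    // Declared from highest to lowest, so ordinal order matches ranking order.
    enum Confidence { HIGH, MEDIUM, LOW, NONE }

    static final class Result {
        final String language;
        final Confidence confidence;

        Result(String language, Confidence confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    // Keeps results with at least MEDIUM confidence, best first; never returns an empty list.
    static List<Result> rank(List<Result> candidates) {
        List<Result> kept = new ArrayList<>();
        for (Result r : candidates) {
            if (r.confidence.compareTo(Confidence.MEDIUM) <= 0) {
                kept.add(r);
            }
        }
        kept.sort(Comparator.comparing((Result r) -> r.confidence));
        if (kept.isEmpty()) {
            kept.add(new Result("", Confidence.NONE));
        }
        return kept;
    }

    // Single-result convenience: the best-ranked entry.
    static Result top(List<Result> candidates) {
        return rank(candidates).get(0);
    }

    public static void main(String[] args) {
        List<Result> candidates = Arrays.asList(
                new Result("de", Confidence.LOW),
                new Result("fr", Confidence.MEDIUM),
                new Result("en", Confidence.HIGH));
        System.out.println(top(candidates).language);
    }
}
```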
- detect
  
  public LanguageResult detect()
- detectAll
  
  public List<LanguageResult> detectAll(String text)
  
  Utility wrapper that detects the language of a given chunk of text.
  
  Parameters:
  
  text - String to add to current statistics.
  
  Returns:
  
  list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest.
- detect
  
  public LanguageResult detect(CharSequence text)
