org.apache.tika.language.detect.LanguageDetector

org.apache.tika.langdetect.charsoup.CharSoupLanguageDetector

All Implemented Interfaces: SelfConfiguring

public class CharSoupLanguageDetector extends LanguageDetector implements SelfConfiguring

CharSoup language detector using INT8-quantized multinomial logistic regression trained on Wikipedia (primary corpus) with MADLAD supplements for thin languages.

Text is buffered via addText(char[], int, int) up to CharSoupFeatureExtractor.MAX_TEXT_LENGTH characters. At detectAll() time, the buffer is evaluated in independent 5000-character chunks. Each chunk runs the full preprocessing pipeline (truncate → strip URLs/emails → NFC normalize → extract bigram features → score via raw logits). If the first chunk produces high entropy (indicating junk, code, or non-language content), the next chunk is tried. The result from the chunk with the lowest entropy is returned. This avoids polluting the language signal with leading junk while keeping the implementation simple and predictable.
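
The chunk selection described above can be sketched in plain Java. This is only an illustration, not the library's code: the class name, the entropyBits scorer and the reuse of the 4.0-bit value as the "high entropy" cut-off are assumptions made for the example. Only the 5000-character chunk size and the lowest-entropy rule come from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of "try the next chunk while entropy is high,
// return the chunk with the lowest entropy". Not the Tika implementation.
final class ChunkSelectionSketch {
    static final int CHUNK_SIZE = 5000;           // chunk length from the class description
    static final double HIGH_ENTROPY_BITS = 4.0;  // assumption: same value as the junk threshold

    /** Splits text into consecutive, non-overlapping chunks of at most CHUNK_SIZE chars. */
    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i += CHUNK_SIZE) {
            out.add(text.substring(i, Math.min(text.length(), i + CHUNK_SIZE)));
        }
        return out;
    }

    /**
     * Scores chunks in order and stops at the first one that is not high-entropy.
     * Returns the index of the lowest-entropy chunk seen, or -1 if there are no chunks.
     */
    static int pickChunk(List<String> chunks, ToDoubleFunction<String> entropyBits) {
        int best = -1;
        double bestEntropy = Double.POSITIVE_INFINITY;
        for (int i = 0; i < chunks.size(); i++) {
            double h = entropyBits.applyAsDouble(chunks.get(i));
            if (h < bestEntropy) {
                bestEntropy = h;
                best = i;
            }
            if (h <= HIGH_ENTROPY_BITS) {
                break; // confident enough, later chunks are not evaluated
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in scorer: pretends the first chunk is junk and the rest is clean text.
        List<String> parts = Arrays.asList("junk", "text", "more");
        int chosen = pickChunk(parts, s -> s.equals("junk") ? 5.0 : 1.0);
        System.out.println("chosen chunk index: " + chosen);
    }
}
```

In this sketch a clean first chunk means no further chunk is scored, which matches the stated goal of skipping leading junk without evaluating the whole buffer.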

Inference uses raw logits throughout — no softmax distribution is ever computed. Confidence is based on the margin between the top two logits after confusable-group collapsing: sigmoid(top_logit − second_logit). This is invariant to the number of classes and provides a stable confidence signal from short snippets up to full documents. Per-class rawScore is sigmoid(logit_c − best_competitor_logit): the winner gets a value above 0.5, all others below.
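
The two formulas above can be written out as a short Java sketch. The class and method names are hypothetical, and confusable-group collapsing is omitted because its details are not given here. The code only demonstrates sigmoid(top_logit − second_logit) and the per-class rawScore.

```java
// Hypothetical sketch of margin-based confidence on raw logits.
// Confusable-group collapsing is intentionally left out.
final class MarginConfidenceSketch {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Confidence = sigmoid(top_logit - second_logit). Needs at least two classes. */
    static double confidence(double[] logits) {
        if (logits.length < 2) {
            throw new IllegalArgumentException("need at least two logits");
        }
        double top = Double.NEGATIVE_INFINITY;
        double second = Double.NEGATIVE_INFINITY;
        for (double v : logits) {
            if (v > top) {
                second = top;
                top = v;
            } else if (v > second) {
                second = v;
            }
        }
        return sigmoid(top - second);
    }

    /** rawScore for class c = sigmoid(logit_c - best competitor logit). */
    static double rawScore(double[] logits, int c) {
        if (logits.length < 2) {
            throw new IllegalArgumentException("need at least two logits");
        }
        double bestCompetitor = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < logits.length; i++) {
            if (i != c && logits[i] > bestCompetitor) {
                bestCompetitor = logits[i];
            }
        }
        return sigmoid(logits[c] - bestCompetitor);
    }

    public static void main(String[] args) {
        double[] logits = {2.0, 0.0, -1.0};
        System.out.println("confidence = " + confidence(logits));
        for (int c = 0; c < logits.length; c++) {
            System.out.println("rawScore[" + c + "] = " + rawScore(logits, c));
        }
    }
}
```

Because only the top two logits enter the margin, appending further low-scoring classes leaves the confidence unchanged, which is the invariance the description refers to.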

Field Summary

Fields inherited from class org.apache.tika.language.detect.LanguageDetector
mixedLanguages, shortText
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| CharSoupLanguageDetector() | Constructs a detector using the default classpath-loaded model. |
| CharSoupLanguageDetector(CharSoupModel customModel) | Constructs a detector that uses a caller-supplied model instead of the classpath default. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | addText(char[] cbuf, int off, int len) | Add statistics about this text for the current document. |
| <K> K | compareLanguageSignal(Map<K,String> candidates) | Compare multiple candidate texts and return the key of the one with the strongest language signal. |
| List<LanguageResult> | detectAll() | Detect languages based on previously submitted text (via addText calls). |
| float | getDistributionEntropy() | Returns the Shannon entropy (in bits) of the probability distribution from the most recent detectAll() call, or Float.NaN if detectAll() has not been called since the last reset(). |
| CharSoupModel | getModel() | Returns the model this detector instance is using for predictions. |
| static Set<String> | getSupportedLanguages() | Returns all language codes supported by the loaded model. |
| boolean | hasEnoughText() | Tell the caller whether more text is required for the current document before the language can be reliably detected. |
| boolean | hasModel(String language) | Provide information about whether a model exists for a specific language. |
| LanguageDetector | loadModels() | Load (or re-load) all available language models. |
| LanguageDetector | loadModels(Set<String> languages) | Load (or re-load) the models specified in languages. |
| void | reset() | Reset statistics about the current document being processed. |
| void | setMaxLength(int maxLength) | Sets the maximum text length (in characters) that will be buffered for detection. |
| LanguageDetector | setPriors(Map<String,Float> languageProbabilities) | Set the a-priori probabilities for these languages. |
| static List<String> | topShortTextLanguages(String text, int n) | Return the top n language codes from the short-text discriminative model, ranked by raw logit (descending). |

Methods inherited from class org.apache.tika.language.detect.LanguageDetector
addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, isMixedLanguages, isShortText, reset, setMixedLanguages, setShortText

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- CharSoupLanguageDetector
  
  public CharSoupLanguageDetector()
  
  Constructs a detector using the default classpath-loaded model.
- CharSoupLanguageDetector
  
  public CharSoupLanguageDetector(CharSoupModel customModel)
  
  Constructs a detector that uses a caller-supplied model instead of the classpath default. This ensures evaluations and comparisons always run against the intended model binary — not whatever happens to be on the classpath.
  
  Parameters:
  
  customModel - the model to use for all predictions
Method Details
- getDistributionEntropy
  
  public float getDistributionEntropy()
  
  Returns the Shannon entropy (in bits) of the probability distribution from the most recent detectAll() call, or Float.NaN if detectAll() has not been called since the last reset().
  This can be used as a junk/garbage detector: high entropy (> 4.0 bits) indicates the model has no confident prediction, which typically means the input is not natural language text.
  
  Returns:
  
  entropy in bits, or Float.NaN
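
For reference, Shannon entropy in bits over a probability vector can be computed as in the sketch below. How the detector derives its distribution is not described here, so the example simply takes probabilities as input. The class and method names are hypothetical.

```java
// Hypothetical helper: Shannon entropy in bits, H = -sum(p_i * log2(p_i)).
final class EntropySketch {
    static double entropyBits(double[] probabilities) {
        double h = 0.0;
        for (double p : probabilities) {
            if (p > 0.0) {                 // 0 * log(0) is treated as 0
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[] peaked = {0.97, 0.01, 0.01, 0.01};
        double[] flat = new double[32];
        java.util.Arrays.fill(flat, 1.0 / 32);
        System.out.println("peaked: " + entropyBits(peaked) + " bits");
        System.out.println("flat:   " + entropyBits(flat) + " bits");
    }
}
```

A uniform distribution over 16 classes has exactly 4 bits of entropy, so exceeding the 4.0-bit mark means the distribution is at least as spread out as a uniform guess among more than 16 languages.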
- compareLanguageSignal
  
  public <K> K compareLanguageSignal(Map<K,String> candidates)
  
  Compare multiple candidate texts and return the key of the one with the strongest language signal. Candidates with a high ratio of replacement or control characters are discarded first. Remaining candidates are scored using sigmoid(top_logit − second_logit) — the margin between the top two classes, invariant to the number of classes in the model.
  Returns null if no candidate exceeds the minimum confidence threshold, indicating the comparison is inconclusive.
  
  Type Parameters:
  
  K - key type (e.g., Charset)
  
  Parameters:
  
  candidates - map of arbitrary keys to candidate text strings
  
  Returns:
  
  the key whose text has the strongest language signal, or null if the map is empty or no candidate is confident enough
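
The filter-then-score flow described for compareLanguageSignal might look roughly like the following. Everything named here is an assumption for illustration: the two thresholds, the junkRatio helper and the marginConfidence scorer (standing in for the model's sigmoid margin) are hypothetical, and the real method may differ in its details.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: discard junk-heavy candidates, then pick the most confident one.
final class CompareSignalSketch {
    static final double MAX_JUNK_RATIO = 0.1;  // assumption, not a documented value
    static final double MIN_CONFIDENCE = 0.6;  // assumption, not a documented value

    /** Fraction of chars that are U+FFFD or non-whitespace control characters. */
    static double junkRatio(String s) {
        if (s.isEmpty()) {
            return 1.0;
        }
        int junk = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            boolean control = Character.isISOControl(ch) && !Character.isWhitespace(ch);
            if (ch == '\uFFFD' || control) {
                junk++;
            }
        }
        return (double) junk / s.length();
    }

    /** Returns the key of the most confident surviving candidate, or null if inconclusive. */
    static <K> K strongest(Map<K, String> candidates, ToDoubleFunction<String> marginConfidence) {
        K bestKey = null;
        double best = MIN_CONFIDENCE;
        for (Map.Entry<K, String> e : candidates.entrySet()) {
            if (junkRatio(e.getValue()) > MAX_JUNK_RATIO) {
                continue; // too many replacement/control characters
            }
            double c = marginConfidence.applyAsDouble(e.getValue());
            if (c > best) {
                best = c;
                bestKey = e.getKey();
            }
        }
        return bestKey;
    }

    public static void main(String[] args) {
        Map<String, String> decoded = new LinkedHashMap<>();
        decoded.put("UTF-8", "hello world");
        decoded.put("windows-1252", "h\uFFFD\uFFFDllo");
        // Stand-in scorer returning a fixed confidence for every candidate.
        System.out.println(strongest(decoded, s -> 0.9));
    }
}
```

Keys are generic, so a caller comparing decodings of the same bytes could use Charset keys, as the type parameter description suggests.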
- topShortTextLanguages
  
  public static List<String> topShortTextLanguages(String text, int n)
  
  Return the top n language codes from the short-text discriminative model, ranked by raw logit (descending).
  Unlike detectAll(), this method applies no entropy or confidence thresholds — it always returns the model's ranking even when the distribution is flat. This is useful for downstream generative-model confirmation on very short text (e.g. zip entry filenames) where the discriminative model alone is inconclusive but its top candidates still contain a useful language signal.
  
  Parameters:
  
  text - the decoded text to classify
  
  n - maximum number of language codes to return
  
  Returns:
  
  top language codes, or empty list if the short-text model is not loaded or text is empty
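
Ranking by raw logit with no thresholds amounts to a plain sort. The sketch below assumes the model exposes parallel arrays of labels and logits, which is an assumption for the example rather than the actual CharSoupModel API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: return up to n labels ordered by descending logit.
final class TopNSketch {
    static List<String> topN(String[] labels, float[] logits, int n) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            order.add(i);
        }
        // Sort class indices by logit, highest first. No confidence cut-off is applied.
        order.sort(Comparator.comparingDouble((Integer i) -> logits[i]).reversed());
        List<String> out = new ArrayList<>();
        for (int k = 0; k < Math.min(n, order.size()); k++) {
            out.add(labels[order.get(k)]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] labels = {"eng", "deu", "fra"};
        float[] logits = {1.5f, 3.0f, -0.2f};
        System.out.println(topN(labels, logits, 2));
    }
}
```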
- loadModels
  
  public LanguageDetector loadModels() throws IOException
  
  Description copied from class: LanguageDetector
  
  Load (or re-load) all available language models. This must be called after any settings that would impact the models being loaded (e.g. mixed language/short text), but before any of the document processing routines (below) are called. Note that it only needs to be called once.
  
  Specified by:
  
  loadModels in class LanguageDetector
  
  Returns:
  
  this
  
  Throws:
  
  IOException
- loadModels
  
  public LanguageDetector loadModels(Set<String> languages) throws IOException
  
  Description copied from class: LanguageDetector
  
  Load (or re-load) the models specified in languages. These use the ISO 639-1 names, optionally followed by "-" and a country code for a more specific variant (e.g. "zh-CN" for Chinese in China).
  
  Specified by:
  
  loadModels in class LanguageDetector
  
  Parameters:
  
  languages - set of target languages.
  
  Returns:
  
  this
  
  Throws:
  
  IOException
- hasModel
  
  public boolean hasModel(String language)
  
  Description copied from class: LanguageDetector
  
  Provide information about whether a model exists for a specific language.
  
  Specified by:
  
  hasModel in class LanguageDetector
  
  Parameters:
  
  language - ISO 639-1 name for language
  
  Returns:
  
  true if a model for this language exists.
- getSupportedLanguages
  
  public static Set<String> getSupportedLanguages()
  
  Returns all language codes supported by the loaded model.
  
  Returns:
  
  unmodifiable set of ISO 639-3 language codes
- getModel
  
  public CharSoupModel getModel()
  
  Returns the model this detector instance is using for predictions. Useful for verification in evaluation tools.
- setMaxLength
  
  public void setMaxLength(int maxLength)
  
  Sets the maximum text length (in characters) that will be buffered for detection. Text beyond this limit is silently discarded.
  The default limit is CharSoupFeatureExtractor.MAX_TEXT_LENGTH (100,000 characters).
  
  Parameters:
  
  maxLength - maximum number of characters to buffer
- setPriors
  
  public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
  
  Description copied from class: LanguageDetector
  
  Set the a-priori probabilities for these languages. The provided map uses the language as the key and, as the value, the probability (0.0 < probability < 1.0) of text being in that language. If the probabilities don't sum to 1.0, the values are normalized.
  If hasModel() returns false for any of the languages, an IllegalArgumentException is thrown.
  Use of these probabilities is detector-specific, and thus might not impact the results at all. As such, these should be viewed as a hint.
  
  Specified by:
  
  setPriors in class LanguageDetector
  
  Parameters:
  
  languageProbabilities - Map from language to probability
  
  Returns:
  
  this
  
  Throws:
  
  IOException
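
The normalization mentioned for setPriors can be illustrated as follows. This is a sketch of the arithmetic only: the class name is hypothetical, the hasModel() check is left out, and, as noted above, a given detector may not use the priors at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rescale priors so that they sum to 1.0.
final class PriorsSketch {
    static Map<String, Float> normalize(Map<String, Float> priors) {
        double sum = 0.0;
        for (float p : priors.values()) {
            if (!(p > 0.0f)) { // also rejects NaN
                throw new IllegalArgumentException("prior must be positive: " + p);
            }
            sum += p;
        }
        Map<String, Float> out = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : priors.entrySet()) {
            out.put(e.getKey(), (float) (e.getValue() / sum));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> priors = new LinkedHashMap<>();
        priors.put("en", 0.2f);
        priors.put("de", 0.1f);
        priors.put("fr", 0.1f);
        System.out.println(normalize(priors));
    }
}
```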
- reset
  
  public void reset()
  
  Description copied from class: LanguageDetector
  
  Reset statistics about the current document being processed.
  
  Specified by:
  
  reset in class LanguageDetector
- addText
  
  public void addText(char[] cbuf, int off, int len)
  
  Description copied from class: LanguageDetector
  
  Add statistics about this text for the current document. Note that we assume an implicit word break exists before/after each of these runs of text.
  
  Specified by:
  
  addText in class LanguageDetector
  
  Parameters:
  
  cbuf - Character buffer
  
  off - Offset into cbuf to first character in the run of text
  
  len - Number of characters in the run of text.
- hasEnoughText
  
  public boolean hasEnoughText()
  
  Description copied from class: LanguageDetector
  
  Tell the caller whether more text is required for the current document before the language can be reliably detected.
  Implementations can override this to do early termination of stats collection, which can improve performance with longer documents.
  Note that detect() can be called even when this returns false.
  
  Overrides:
  
  hasEnoughText in class LanguageDetector
  
  Returns:
  
  true if we have enough text for reliable detection.
- detectAll
  
  public List<LanguageResult> detectAll()
  
  Description copied from class: LanguageDetector
  
  Detect languages based on previously submitted text (via addText calls).
  
  Specified by:
  
  detectAll in class LanguageDetector
  
  Returns:
  
  list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest. There will always be at least one result, which might have a confidence of NONE.
