org.apache.tika.language
Class LanguageIdentifier

java.lang.Object
  extended by org.apache.tika.language.LanguageIdentifier

public class LanguageIdentifier
extends java.lang.Object

Identifier of the language that best matches a given content profile. The content profile is compared to generic language profiles based on material from various sources.
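
The comparison described above can be pictured with a minimal sketch. It is not Tika's implementation: the class name NgramProfileSketch, the use of character trigrams, and the Euclidean distance over relative frequencies are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names and algorithm are assumptions, not Tika's code. */
final class NgramProfileSketch {

    /** Counts overlapping character trigrams in the lower-cased text. */
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Euclidean distance between the relative trigram frequencies of two profiles. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = total(a);
        double totalB = total(b);
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (String key : keys) {
            double fa = totalA == 0 ? 0.0 : a.getOrDefault(key, 0) / totalA;
            double fb = totalB == 0 ? 0.0 : b.getOrDefault(key, 0) / totalB;
            sum += (fa - fb) * (fa - fb);
        }
        return Math.sqrt(sum);
    }

    /** Sums all trigram counts in a profile. */
    private static double total(Map<String, Integer> counts) {
        double total = 0.0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    /** Returns the code of the closest profile, or "unknown" if none are registered. */
    static String closest(Map<String, Integer> content,
                          Map<String, Map<String, Integer>> profiles) {
        String best = "unknown";
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
            double d = distance(content, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In this sketch, a caller would build one trigram map per language code from sample text, then pass the trigram map of the content to closest(...). A very short content string produces few trigrams, so its distance to every language profile stays large.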

Since:
Apache Tika 0.5
See Also:
Europarl: A Parallel Corpus for Statistical Machine Translation, ISO 639 Language Codes

Constructor Summary
LanguageIdentifier(LanguageProfile profile)
          Constructs a language identifier based on a LanguageProfile
LanguageIdentifier(java.lang.String content)
          Constructs a language identifier based on a String of text content
 
Method Summary
static void addProfile(java.lang.String language, LanguageProfile profile)
          Adds a single language profile
static void clearProfiles()
          Clears the current map of language profiles
static java.lang.String getErrors()
          Returns a string of error messages related to initializing language profiles
 java.lang.String getLanguage()
          Gets the identified language
static java.util.Set<java.lang.String> getSupportedLanguages()
          Returns what languages are supported for language identification
static boolean hasErrors()
          Tests whether there were errors initializing the language configuration
static void initProfiles()
          Builds the language profiles.
static void initProfiles(java.util.Map<java.lang.String,LanguageProfile> profilesMap)
          Initializes the language profiles from a user-supplied, initialized Map. This overrides the default set of profiles initialized at startup and provides an alternative to configuring profiles through a property file.
 boolean isReasonablyCertain()
          Tries to judge whether the identification is certain enough to be trusted.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

LanguageIdentifier

public LanguageIdentifier(LanguageProfile profile)
Constructs a language identifier based on a LanguageProfile

Parameters:
profile - the content profile to match against the known language profiles

LanguageIdentifier

public LanguageIdentifier(java.lang.String content)
Constructs a language identifier based on a String of text content

Parameters:
content - the text content whose language is to be identified
Method Detail

addProfile

public static void addProfile(java.lang.String language,
                              LanguageProfile profile)
Adds a single language profile

Parameters:
language - an ISO 639 code representing the language
profile - the language profile to register for that language

getLanguage

public java.lang.String getLanguage()
Gets the identified language

Returns:
an ISO 639 code representing the detected language

isReasonablyCertain

public boolean isReasonablyCertain()
Tries to judge whether the identification is certain enough to be trusted. WARNING: Never returns true for small amounts of input text.

Returns:
true if the identification is judged certain enough to be trusted, false otherwise

initProfiles

public static void initProfiles()
Builds the language profiles. The list of languages is fetched from a property file named "tika.language.properties". If a file called "tika.language.override.properties" is found on the classpath, it is used instead. The property file contains a key "languages" whose value is a comma-separated list of language codes.
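
The property file format described above can be read with java.util.Properties. The following is a minimal sketch of that step, not Tika's own loading code: the class name LanguageListSketch and the method parseLanguages are hypothetical, and trimming whitespace around each code is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative sketch only; names here are hypothetical, not part of Tika. */
final class LanguageListSketch {

    /** Extracts the comma-separated codes stored under the "languages" key. */
    static List<String> parseLanguages(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        List<String> codes = new ArrayList<>();
        String value = props.getProperty("languages", "");
        for (String code : value.split(",")) {
            String trimmed = code.trim(); // assumption: surrounding whitespace is ignored
            if (!trimmed.isEmpty()) {
                codes.add(trimmed);
            }
        }
        return codes;
    }
}
```

A file body such as "languages=en, de,fr" would yield the three codes en, de and fr in this sketch; a file without the "languages" key yields an empty list.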


initProfiles

public static void initProfiles(java.util.Map<java.lang.String,LanguageProfile> profilesMap)
Initializes the language profiles from a user-supplied, initialized Map. This overrides the default set of profiles initialized at startup and provides an alternative to configuring profiles through a property file.


clearProfiles

public static void clearProfiles()
Clears the current map of language profiles


hasErrors

public static boolean hasErrors()
Tests whether there were errors initializing the language configuration

Returns:
true if there are errors. Use getErrors() to retrieve them.

getErrors

public static java.lang.String getErrors()
Returns a string of error messages related to initializing language profiles

Returns:
the error messages collected while initializing the language profiles

getSupportedLanguages

public static java.util.Set<java.lang.String> getSupportedLanguages()
Returns what languages are supported for language identification

Returns:
a set of strings containing the ISO 639 codes of the supported languages

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object


Copyright © 2007-2010 The Apache Software Foundation. All Rights Reserved.