org.apache.tika.language
Class LanguageIdentifier

java.lang.Object
  extended by org.apache.tika.language.LanguageIdentifier

public class LanguageIdentifier
extends Object

Identifier of the language that best matches a given content profile. The content profile is compared to generic language profiles based on material from various sources.
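
As a rough illustration of comparing a content profile against reference language profiles, the sketch below counts character trigrams and picks the reference with the smallest distance. It is not Tika's implementation: the class TrigramProfileSketch, its methods, the use of trigrams, the Euclidean distance and the tiny sample sentences are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of org.apache.tika.language.
final class TrigramProfileSketch {

    // Counts overlapping three-character sequences in the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Sums all trigram counts; returns at least 1 to avoid division by zero.
    private static double total(Map<String, Integer> p) {
        int sum = 0;
        for (int c : p.values()) {
            sum += c;
        }
        return Math.max(1, sum);
    }

    // Euclidean distance between the relative trigram frequencies of two profiles.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = total(a);
        double totalB = total(b);
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (String k : keys) {
            double d = a.getOrDefault(k, 0) / totalA - b.getOrDefault(k, 0) / totalB;
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the key of the reference profile closest to the content profile,
    // or null if no reference profiles are given.
    static String closest(Map<String, Integer> content,
                          Map<String, Map<String, Integer>> references) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : references.entrySet()) {
            double d = distance(content, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy reference profiles built from single sentences (assumed sample data).
        Map<String, Map<String, Integer>> refs = new HashMap<>();
        refs.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat"));
        refs.put("de", profile("der schnelle braune fuchs springt weit und der hund schaut dem fuchs zu"));
        System.out.println(closest(profile("the dog and the fox"), refs));
    }
}
```

Real reference profiles would be built from far larger corpora than a single sentence; the toy data here only keeps the example short.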

Since:
Apache Tika 0.5
See Also:
Europarl: A Parallel Corpus for Statistical Machine Translation, ISO 639 Language Codes

Constructor Summary
LanguageIdentifier(LanguageProfile profile)
          Constructs a language identifier based on a LanguageProfile
LanguageIdentifier(String content)
          Constructs a language identifier based on a String of text content
 
Method Summary
static void addProfile(String language, LanguageProfile profile)
          Adds a single language profile
static void clearProfiles()
          Clears the current map of language profiles
static String getErrors()
          Returns a string of error messages related to initializing language profiles
 String getLanguage()
          Gets the identified language
static Set<String> getSupportedLanguages()
          Returns what languages are supported for language identification
static boolean hasErrors()
          Tests whether there were errors initializing language config
static void initProfiles()
          Builds the language profiles.
static void initProfiles(Map<String,LanguageProfile> profilesMap)
          Initializes the language profiles from a user-supplied, initialized Map.
 boolean isReasonablyCertain()
          Tries to judge whether the identification is certain enough to be trusted.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

LanguageIdentifier

public LanguageIdentifier(LanguageProfile profile)
Constructs a language identifier based on a LanguageProfile

Parameters:
profile - the language profile

LanguageIdentifier

public LanguageIdentifier(String content)
Constructs a language identifier based on a String of text content

Parameters:
content - the text
Method Detail

addProfile

public static void addProfile(String language,
                              LanguageProfile profile)
Adds a single language profile

Parameters:
language - an ISO 639 code representing the language
profile - the language profile

getLanguage

public String getLanguage()
Gets the identified language

Returns:
an ISO 639 code representing the detected language

isReasonablyCertain

public boolean isReasonablyCertain()
Tries to judge whether the identification is certain enough to be trusted. WARNING: Never returns true for small amounts of input text.

Returns:
true if the distance is smaller than the certainty threshold, false otherwise
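
Because the check can fail for short input, callers may want a fallback when the result is not trusted. The sketch below shows one possible caller-side pattern. The class CertaintyFallbackSketch and its method are hypothetical, the two suppliers stand in for getLanguage() and isReasonablyCertain(), and "und" (the ISO 639-2 code for an undetermined language) is only an illustrative fallback value.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper; not part of org.apache.tika.language.
final class CertaintyFallbackSketch {

    // Returns the detected code only when the certainty check passes,
    // otherwise the caller-supplied fallback.
    static String languageOrDefault(Supplier<String> language,
                                    BooleanSupplier certain,
                                    String fallback) {
        return certain.getAsBoolean() ? language.get() : fallback;
    }

    public static void main(String[] args) {
        // With a LanguageIdentifier instance named identifier (assumed), the call
        // could look like:
        //   languageOrDefault(identifier::getLanguage, identifier::isReasonablyCertain, "und")
        System.out.println(languageOrDefault(() -> "fr", () -> false, "und")); // prints "und"
        System.out.println(languageOrDefault(() -> "fr", () -> true, "und"));  // prints "fr"
    }
}
```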

initProfiles

public static void initProfiles()
Builds the language profiles. The list of languages is fetched from a property file named "tika.language.properties". If a file called "tika.language.override.properties" is found on the classpath, it is used instead. The property file contains a key "languages" whose value is a comma-separated list of language codes.
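
To make the property format concrete, the following sketch reads a "languages" key and splits its value into codes using java.util.Properties. It only illustrates the file format described above and is not Tika's loading code: the class LanguageListSketch, its method and the sample codes are assumptions, and the classpath lookup of the two property files is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration only; not part of org.apache.tika.language.
final class LanguageListSketch {

    // Parses property-file text and returns the trimmed, non-empty codes
    // listed under the "languages" key (empty list if the key is missing).
    static List<String> readLanguages(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> codes = new ArrayList<>();
        for (String code : props.getProperty("languages", "").split(",")) {
            String trimmed = code.trim();
            if (!trimmed.isEmpty()) {
                codes.add(trimmed);
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        // Sample content in the shape described for tika.language.properties.
        System.out.println(readLanguages("languages=en,de, fr\n")); // prints [en, de, fr]
    }
}
```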


initProfiles

public static void initProfiles(Map<String,LanguageProfile> profilesMap)
Initializes the language profiles from a user-supplied, initialized Map. This overrides the default set of profiles initialized at startup and provides an alternative to configuring profiles through a property file.

Parameters:
profilesMap - map of language profiles

clearProfiles

public static void clearProfiles()
Clears the current map of language profiles


hasErrors

public static boolean hasErrors()
Tests whether there were errors initializing language config

Returns:
true if there are errors; use getErrors() to retrieve them

getErrors

public static String getErrors()
Returns a string of error messages related to initializing language profiles

Returns:
the String containing the error messages

getSupportedLanguages

public static Set<String> getSupportedLanguages()
Returns what languages are supported for language identification

Returns:
a set of ISO 639 language codes

toString

public String toString()
Overrides:
toString in class Object


Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.