Package org.apache.tika.langdetect.tika
Class LanguageIdentifier
- java.lang.Object
  - org.apache.tika.langdetect.tika.LanguageIdentifier
public class LanguageIdentifier extends Object
Identifier of the language that best matches a given content profile. The content profile is compared to generic language profiles based on material from various sources.
- Since:
- Apache Tika 0.5
- See Also:
- Europarl: A Parallel Corpus for Statistical Machine Translation, ISO 639 Language Codes
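The comparison of a content profile against the language profiles can be pictured as a nearest-profile search. The following is a minimal sketch and not Tika's implementation: it assumes that a profile is a map of character trigram counts and that the metric is the Euclidean distance between relative frequencies. The names ProfileMatchSketch, trigramCounts, distance and bestMatch are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: not Tika's LanguageProfile or LanguageIdentifier.
class ProfileMatchSketch {

    // Counts overlapping character trigrams (assumed profile shape).
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Total number of trigrams in a profile, never less than 1 to avoid division by zero.
    static double total(Map<String, Integer> profile) {
        int sum = 0;
        for (int count : profile.values()) {
            sum += count;
        }
        return Math.max(1, sum);
    }

    // Euclidean distance between the relative-frequency vectors (assumed metric).
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = total(a);
        double totalB = total(b);
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String ngram : ngrams) {
            double diff = a.getOrDefault(ngram, 0) / totalA
                    - b.getOrDefault(ngram, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Returns the key of the profile closest to the content, or "unknown" if there are none.
    static String bestMatch(String content, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> contentProfile = trigramCounts(content);
        String best = "unknown";
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
            double d = distance(contentProfile, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In this sketch the profiles map is keyed by language code, so bestMatch returns the code of the closest profile.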
-
-
Constructor Summary
Constructors
Constructor Description
LanguageIdentifier(String content)
Constructs a language identifier based on a String of text content
LanguageIdentifier(LanguageProfile profile)
Constructs a language identifier based on a LanguageProfile
-
Method Summary
Modifier and Type Method Description
static void
addProfile(String language, LanguageProfile profile)
Adds a single language profile
static void
clearProfiles()
Clears the current map of language profiles
static String
getErrors()
Returns a string of error messages related to initializing language profiles
String
getLanguage()
Gets the identified language
float
getRawScore()
1 - vector distance between the language model and the content
static Set<String>
getSupportedLanguages()
Returns what languages are supported for language identification
static boolean
hasErrors()
Tests whether there were errors initializing language config
static void
initProfiles()
Builds the language profiles.
static void
initProfiles(Map<String,LanguageProfile> profilesMap)
Initializes the language profiles from a user supplied initialized Map.
boolean
isReasonablyCertain()
Tries to judge whether the identification is certain enough to be trusted.
String
toString()
-
-
-
Constructor Detail
-
LanguageIdentifier
public LanguageIdentifier(LanguageProfile profile)
Constructs a language identifier based on a LanguageProfile
- Parameters:
profile - the language profile
-
LanguageIdentifier
public LanguageIdentifier(String content)
Constructs a language identifier based on a String of text content
- Parameters:
content - the text
-
-
Method Detail
-
addProfile
public static void addProfile(String language, LanguageProfile profile)
Adds a single language profile
- Parameters:
language - an ISO 639 code representing the language
profile - the language profile
-
initProfiles
public static void initProfiles()
Builds the language profiles. The list of languages is fetched from a property file named "tika.language.properties". If a file called "tika.language.override.properties" is found on the classpath, it is used instead. The property file contains a key "languages" whose value is a comma-separated list of language codes.
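As a rough illustration of the file format, the sketch below reads the "languages" key with java.util.Properties and splits it on commas. It is not Tika's loading code: the class LanguageListSketch and its methods are hypothetical, the properties text is passed in as a String rather than read from the classpath, and trimming whitespace around each code is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Illustrative only: not the code Tika uses to read its property files.
class LanguageListSketch {

    // Parses the "languages" key into an ordered set of trimmed codes.
    static Set<String> parseLanguages(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Set<String> codes = new LinkedHashSet<>();
        for (String code : props.getProperty("languages", "").split(",")) {
            String trimmed = code.trim();
            if (!trimmed.isEmpty()) {
                codes.add(trimmed);
            }
        }
        return codes;
    }

    // Uses the override text when it is present, otherwise the default text.
    static Set<String> resolve(String defaultText, String overrideText) {
        return parseLanguages(overrideText != null ? overrideText : defaultText);
    }
}
```

Here a null overrideText stands in for "no override file found on the classpath"; a file containing `languages=en, de,fr` would yield the three codes en, de and fr.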
-
initProfiles
public static void initProfiles(Map<String,LanguageProfile> profilesMap)
Initializes the language profiles from a user-supplied, initialized Map. This overrides the default set of profiles initialized at startup and provides an alternative to configuring profiles through a property file.
- Parameters:
profilesMap - map of language profiles
-
clearProfiles
public static void clearProfiles()
Clears the current map of language profiles
-
hasErrors
public static boolean hasErrors()
Tests whether there were errors initializing the language config
- Returns:
- true if there are errors; use getErrors() to retrieve them
-
getErrors
public static String getErrors()
Returns a string of error messages related to initializing language profiles
- Returns:
- the String containing the error messages
-
getSupportedLanguages
public static Set<String> getSupportedLanguages()
Returns the languages that are supported for language identification
- Returns:
- a set of Strings containing the ISO 639 language codes
-
getLanguage
public String getLanguage()
Gets the identified language
- Returns:
- an ISO 639 code representing the detected language
-
getRawScore
public float getRawScore()
Returns 1 minus the vector distance between the language model and the content
- Returns:
- the raw score (1 minus the vector distance)
-
isReasonablyCertain
public boolean isReasonablyCertain()
Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amounts of input text.
- Returns:
- true if the distance is smaller than 0.022, false otherwise
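The relationship between the distance, the raw score and the certainty check can be sketched as follows. This is a stand-alone illustration built only from the numbers quoted on this page (score = 1 - distance, threshold 0.022); the class CertaintySketch and its members are hypothetical and are not part of LanguageIdentifier.

```java
// Illustrative only: mirrors the documented rules, not Tika's source.
class CertaintySketch {

    // Threshold quoted in the isReasonablyCertain() description.
    static final double CERTAINTY_THRESHOLD = 0.022;

    // Raw score as described for getRawScore(): 1 minus the distance.
    static float rawScore(double distance) {
        return (float) (1.0 - distance);
    }

    // True only when the distance is strictly below the threshold.
    static boolean isReasonablyCertain(double distance) {
        return distance < CERTAINTY_THRESHOLD;
    }
}
```

Under these rules a result counts as reasonably certain only when its raw score is above 0.978, which is consistent with the warning that small amounts of input text never qualify.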
-
-