Package org.apache.tika.langdetect.tika
Class LanguageIdentifier
java.lang.Object
org.apache.tika.langdetect.tika.LanguageIdentifier
Identifier of the language that best matches a given content profile.
The content profile is compared to generic language profiles based on
material from various sources.
- Since:
- Apache Tika 0.5
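As a rough illustration of what "best matches" means here, the following is a minimal sketch, assuming each candidate language already has a distance to the content profile and that the smallest distance wins. The class BestMatchSketch, the method bestMatch, the "unknown" fallback, and the sample distances are hypothetical and not part of the Tika API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BestMatchSketch {

    /**
     * Returns the language code with the smallest distance to the content
     * profile, or "unknown" when no candidates are given.
     * Hypothetical helper for illustration only, not a Tika method.
     */
    static String bestMatch(Map<String, Double> distanceByLanguage) {
        String best = "unknown";
        double minDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Double> entry : distanceByLanguage.entrySet()) {
            if (entry.getValue() < minDistance) {
                minDistance = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up distances between a content profile and three language profiles.
        Map<String, Double> distances = new LinkedHashMap<>();
        distances.put("en", 0.015);
        distances.put("de", 0.040);
        distances.put("fr", 0.031);
        System.out.println(bestMatch(distances)); // prints "en"
    }
}
```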
Constructor Summary
Constructor | Description
LanguageIdentifier(String content) | Constructs a language identifier based on a String of text content
LanguageIdentifier(LanguageProfile profile) | Constructs a language identifier based on a LanguageProfile
-
Method Summary
Modifier and Type | Method | Description
static void | addProfile(String language, LanguageProfile profile) | Adds a single language profile
static void | clearProfiles() | Clears the current map of language profiles
static String | getErrors() | Returns a string of error messages related to initializing language profiles
String | getLanguage() | Gets the identified language
float | getRawScore() | 1 - vector distance between the language model and the content
static Set<String> | getSupportedLanguages() | Returns what languages are supported for language identification
static boolean | hasErrors() | Tests whether there were errors initializing language config
static void | initProfiles() | Builds the language profiles.
static void | initProfiles(Map<String, LanguageProfile> profilesMap) | Initializes the language profiles from a user-supplied, initialized Map.
boolean | isReasonablyCertain() | Tries to judge whether the identification is certain enough to be trusted.
String | toString() |
-
Constructor Details
-
LanguageIdentifier
public LanguageIdentifier(LanguageProfile profile)
Constructs a language identifier based on a LanguageProfile
- Parameters:
profile - the language profile
-
LanguageIdentifier
public LanguageIdentifier(String content)
Constructs a language identifier based on a String of text content
- Parameters:
content - the text
-
-
Method Details
-
addProfile
public static void addProfile(String language, LanguageProfile profile)
Adds a single language profile
- Parameters:
language - an ISO 639 code representing the language
profile - the language profile
-
initProfiles
public static void initProfiles()
Builds the language profiles. The list of languages is fetched from a property file named "tika.language.properties". If a file called "tika.language.override.properties" is found on the classpath, it is used instead. The property file contains a key "languages" whose value is a comma-separated list of language codes.
-
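The property-file format described for initProfiles() can be illustrated with the standard java.util.Properties class. This is a minimal sketch, assuming only what the description states: a "languages" key holding comma-separated codes. The class LanguageListSketch, the method parseLanguages, and the sample file contents are hypothetical and do not reproduce Tika's own loading code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class LanguageListSketch {

    /** Splits the comma-separated "languages" value into trimmed language codes. */
    static List<String> parseLanguages(Properties props) {
        List<String> codes = new ArrayList<>();
        String value = props.getProperty("languages", "");
        for (String code : value.split(",")) {
            String trimmed = code.trim();
            if (!trimmed.isEmpty()) {
                codes.add(trimmed);
            }
        }
        return codes;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative contents only; the real file ships with Tika.
        String fileContents = "languages=en,de, fr\n";
        Properties props = new Properties();
        props.load(new StringReader(fileContents));
        System.out.println(parseLanguages(props)); // prints [en, de, fr]
    }
}
```

An override file would use the same format, so the same parsing applies to whichever of the two files is picked up.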
initProfiles
public static void initProfiles(Map<String, LanguageProfile> profilesMap)
Initializes the language profiles from a user-supplied, initialized Map. This overrides the default set of profiles initialized at startup, and provides an alternative to configuring profiles through a property file.
- Parameters:
profilesMap - map of language profiles
-
clearProfiles
public static void clearProfiles()
Clears the current map of language profiles
-
hasErrors
public static boolean hasErrors()
Tests whether there were errors initializing language config
- Returns:
- true if there are errors. Use getErrors() to retrieve them.
-
getErrors
public static String getErrors()
Returns a string of error messages related to initializing language profiles
- Returns:
- the String containing the error messages
-
getSupportedLanguages
public static Set<String> getSupportedLanguages()
Returns what languages are supported for language identification
- Returns:
- A set of Strings being the ISO 639 language codes
-
getLanguage
public String getLanguage()
Gets the identified language
- Returns:
- an ISO 639 code representing the detected language
-
getRawScore
public float getRawScore()
Returns the raw score: 1 minus the vector distance between the language model and the content.
-
isReasonablyCertain
public boolean isReasonablyCertain()
Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amounts of input text.
- Returns:
- true if the distance is smaller than 0.022, false otherwise
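The relationship between the raw score and the certainty check can be sketched as follows. This is a minimal illustration, assuming only the two statements in the descriptions: the raw score is 1 minus the distance, and the result counts as certain when the distance is below 0.022. The class CertaintySketch, its method and constant names, and the sample distances are hypothetical stand-ins and not the library's implementation.

```java
class CertaintySketch {

    // Threshold quoted in the isReasonablyCertain() description.
    static final double CERTAINTY_THRESHOLD = 0.022;

    /** Raw score as described for getRawScore(): 1 minus the distance. */
    static float rawScore(double distance) {
        return (float) (1.0 - distance);
    }

    /** True only when the distance is strictly below the threshold. */
    static boolean isReasonablyCertain(double distance) {
        return distance < CERTAINTY_THRESHOLD;
    }

    public static void main(String[] args) {
        double close = 0.015; // made-up distance for a good match
        double far = 0.040;   // made-up distance for a weak match
        System.out.println(isReasonablyCertain(close)); // prints true
        System.out.println(isReasonablyCertain(far));   // prints false
        System.out.println(rawScore(close) > rawScore(far)); // prints true
    }
}
```

Under these assumptions a smaller distance always means a higher raw score, so a result is "reasonably certain" exactly when its raw score is above 1 - 0.022 = 0.978.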
-
toString
-