Class LanguageIdentifier

    • Constructor Detail

      • LanguageIdentifier

        public LanguageIdentifier(LanguageProfile profile)
        Constructs a language identifier based on a LanguageProfile
        Parameters:
        profile - the language profile
      • LanguageIdentifier

        public LanguageIdentifier(String content)
        Constructs a language identifier based on a String of text content
        Parameters:
        content - the text
    • Method Detail

      • addProfile

        public static void addProfile(String language,
                                      LanguageProfile profile)
        Adds a single language profile
        Parameters:
        language - an ISO 639 code representing the language
        profile - the language profile
      • initProfiles

        public static void initProfiles()
        Builds the language profiles. The list of languages is fetched from a property file named "tika.language.properties". If a file called "tika.language.override.properties" is found on the classpath, it is used instead. The property file contains a key "languages" whose value is a comma-separated list of language codes.
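        The lookup described above can be sketched with the standard java.util.Properties API. This is a minimal illustration, not Tika's actual implementation: the class LanguageListSketch and its methods are hypothetical, and the assumption that both files sit at the classpath root is ours.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Tika.
final class LanguageListSketch {

    static final String DEFAULT_FILE = "tika.language.properties";
    static final String OVERRIDE_FILE = "tika.language.override.properties";

    // Picks the override file when it is visible to the class loader,
    // otherwise the default file. The classpath-root location is an assumption.
    static String chooseResource(ClassLoader loader) {
        return loader.getResource(OVERRIDE_FILE) != null ? OVERRIDE_FILE : DEFAULT_FILE;
    }

    // Splits the comma-separated value of the "languages" key into codes,
    // trimming whitespace and skipping empty entries.
    static List<String> languagesFrom(Properties props) {
        List<String> codes = new ArrayList<>();
        for (String code : props.getProperty("languages", "").split(",")) {
            String trimmed = code.trim();
            if (!trimmed.isEmpty()) {
                codes.add(trimmed);
            }
        }
        return codes;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader("languages=da,de,en,fr\n"));
        System.out.println(languagesFrom(props)); // prints [da, de, en, fr]
    }
}
```

        A property file in this shape therefore needs only one line, for example languages=da,de,en,fr.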
      • initProfiles

        public static void initProfiles(Map<String, LanguageProfile> profilesMap)
        Initializes the language profiles from a user-supplied, already initialized Map. This overrides the default set of profiles initialized at startup and provides an alternative to configuring profiles through a property file.
        Parameters:
        profilesMap - map of language profiles
      • clearProfiles

        public static void clearProfiles()
        Clears the current map of language profiles
      • hasErrors

        public static boolean hasErrors()
        Tests whether errors occurred while initializing the language configuration
        Returns:
        true if there are errors; use getErrors() to retrieve them
      • getErrors

        public static String getErrors()
        Returns a string of error messages related to initializing language profiles
        Returns:
        the String containing the error messages
      • getSupportedLanguages

        public static Set<String> getSupportedLanguages()
        Returns the languages that are supported for language identification
        Returns:
        a set of ISO 639 language codes
      • getLanguage

        public String getLanguage()
        Gets the identified language
        Returns:
        an ISO 639 code representing the detected language
      • getRawScore

        public float getRawScore()
        Gets the raw score of the identification: 1 minus the vector distance between the language model and the content
        Returns:
        the raw score
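        To make the "1 - vector distance" definition concrete, here is a small sketch. This page does not say which distance metric or n-gram size is used, so the sketch assumes Euclidean distance between relative character-trigram frequencies. The class RawScoreSketch and all of its methods are hypothetical and are not part of Tika.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika.
final class RawScoreSketch {

    // Counts overlapping character trigrams in the text (trigram size is an assumption).
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean distance between the relative-frequency vectors of two profiles
    // (assumed metric; the documentation only says "vector distance").
    static double distance(Map<String, Integer> model, Map<String, Integer> content) {
        double modelTotal = Math.max(total(model), 1);
        double contentTotal = Math.max(total(content), 1);
        Set<String> ngrams = new HashSet<>(model.keySet());
        ngrams.addAll(content.keySet());
        double sumOfSquares = 0.0;
        for (String ngram : ngrams) {
            double diff = model.getOrDefault(ngram, 0) / modelTotal
                    - content.getOrDefault(ngram, 0) / contentTotal;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Raw score as documented: 1 minus the distance.
    static double rawScore(Map<String, Integer> model, Map<String, Integer> content) {
        return 1.0 - distance(model, content);
    }

    private static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int count : counts.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> model = trigramCounts("the quick brown fox jumps over the lazy dog");
        System.out.println(rawScore(model, model)); // identical profiles: prints 1.0
        System.out.println(rawScore(model, trigramCounts("zzzzzz")));
    }
}
```

        Under these assumptions, identical profiles score exactly 1.0 and the score falls as the profiles diverge.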
      • isReasonablyCertain

        public boolean isReasonablyCertain()
        Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amounts of input text.
        Returns:
        true if the distance is smaller than 0.022, false otherwise
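        The threshold check can be expressed in a few lines. The sketch below uses the 0.022 value from the description above and the "1 - distance" definition of the raw score. The class CertaintySketch, its method names, and the fallback helper are hypothetical illustrations rather than Tika code.

```java
// Hypothetical helper, not part of Tika.
final class CertaintySketch {

    // Threshold quoted in the documentation above.
    static final double CERTAINTY_THRESHOLD = 0.022;

    // Certain only when the distance is strictly below the threshold.
    static boolean isReasonablyCertain(double distance) {
        return distance < CERTAINTY_THRESHOLD;
    }

    // The raw score is documented as 1 - distance, so distance = 1 - rawScore.
    static boolean isReasonablyCertainFromScore(double rawScore) {
        return isReasonablyCertain(1.0 - rawScore);
    }

    // Illustrative caller-side pattern: fall back to a default language
    // when the identification is not certain enough.
    static String languageOrDefault(String detected, double distance, String fallback) {
        return isReasonablyCertain(distance) ? detected : fallback;
    }

    public static void main(String[] args) {
        System.out.println(isReasonablyCertain(0.01));            // prints true
        System.out.println(isReasonablyCertain(0.3));             // prints false
        System.out.println(languageOrDefault("fr", 0.3, "en"));   // prints en
    }
}
```

        Because of the warning about short input, callers may want a fallback like languageOrDefault instead of trusting getLanguage() unconditionally.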