Class LanguageProfilerBuilder


  • public class LanguageProfilerBuilder
    extends Object
    This class runs an ngram analysis over submitted text; the results can be used for automatic language identification.

    The similarity calculation is experimental. You have been warned.

    Methods are provided to build new ngram profiles.

    Author:
    Sami Siren, Jerome Charron - http://frutch.free.fr/
    • Constructor Detail

      • LanguageProfilerBuilder

        public LanguageProfilerBuilder​(String name,
                                       int minlen,
                                       int maxlen)
        Constructs a new ngram profile
        Parameters:
        name - the name of the profile
        minlen - the minimum length of ngram sequences
        maxlen - the maximum length of ngram sequences
      • LanguageProfilerBuilder

        public LanguageProfilerBuilder​(String name)
        Constructs a new ngram profile where minlen=3, maxlen=3
        Parameters:
        name - the name of the profile, usually a two-character string
        Since:
        Tika 1.0
    • Method Detail

      • create

        public static LanguageProfilerBuilder create​(String name,
                                                     InputStream is,
                                                     String encoding)
                                              throws TikaException
        Creates a new language profile from a text file, preferably a fairly large one (5-10k lines)
        Parameters:
        name - the name to give the profile
        is - the stream to read
        encoding - the character encoding of the stream
        Throws:
        TikaException - if the language profile could not be created
      • main

        public static void main​(String[] args)
        Main method, used for testing only
        Parameters:
        args - command-line arguments
      • getName

        public String getName()
        Returns:
        the name of the profile
      • add

        public void add​(StringBuffer word)
        Adds ngrams from a single word to this profile
        Parameters:
        word - is the word to add
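
        As a rough illustration of what extracting ngrams from a single word involves, the sketch below collects every substring whose length lies between a minimum and a maximum. WordNGramSketch and its ngrams method are hypothetical names, not part of Tika, and the sketch ignores any word-boundary padding or counting the real class may apply.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Tika. */
final class WordNGramSketch {

    /** Returns every substring of word with a length from minlen to maxlen, inclusive. */
    static List<String> ngrams(CharSequence word, int minlen, int maxlen) {
        List<String> result = new ArrayList<>();
        for (int len = minlen; len <= maxlen; len++) {
            // Slide a window of size len across the word.
            for (int start = 0; start + len <= word.length(); start++) {
                result.add(word.subSequence(start, start + len).toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ti, ik, ka, tik, ika]
        System.out.println(ngrams(new StringBuffer("tika"), 2, 3));
    }
}
```

        A word shorter than minlen yields no ngrams in this sketch.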
      • analyze

        public void analyze​(StringBuilder text)
        Analyzes a piece of text
        Parameters:
        text - the text to be analyzed
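
        Analyzing a piece of text presumably means breaking it into words and adding each one to the profile. The sketch below shows one way to do the splitting, assuming that a word is a run of letters and that case is folded to lower case; both are assumptions, and AnalyzeSketch is a hypothetical name rather than Tika code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Tika. */
final class AnalyzeSketch {

    /** Splits text into lower-cased runs of letters (assumed definition of a word). */
    static List<String> words(CharSequence text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-letter ends the current word.
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [hello, tika]
        System.out.println(words(new StringBuilder("Hello, Tika 2!")));
    }
}
```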
      • normalize

        protected void normalize()
        Normalizes the profile (calculates the ngram frequencies)
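
        One simple reading of calculating ngram frequencies is to divide each ngram's count by the total count. The sketch below does exactly that; it is an assumption for illustration, and the real method may normalize differently (for example, per ngram length). NormalizeSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika. */
final class NormalizeSketch {

    /** Converts raw counts into relative frequencies that sum to 1 (assumed scheme). */
    static Map<String, Float> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        Map<String, Float> frequencies = new LinkedHashMap<>();
        if (total == 0) {
            return frequencies; // nothing to normalize
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue().floatValue() / total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("th", 3);
        counts.put("he", 1);
        // prints {th=0.75, he=0.25}
        System.out.println(normalize(counts));
    }
}
```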
      • getSorted

        public List<org.apache.tika.langdetect.tika.LanguageProfilerBuilder.NGramEntry> getSorted()
        Returns a list of ngrams sorted first by frequency, then by sequence
        Returns:
        sorted list of ngrams
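
        The two-level ordering can be sketched with a comparator that looks at frequency first and falls back to the ngram text on ties. The sketch assumes the most frequent ngrams come first and that ties are broken in ascending text order; the documentation above does not state the direction. SortedNGramSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika. */
final class SortedNGramSketch {

    /** Returns ngram sequences ordered by descending frequency, then by sequence. */
    static List<String> sorted(Map<String, Float> frequencies) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(frequencies.entrySet());
        entries.sort((a, b) -> {
            // Higher frequency first (assumed direction).
            int byFrequency = Float.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Float> e : entries) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Float> frequencies = new LinkedHashMap<>();
        frequencies.put("he", 0.25f);
        frequencies.put("th", 0.5f);
        frequencies.put("an", 0.25f);
        // prints [th, an, he]
        System.out.println(sorted(frequencies));
    }
}
```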
      • getSimilarity

        public float getSimilarity​(LanguageProfilerBuilder another)
                            throws TikaException
        Calculates a score for how well two ngram profiles match each other
        Parameters:
        another - ngram profile to compare against
        Returns:
        the similarity score; 0 indicates an exact match
        Throws:
        TikaException - if a score could not be calculated
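
        Because 0 means an exact match, the score behaves like a distance rather than a similarity. The sketch below shows one measure with that property: the sum of absolute frequency differences over every ngram seen in either profile. This is an illustrative assumption, not the calculation the class is documented to use, and SimilaritySketch is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Tika. */
final class SimilaritySketch {

    /** Sums |freqA - freqB| over the union of ngrams; identical profiles give 0. */
    static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> allNgrams = new HashSet<>(a.keySet());
        allNgrams.addAll(b.keySet());
        float sum = 0f;
        for (String ngram : allNgrams) {
            // An ngram missing from one profile counts as frequency 0 there.
            sum += Math.abs(a.getOrDefault(ngram, 0f) - b.getOrDefault(ngram, 0f));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> english = Map.of("th", 0.5f, "he", 0.5f);
        Map<String, Float> other = Map.of("th", 0.5f, "an", 0.5f);
        System.out.println(distance(english, english)); // prints 0.0
        System.out.println(distance(english, other));   // prints 1.0
    }
}
```

        Under this measure, lower scores mean closer profiles, so a caller comparing a text against several language profiles would pick the one with the smallest score.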
      • load

        public void load​(InputStream is)
                  throws IOException
        Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
        Parameters:
        is - the InputStream to read
        Throws:
        IOException
      • save

        public void save​(OutputStream os)
                  throws IOException
        Writes the ngram profile content to an OutputStream, encoded as UTF-8
        Parameters:
        os - the Stream to output to
        Throws:
        IOException
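
        The sketch below illustrates a UTF-8 round trip through streams, the pattern that load and save describe. The on-disk layout used here (one "ngram count" pair per line) is an assumption made for the example and is not taken from the documentation above; ProfileIoSketch and its methods are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika. */
final class ProfileIoSketch {

    /** Writes one "ngram count" line per entry, encoded as UTF-8 (assumed layout). */
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            writer.write(e.getKey() + " " + e.getValue() + "\n");
        }
        writer.flush(); // flush, but leave closing the stream to the caller
    }

    /** Reads lines written by save, decoding them as UTF-8. */
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            int sep = line.lastIndexOf(' ');
            if (sep < 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, sep), Integer.parseInt(line.substring(sep + 1)));
        }
        return counts;
    }

    /** Saves to memory and loads back, to show that both sides agree on UTF-8. */
    static Map<String, Integer> roundTrip(Map<String, Integer> counts) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            save(counts, out);
            return load(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("th\u00e9", 2);
        counts.put("he", 1);
        System.out.println(roundTrip(counts).equals(counts)); // prints true
    }
}
```

        Naming the charset explicitly on both the reader and the writer keeps the round trip independent of the platform default encoding.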