Class LanguageProfilerBuilder

java.lang.Object
org.apache.tika.langdetect.tika.LanguageProfilerBuilder

public class LanguageProfilerBuilder extends Object
This class runs an ngram analysis over submitted text. The results can be used for automatic language identification.

The similarity calculation is experimental. You have been warned.

Methods are provided to build new ngram profiles.

Author:
Sami Siren, Jerome Charron - http://frutch.free.fr/
  • Constructor Details

    • LanguageProfilerBuilder

      public LanguageProfilerBuilder(String name, int minlen, int maxlen)
      Constructs a new ngram profile
      Parameters:
      name - the name of the profile
      minlen - the minimum length of ngram sequences
      maxlen - the maximum length of ngram sequences
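To illustrate what the minlen and maxlen bounds mean, here is a minimal sketch that lists every character sequence of a word whose length falls within the two bounds. The class NGramWindowSketch and its ngrams method are hypothetical names used only for this illustration; they are not part of Tika and may not match how LanguageProfilerBuilder extracts ngrams internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Tika code.
final class NGramWindowSketch {

    // Returns every substring of word whose length is between minLen and maxLen,
    // shortest lengths first, left to right within each length.
    static List<String> ngrams(String word, int minLen, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int len = minLen; len <= maxLen; len++) {
            for (int start = 0; start + len <= word.length(); start++) {
                out.add(word.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ti, ik, ka, tik, ika]
        System.out.println(ngrams("tika", 2, 3));
    }
}
```

With minlen and maxlen both set to 3, the same loop would produce only trigrams.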
    • LanguageProfilerBuilder

      public LanguageProfilerBuilder(String name)
      Constructs a new ngram profile where minlen=3, maxlen=3
      Parameters:
      name - the name of the profile, usually a two-character string
      Since:
      Tika 1.0
  • Method Details

    • create

      public static LanguageProfilerBuilder create(String name, InputStream is, String encoding) throws TikaException
      Creates a new language profile from a text file, preferably a fairly large one (5-10k lines)
      Parameters:
      name - to be given for the profile
      is - a stream to be read
      encoding - is the encoding of stream
      Throws:
      TikaException - if the language profile could not be created
    • main

      public static void main(String[] args)
      main method used for testing only
      Parameters:
      args -
    • getName

      public String getName()
      Returns:
      the name of the profile
    • add

      public void add(StringBuffer word)
      Adds ngrams from a single word to this profile
      Parameters:
      word - is the word to add
    • analyze

      public void analyze(StringBuilder text)
      Analyzes a piece of text
      Parameters:
      text - the text to be analyzed
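As a rough picture of what analyzing a piece of text can involve, the sketch below splits text into words and counts the ngrams of one fixed length in each word. The tokenization rule (runs of letters, lower-cased) and the names AnalyzeSketch and countNGrams are assumptions made for this example; the actual analyze method may tokenize and count differently.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Tika code.
final class AnalyzeSketch {

    // Counts the ngrams of length n found inside each word of text.
    // Assumption: a word is a maximal run of letters, compared in lower case.
    static Map<String, Integer> countNGrams(CharSequence text, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        StringBuilder word = new StringBuilder();
        // Iterate one position past the end so the last word is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetter(c)) {
                word.append(Character.toLowerCase(c));
            } else if (word.length() > 0) {
                for (int s = 0; s + n <= word.length(); s++) {
                    counts.merge(word.substring(s, s + n), 1, Integer::sum);
                }
                word.setLength(0);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {eme=1, hem=1, the=2}
        System.out.println(countNGrams(new StringBuilder("The theme"), 3));
    }
}
```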
    • normalize

      protected void normalize()
      Normalizes the profile (calculates the ngram frequencies)
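One simple way to turn raw ngram counts into frequencies is to divide each count by the sum of all counts. The sketch below does exactly that; it is an assumption about what "normalizing" could look like, and the class may use a different denominator (for example, one total per ngram length). NormalizeSketch and frequencies are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Tika code.
final class NormalizeSketch {

    // Converts raw counts to relative frequencies that sum to 1
    // (or returns all zeros when there are no counts at all).
    static Map<String, Float> frequencies(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Float> freq = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), total == 0 ? 0f : (float) e.getValue() / total);
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("the", 3);
        counts.put("and", 1);
        System.out.println(frequencies(counts));
    }
}
```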
    • getSorted

      public List<org.apache.tika.langdetect.tika.LanguageProfilerBuilder.NGramEntry> getSorted()
      Returns a list of ngrams sorted first by frequency, then by sequence
      Returns:
      sorted list of ngrams
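The two-level ordering described for getSorted can be sketched with a comparator that looks at frequency first and falls back to the ngram text on ties. The direction of each comparison (highest frequency first, ties in alphabetical order) is an assumption, and SortedNGramsSketch is a hypothetical name; the sketch works on a plain map rather than on the class's NGramEntry type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Tika code.
final class SortedNGramsSketch {

    // Returns the ngrams ordered by descending frequency,
    // with equal frequencies ordered by the ngram text itself.
    static List<String> sorted(Map<String, Float> freq) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Float.compare(b.getValue(), a.getValue()); // higher frequency first
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Float> e : entries) {
            out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> freq = new LinkedHashMap<>();
        freq.put("ing", 0.25f);
        freq.put("the", 0.5f);
        freq.put("and", 0.25f);
        // prints [the, and, ing]
        System.out.println(sorted(freq));
    }
}
```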
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getSimilarity

      public float getSimilarity(LanguageProfilerBuilder another) throws TikaException
      Calculates a score for how well two ngram profiles match each other
      Parameters:
      another - ngram profile to compare against
      Returns:
      the similarity score, where 0 means an exact match
      Throws:
      TikaException - if a score could not be calculated
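A score in which 0 means an exact match behaves like a distance. As a minimal sketch of that idea, the code below sums the absolute frequency differences over every ngram that appears in either profile, treating a missing ngram as frequency 0. This formula is an assumption chosen for illustration and is not claimed to be the one getSimilarity uses; SimilaritySketch and distance are hypothetical names.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not Tika code.
final class SimilaritySketch {

    // Sums |frequency in a - frequency in b| over the union of ngrams.
    // Identical profiles give 0; larger values mean less alike.
    static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> all = new HashSet<>(a.keySet());
        all.addAll(b.keySet());
        float sum = 0f;
        for (String gram : all) {
            sum += Math.abs(a.getOrDefault(gram, 0f) - b.getOrDefault(gram, 0f));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> en = Map.of("the", 0.5f, "ing", 0.5f);
        Map<String, Float> de = Map.of("der", 0.5f, "ing", 0.5f);
        System.out.println(distance(en, en)); // identical profiles
        System.out.println(distance(en, de)); // partially overlapping profiles
    }
}
```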
    • load

      public void load(InputStream is) throws IOException
      Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
      Parameters:
      is - the InputStream to read
      Throws:
      IOException
    • save

      public void save(OutputStream os) throws IOException
      Writes the ngram profile content to an OutputStream using UTF-8 encoding
      Parameters:
      os - the stream to write to
      Throws:
      IOException
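Both load and save fix the encoding to UTF-8, so a profile written on one platform can be read on another regardless of the default charset. The sketch below shows that pattern with the standard java.io classes. The one-entry-per-line "ngram count" layout is a hypothetical format invented for this example, not the documented profile file format, and Utf8ProfileIoSketch is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Tika code.
final class Utf8ProfileIoSketch {

    // Writes one "ngram count" line per entry, always as UTF-8.
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer w = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.write(e.getKey() + " " + e.getValue() + "\n");
        }
        w.flush(); // flush, but leave closing the stream to the caller
    }

    // Reads the lines back, always decoding as UTF-8.
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader r = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = r.readLine()) != null) {
            int sep = line.lastIndexOf(' ');
            if (sep <= 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, sep), Integer.parseInt(line.substring(sep + 1)));
        }
        return counts;
    }

    // Saves to memory and loads again, to show that the two methods agree.
    static Map<String, Integer> roundTrip(Map<String, Integer> counts) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            save(counts, out);
            return load(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("\u00e9t\u00e9", 3); // non-ASCII ngram to exercise the encoding
        counts.put("the", 7);
        System.out.println(roundTrip(counts).equals(counts));
    }
}
```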