org.apache.tika.language
Class LanguageProfilerBuilder

java.lang.Object
  extended by org.apache.tika.language.LanguageProfilerBuilder

public class LanguageProfilerBuilder
extends Object

This class runs an ngram analysis over submitted text; the results can be used for automatic language identification. The similarity calculation is experimental; you have been warned. Methods are provided to build new ngram profiles.

Author:
Sami Siren, Jerome Charron - http://frutch.free.fr/

Constructor Summary
LanguageProfilerBuilder(String name)
          Constructs a new ngram profile where minlen=3, maxlen=3
LanguageProfilerBuilder(String name, int minlen, int maxlen)
          Constructs a new ngram profile
 
Method Summary
 void add(StringBuffer word)
          Adds ngrams from a single word to this profile
 void analyze(StringBuilder text)
          Analyzes a piece of text
static LanguageProfilerBuilder create(String name, InputStream is, String encoding)
          Creates a new language profile from a text file (preferably a large one, 5-10k lines)
 String getName()
           
 float getSimilarity(LanguageProfilerBuilder another)
          Calculates a score for how well two ngram profiles match each other
 List<org.apache.tika.language.LanguageProfilerBuilder.NGramEntry> getSorted()
          Returns a sorted list of ngrams (sort done by 1. frequency 2. sequence)
 void load(InputStream is)
          Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
static void main(String[] args)
          main method used for testing only
protected  void normalize()
          Normalizes the profile (calculates the ngram frequencies)
 void save(OutputStream os)
          Writes the ngram profile content to an OutputStream using UTF-8 encoding
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

LanguageProfilerBuilder

public LanguageProfilerBuilder(String name,
                               int minlen,
                               int maxlen)
Constructs a new ngram profile

Parameters:
name - is the name of the profile
minlen - is the min length of ngram sequences
maxlen - is the max length of ngram sequences

LanguageProfilerBuilder

public LanguageProfilerBuilder(String name)
Constructs a new ngram profile where minlen=3, maxlen=3

Parameters:
name - is the name of the profile, usually a two-character string
Since:
Tika 1.0
Method Detail

getName

public String getName()
Returns:
Returns the name.

add

public void add(StringBuffer word)
Adds ngrams from a single word to this profile

Parameters:
word - is the word to add
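
As a rough illustration of what adding a word involves, the sketch below counts every ngram whose length lies between a minimum and a maximum. It is a minimal stand-alone sketch, not the code of this class: the names `NGramCountSketch` and `addWord` and the use of a plain `Map` are assumptions made for the example, and the real implementation may treat word boundaries differently.

```java
import java.util.HashMap;
import java.util.Map;

final class NGramCountSketch {
    /** Adds every ngram of length minlen..maxlen found in word to counts (hypothetical helper). */
    static void addWord(Map<String, Integer> counts, String word, int minlen, int maxlen) {
        for (int n = minlen; n <= maxlen; n++) {
            // Slide a window of width n across the word; words shorter than n yield nothing.
            for (int start = 0; start + n <= word.length(); start++) {
                counts.merge(word.substring(start, start + n), 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        addWord(counts, "banana", 3, 3); // trigrams only, like the one-argument constructor
        System.out.println(counts);
    }
}
```

With minlen=3 and maxlen=3, the word "banana" contributes the trigrams "ban", "ana" (twice) and "nan".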

analyze

public void analyze(StringBuilder text)
Analyzes a piece of text

Parameters:
text - the text to be analyzed
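
Analyzing a piece of text presumably means breaking it into words and adding each word to the profile. The sketch below shows one possible tokenization step only. The class `AnalyzeSketch`, the method `words`, and the rule that any non-letter character ends a word are assumptions for illustration, not the documented behavior of this class.

```java
import java.util.ArrayList;
import java.util.List;

final class AnalyzeSketch {
    /** Splits text into lower-cased words; any non-letter character ends a word (assumed rule). */
    static List<String> words(CharSequence text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter closes the word collected so far.
                result.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString()); // flush the trailing word
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words(new StringBuilder("Hello, World! 42")));
    }
}
```

Each word returned this way could then be passed to a word-level step such as add.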

normalize

protected void normalize()
Normalizes the profile (calculates the ngram frequencies)
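
One way to picture normalization is turning raw ngram counts into relative frequencies. The sketch below divides each count by the total of all counts. That divisor, like the names `NormalizeSketch` and `normalize`, is an assumption for the example; this class may normalize differently, for instance per ngram length.

```java
import java.util.HashMap;
import java.util.Map;

final class NormalizeSketch {
    /** Converts raw counts into relative frequencies that sum to 1 (assumed scheme). */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        Map<String, Double> frequencies = new HashMap<>();
        if (total == 0) {
            return frequencies; // nothing to normalize
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("ana", 2);
        counts.put("ban", 1);
        counts.put("nan", 1);
        System.out.println(normalize(counts));
    }
}
```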


getSorted

public List<org.apache.tika.language.LanguageProfilerBuilder.NGramEntry> getSorted()
Returns a sorted list of ngrams (sort done by 1. frequency 2. sequence)

Returns:
sorted list of ngrams
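
The two-level ordering (first frequency, then sequence) can be sketched with a comparator. The example assumes that higher frequencies come first and that ties are broken by the ngram text in ascending order; the description above does not state the direction, and `SortedNGramsSketch` and `sorted` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SortedNGramsSketch {
    /** Orders entries by frequency (highest first, assumed), then by ngram text. */
    static List<Map.Entry<String, Double>> sorted(Map<String, Double> frequencies) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(frequencies.entrySet());
        entries.sort((a, b) -> {
            int byFrequency = Double.compare(b.getValue(), a.getValue());
            // Equal frequencies fall back to comparing the ngram sequences.
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Double> frequencies = new HashMap<>();
        frequencies.put("nan", 0.25);
        frequencies.put("ban", 0.25);
        frequencies.put("ana", 0.5);
        for (Map.Entry<String, Double> e : sorted(frequencies)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```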

toString

public String toString()
Overrides:
toString in class Object

getSimilarity

public float getSimilarity(LanguageProfilerBuilder another)
                    throws TikaException
Calculates a score for how well two ngram profiles match each other

Parameters:
another - ngram profile to compare against
Returns:
similarity score, where 0 means an exact match
Throws:
TikaException - if a score could not be calculated
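
Because 0 means an exact match, the score behaves like a distance rather than a similarity. The sketch below illustrates that idea with a simple measure, the sum of absolute frequency differences over all ngrams seen in either profile. This is an assumed stand-in chosen for clarity, not the formula this class uses, and `ProfileDistanceSketch` and `distance` are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ProfileDistanceSketch {
    /** Sums |fa - fb| over every ngram in either profile; 0 means identical profiles. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        double sum = 0.0;
        for (String ngram : ngrams) {
            // An ngram missing from one profile counts as frequency 0 there.
            sum += Math.abs(a.getOrDefault(ngram, 0.0) - b.getOrDefault(ngram, 0.0));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> first = new HashMap<>();
        first.put("the", 0.5);
        first.put("and", 0.5);
        Map<String, Double> second = new HashMap<>();
        second.put("the", 0.5);
        second.put("der", 0.5);
        System.out.println(distance(first, first));
        System.out.println(distance(first, second));
    }
}
```

Under this measure, lower values indicate closer profiles, so a language guesser would pick the reference profile with the smallest score.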

load

public void load(InputStream is)
          throws IOException
Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)

Parameters:
is - the InputStream to read
Throws:
IOException

create

public static LanguageProfilerBuilder create(String name,
                                             InputStream is,
                                             String encoding)
                                      throws TikaException
Creates a new language profile from a text file (preferably a large one, 5-10k lines)

Parameters:
name - to be given for the profile
is - a stream to be read
encoding - is the encoding of stream
Throws:
TikaException - if a language profile could not be created
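
The encoding parameter matters because the stream has to be decoded into characters before any ngrams can be counted. The sketch below covers only that decoding step, using standard `java.io` classes. `ReadCorpusSketch` and `readAll` are hypothetical names, and how this class actually reads and analyzes the stream is not specified here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class ReadCorpusSketch {
    /** Decodes the whole stream with the named encoding; closes the stream when done. */
    static StringBuilder readAll(InputStream is, String encoding) throws IOException {
        StringBuilder text = new StringBuilder();
        // An unknown encoding name raises UnsupportedEncodingException, an IOException.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding))) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        }
        return text;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "hello world".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes), "UTF-8"));
    }
}
```

The resulting StringBuilder has the same type that analyze accepts.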

save

public void save(OutputStream os)
          throws IOException
Writes the ngram profile content to an OutputStream using UTF-8 encoding

Parameters:
os - the Stream to output to
Throws:
IOException
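
To show why fixing the encoding to UTF-8 on both sides matters, the sketch below writes ngram counts to a stream and reads them back. The one-entry-per-line "ngram count" layout is an assumption for the example, not the documented file format of this class, and `ProfileIoSketch` with its `save` and `load` methods is hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class ProfileIoSketch {
    /** Writes one "ngram count" line per entry in UTF-8 (assumed layout). */
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            writer.write(e.getKey() + " " + e.getValue() + "\n");
        }
        writer.flush(); // flush, but leave the caller's stream open
    }

    /** Reads the lines written by save, decoding them as UTF-8. */
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            int space = line.lastIndexOf(' ');
            if (space <= 0) {
                continue; // skip blank lines and lines without a count
            }
            counts.put(line.substring(0, space), Integer.parseInt(line.substring(space + 1)));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("the", 42);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        save(counts, out);
        System.out.println(load(new ByteArrayInputStream(out.toByteArray())));
    }
}
```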

main

public static void main(String[] args)
main method used for testing only

Parameters:
args -


Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.