java.lang.Object
  org.apache.tika.language.LanguageProfilerBuilder
public class LanguageProfilerBuilder
This class runs an ngram analysis over submitted text; the results can be used for automatic language identification. The similarity calculation is experimental, so treat its scores with caution. Methods are provided to build new ngram profiles.
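As a rough picture of what an ngram analysis involves, the following self-contained sketch counts the ngrams in a word, converts the counts to relative frequencies, and compares two profiles with a simple L1 distance. It is an illustration under stated assumptions, not the code of this class: the class name `NGramSketch`, its methods, and the distance measure are all hypothetical, and the computation behind `getSimilarity` may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; hypothetical names, not the Tika implementation. */
public class NGramSketch {

    /** Counts every ngram of length minlen..maxlen in a single word. */
    static Map<String, Integer> count(String word, int minlen, int maxlen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minlen; n <= maxlen; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                counts.merge(word.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Turns raw counts into relative frequencies (count divided by total). */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), (double) e.getValue() / total);
        }
        return freq;
    }

    /**
     * Assumed measure: L1 distance between two frequency profiles.
     * Identical profiles give 0; profiles with no shared ngram give 2.
     */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sum += Math.abs(e.getValue() - b.getOrDefault(e.getKey(), 0.0));
        }
        for (Map.Entry<String, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> reference = normalize(count("the_then_there", 3, 3));
        Map<String, Double> sample = normalize(count("the_theme", 3, 3));
        System.out.println("distance = " + distance(reference, sample));
    }
}
```

With this measure a lower value means a closer match. Using `minlen=3, maxlen=3` mirrors the defaults of the one-argument constructor; widening the range adds shorter or longer sequences to the same profile.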
Constructor Summary | |
---|---|
LanguageProfilerBuilder(String name) | Constructs a new ngram profile where minlen=3, maxlen=3 |
LanguageProfilerBuilder(String name, int minlen, int maxlen) | Constructs a new ngram profile |
Method Summary | | |
---|---|---|
void | add(StringBuffer word) | Adds ngrams from a single word to this profile |
void | analyze(StringBuilder text) | Analyzes a piece of text |
static LanguageProfilerBuilder | create(String name, InputStream is, String encoding) | Creates a new language profile from a text file (preferably quite large: 5-10k lines) |
String | getName() | |
float | getSimilarity(LanguageProfilerBuilder another) | Calculates a score for how well two ngram profiles match each other |
List<org.apache.tika.language.LanguageProfilerBuilder.NGramEntry> | getSorted() | Returns a sorted list of ngrams (sort done by 1. |
void | load(InputStream is) | Loads an ngram profile from an InputStream (assumes UTF-8 encoded content) |
static void | main(String[] args) | main method used for testing only |
protected void | normalize() | Normalizes the profile (calculates the ngram frequencies) |
void | save(OutputStream os) | Writes the profile content to an OutputStream, encoded as UTF-8 |
String | toString() | |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public LanguageProfilerBuilder(String name, int minlen, int maxlen)
Parameters:
name - the name of the profile
minlen - the minimum length of ngram sequences
maxlen - the maximum length of ngram sequences
public LanguageProfilerBuilder(String name)
Parameters:
name - the name of the profile, usually a two-character string
Method Detail |
---|
public String getName()
public void add(StringBuffer word)
Parameters:
word - the word to add
public void analyze(StringBuilder text)
Parameters:
text - the text to be analyzed
protected void normalize()
public List<org.apache.tika.language.LanguageProfilerBuilder.NGramEntry> getSorted()
public String toString()
Overrides:
toString in class Object
public float getSimilarity(LanguageProfilerBuilder another) throws TikaException
Parameters:
another - the ngram profile to compare against
Throws:
TikaException - if a score could not be calculated
public void load(InputStream is) throws IOException
Parameters:
is - the InputStream to read
Throws:
IOException
public static LanguageProfilerBuilder create(String name, InputStream is, String encoding) throws TikaException
Parameters:
name - the name to be given to the profile
is - the stream to be read
encoding - the encoding of the stream
Throws:
TikaException - if a language profile could not be created
public void save(OutputStream os) throws IOException
Parameters:
os - the stream to write to
Throws:
IOException
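Because save writes UTF-8 and load assumes UTF-8, a saved profile should survive a round trip even when its ngrams contain non-ASCII characters. The sketch below illustrates that idea with plain java.io streams. The class `ProfileIoSketch` and its one-pair-per-line `ngram count` layout are assumptions made for the example; the file format this class actually uses is not described here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; the line format is hypothetical. */
public class ProfileIoSketch {

    /** Writes one "ngram count" pair per line, encoded as UTF-8. */
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer w = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.write(e.getKey() + " " + e.getValue() + "\n");
        }
        w.flush(); // flush, but leave closing the stream to the caller
    }

    /** Reads the same layout back, decoding the bytes as UTF-8. */
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = r.readLine()) != null) {
            int sep = line.lastIndexOf(' ');
            if (sep <= 0) {
                continue; // skip blank lines and lines without a separator
            }
            counts.put(line.substring(0, sep),
                    Integer.parseInt(line.substring(sep + 1)));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("the", 5);
        counts.put("z\u00fcr", 2); // non-ASCII ngram
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        save(counts, out);
        Map<String, Integer> back = load(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(back.equals(counts));
    }
}
```

Naming the charset explicitly on both the writer and the reader avoids depending on the platform default encoding, which is the usual cause of corrupted non-ASCII ngrams.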
public static void main(String[] args)
Parameters:
args -