public class LanguageProfilerBuilder extends Object
The similarity calculation is experimental. You have been warned.
Methods are provided to build new ngram profiles.
| Constructor and Description |
|---|
| `LanguageProfilerBuilder(String name)`: Constructs a new ngram profile where minlen=3, maxlen=3 |
| `LanguageProfilerBuilder(String name, int minlen, int maxlen)`: Constructs a new ngram profile |
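
To make the `minlen` and `maxlen` constructor parameters concrete, the following is a minimal, self-contained sketch of extracting character ngrams whose lengths fall in that range. The class `NGramSketch` and its `ngrams` method are hypothetical names used only for illustration; they are not part of Tika, and the real class may pad words with boundary markers or normalize case in ways this page does not describe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: lists the character ngrams of a word.
class NGramSketch {

    // Returns every substring of word whose length is between minlen and maxlen,
    // shortest lengths first, left to right within each length.
    static List<String> ngrams(String word, int minlen, int maxlen) {
        List<String> out = new ArrayList<>();
        for (int n = minlen; n <= maxlen; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With minlen=3 and maxlen=3 (the defaults named above), only trigrams are produced.
        System.out.println(ngrams("tika", 3, 3)); // prints [tik, ika]
        System.out.println(ngrams("tika", 2, 3)); // prints [ti, ik, ka, tik, ika]
    }
}
```

A word shorter than `minlen` yields an empty list in this sketch; how the actual class treats such words is not stated here.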
| Modifier and Type | Method and Description |
|---|---|
| `void` | `add(StringBuffer word)`: Adds ngrams from a single word to this profile |
| `void` | `analyze(StringBuilder text)`: Analyzes a piece of text |
| `static LanguageProfilerBuilder` | `create(String name, InputStream is, String encoding)`: Creates a new language profile from a text file (preferably quite large, 5-10k lines) |
| `String` | `getName()` |
| `float` | `getSimilarity(LanguageProfilerBuilder another)`: Calculates a score for how well two ngram profiles match each other |
| `List<org.apache.tika.langdetect.tika.LanguageProfilerBuilder.NGramEntry>` | `getSorted()`: Returns a sorted list of ngrams (sort done by 1. |
| `void` | `load(InputStream is)`: Loads an ngram profile from an InputStream (assumes UTF-8 encoded content) |
| `static void` | `main(String[] args)`: main method used for testing only |
| `protected void` | `normalize()`: Normalizes the profile (calculates the ngram frequencies) |
| `void` | `save(OutputStream os)`: Writes the ngram profile content to an OutputStream, using UTF-8 encoding |
| `String` | `toString()` |
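
As a rough illustration of what "calculates the ngram frequencies" and "a sorted list of ngrams" could mean, here is a small sketch that turns raw ngram counts into relative frequencies and orders the ngrams. Everything in it is an assumption for illustration: `FrequencySketch` is a hypothetical class, dividing by the total count is only one possible normalization, and the ordering rule (descending frequency, ties broken alphabetically) is a guess, since the summary of `getSorted()` above is truncated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not Tika's implementation.
class FrequencySketch {

    // Converts raw counts to relative frequencies (count / total of all counts).
    static Map<String, Float> normalize(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Float> freq = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), total == 0 ? 0f : (float) e.getValue() / total);
        }
        return freq;
    }

    // Orders ngrams by descending frequency; ties are broken alphabetically (assumed rule).
    static List<String> sortedByFrequency(Map<String, Float> freq) {
        List<String> keys = new ArrayList<>(freq.keySet());
        keys.sort((a, b) -> {
            int byFreq = Float.compare(freq.get(b), freq.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("and", 1);
        counts.put("the", 3);
        Map<String, Float> freq = normalize(counts);
        System.out.println(freq.get("the"));         // prints 0.75
        System.out.println(sortedByFrequency(freq)); // prints [the, and]
    }
}
```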
public LanguageProfilerBuilder(String name, int minlen, int maxlen)
name - the name of the profile
minlen - the minimum length of ngram sequences
maxlen - the maximum length of ngram sequences

public LanguageProfilerBuilder(String name)
name - the name of the profile, usually a two-character string

public static LanguageProfilerBuilder create(String name, InputStream is, String encoding) throws TikaException
name - the name to be given to the profile
is - the stream to be read
encoding - the encoding of the stream
TikaException - if a language profile could not be created

public static void main(String[] args)
args -

public String getName()

public void add(StringBuffer word)
word - the word to add

public void analyze(StringBuilder text)
text - the text to be analyzed

protected void normalize()

public List<org.apache.tika.langdetect.tika.LanguageProfilerBuilder.NGramEntry> getSorted()

public float getSimilarity(LanguageProfilerBuilder another) throws TikaException
another - the ngram profile to compare against
TikaException - if a score could not be calculated

public void load(InputStream is) throws IOException
is - the InputStream to read
IOException

public void save(OutputStream os) throws IOException
os - the stream to output to
IOException

Copyright © 2007–2022 The Apache Software Foundation. All rights reserved.