Class TextProfileSignature
java.lang.Object
org.apache.tika.eval.core.textstats.TextProfileSignature
- All Implemented Interfaces:
TextStatsCalculator, TokenCountStatsCalculator<String>
Copied nearly directly from Apache Nutch:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextProfileSignature.java
See documentation: https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/TextProfileSignature.html
The returned signature is the Base32-encoded SHA-256 digest.
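To make the output format concrete, here is a minimal sketch of a Base32-encoded SHA-256 digest using only the JDK. It is not this class's code: the class name `Base32Sha256Sketch`, the hand-rolled RFC 4648 encoder, the use of `=` padding, and the UTF-8 input encoding are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: a hypothetical helper, not part of Tika.
final class Base32Sha256Sketch {
    // RFC 4648 Base32 alphabet.
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Encodes bytes as Base32, padded with '=' to a multiple of 8 characters.
    static String base32(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // pending bits not yet emitted
        int bits = 0;   // number of pending bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                out.append(ALPHABET.charAt((buffer >> (bits - 5)) & 0x1F));
                bits -= 5;
            }
            buffer &= (1 << bits) - 1; // keep only the unconsumed bits
        }
        if (bits > 0) {
            // Left-align the final partial group of bits.
            out.append(ALPHABET.charAt((buffer << (5 - bits)) & 0x1F));
        }
        while (out.length() % 8 != 0) {
            out.append('=');
        }
        return out.toString();
    }

    // SHA-256 over the UTF-8 bytes of the input, then Base32 (assumed encoding).
    static String signature(String profileText) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return base32(md.digest(profileText.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A 32-byte digest encodes to 52 Base32 characters plus 4 padding characters.
        System.out.println(signature("example profile text").length());
    }
}
```

With padding, a 32-byte SHA-256 digest always encodes to 56 characters, so signatures produced this way have a fixed length.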
-
Constructor Summary
TextProfileSignature()
-
Method Summary
Modifier and Type | Method | Description
String | calculate(TokenCounts tokenCounts) |
void | setMinTokenLength(int minTokenLength) | Be careful: for CJK languages, the default analyzer uses character bigrams.
void | setQuantRate(float quantRate) |
-
Constructor Details
-
TextProfileSignature
public TextProfileSignature()
-
-
Method Details
-
calculate
public String calculate(TokenCounts tokenCounts)
- Specified by:
calculate in interface TokenCountStatsCalculator<String>
-
setMinTokenLength
public void setMinTokenLength(int minTokenLength)
Be careful: for CJK languages, the default analyzer uses character bigrams. You will "ignore" all CJK language tokens if you set minTokenLength > 2!
- Parameters:
minTokenLength - include tokens of this length or greater.
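The CJK caveat can be illustrated with a small sketch. The class `CjkMinTokenLengthSketch` and its `bigrams` and `filter` helpers are hypothetical stand-ins, not Tika APIs: `bigrams` imitates an analyzer that emits overlapping character bigrams, and `filter` applies the documented rule of keeping tokens of `minTokenLength` or greater. The actual implementation may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helpers, not part of Tika.
final class CjkMinTokenLengthSketch {

    // Stand-in for an analyzer that emits overlapping character bigrams.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Keeps tokens whose length is at least minTokenLength (the documented semantics).
    static List<String> filter(List<String> tokens, int minTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() >= minTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three CJK characters (U+6771 U+4EAC U+90FD) yield two bigrams.
        List<String> tokens = bigrams("\u6771\u4EAC\u90FD");
        System.out.println(filter(tokens, 2).size()); // both bigrams are kept
        System.out.println(filter(tokens, 3).size()); // every bigram is dropped
    }
}
```

Because every bigram has length 2, any threshold above 2 removes all of them, which would leave nothing to build a signature from for CJK text.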
-
setQuantRate
public void setQuantRate(float quantRate)
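This page does not describe quantRate. The sketch below follows the quantization scheme in the Nutch documentation linked above, on the assumption that this class behaves the same way. The quantum is maxFreq * quantRate, raised to at least 2 when any token occurs more than once. Counts are rounded down to a multiple of the quantum, and tokens that fall below it are discarded. The class `QuantRateSketch` and its methods are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hypothetical re-statement of the Nutch-style quantization.
final class QuantRateSketch {

    // Computes the quantum from the highest token frequency and the quant rate.
    static int quant(int maxFreq, float quantRate) {
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            // With any repeated token, frequency-1 tokens are always discarded.
            quant = (maxFreq > 1) ? 2 : 1;
        }
        return quant;
    }

    // Rounds each count down to a multiple of the quantum and drops counts below it.
    static Map<String, Integer> quantize(Map<String, Integer> counts, float quantRate) {
        int maxFreq = 0;
        for (int count : counts.values()) {
            maxFreq = Math.max(maxFreq, count);
        }
        int quant = quant(maxFreq, quantRate);
        Map<String, Integer> profile = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int rounded = (e.getValue() / quant) * quant;
            if (rounded >= quant) {
                profile.put(e.getKey(), rounded);
            }
        }
        return profile;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("the", 300);
        counts.put("tika", 7);
        counts.put("rare", 2);
        counts.put("once", 1);
        // With quantRate 0.01 and maxFreq 300 the quantum is 3.
        System.out.println(quantize(counts, 0.01f));
    }
}
```

Under this scheme a larger quantRate makes the profile coarser, so documents with small differences in token frequency are more likely to share a signature.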
-