Class TextProfileSignature

java.lang.Object
org.apache.tika.eval.core.textstats.TextProfileSignature
All Implemented Interfaces:
TextStatsCalculator, TokenCountStatsCalculator<String>

public class TextProfileSignature extends Object implements TokenCountStatsCalculator<String>
Copied almost verbatim from Apache Nutch: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextProfileSignature.java

See documentation: https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/TextProfileSignature.html

This implementation returns the signature as a Base32-encoded SHA-256 digest.
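
The following is a minimal, self-contained sketch of how a signature of this kind could be computed. It is not the Tika source. It follows the steps described in the Nutch documentation linked above (filter short tokens, quantize counts, sort by frequency, hash the resulting profile) and swaps in SHA-256 and Base32 as stated here. The class name TextProfileSketch, the plain Map<String, Integer> standing in for TokenCounts, the ">=" length test, the alphabetical tie-break, and the hand-written RFC 4648 Base32 encoder are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and details are assumptions, not the Tika API.
final class TextProfileSketch {

    /** Builds the quantized "token count" profile, one token per line. */
    static String profile(Map<String, Integer> counts, int minTokenLength, float quantRate) {
        // Highest count among tokens that pass the length filter.
        int maxFreq = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().length() >= minTokenLength) {
                maxFreq = Math.max(maxFreq, e.getValue());
            }
        }
        // Quantization step, following the Nutch description.
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().length() < minTokenLength) {
                continue; // token too short
            }
            int quantized = (e.getValue() / quant) * quant; // round down to a multiple of quant
            if (quantized < quant) {
                continue; // frequency below one quantization step
            }
            kept.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), quantized));
        }
        // Descending count; ties broken alphabetically (an assumption, for determinism).
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : kept) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(e.getKey()).append(' ').append(e.getValue());
        }
        return sb.toString();
    }

    /** RFC 4648 Base32 with '=' padding. */
    static String base32(byte[] data) {
        final String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";
        StringBuilder out = new StringBuilder();
        int buffer = 0; // bit accumulator
        int bits = 0;   // number of unread bits in the accumulator
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                out.append(alphabet.charAt((buffer >> (bits - 5)) & 0x1F));
                bits -= 5;
            }
        }
        if (bits > 0) {
            out.append(alphabet.charAt((buffer << (5 - bits)) & 0x1F));
        }
        while (out.length() % 8 != 0) {
            out.append('=');
        }
        return out.toString();
    }

    /** Base32-encoded SHA-256 digest of the profile string. */
    static String signature(Map<String, Integer> counts, int minTokenLength, float quantRate) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(
                    profile(counts, minTokenLength, quantRate).getBytes(StandardCharsets.UTF_8));
            return base32(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("the", 10);
        counts.put("quick", 3);
        counts.put("fox", 1);
        counts.put("a", 7);
        // Parameter values here are illustrative.
        System.out.println(profile(counts, 2, 0.01f));
        System.out.println(signature(counts, 2, 0.01f));
    }
}
```

Because counts are rounded down to a multiple of the quantization step before hashing, two documents whose token frequencies differ only slightly can produce the same profile string and therefore the same signature. That is the point of a near-duplicate signature, and it is why the result should not be treated as a content checksum.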

  • Constructor Details

    • TextProfileSignature

      public TextProfileSignature()
  • Method Details

    • calculate

      public String calculate(TokenCounts tokenCounts)
      Specified by:
      calculate in interface TokenCountStatsCalculator<String>
    • setMinTokenLength

      public void setMinTokenLength(int minTokenLength)
      Be careful: for CJK languages, the default analyzer uses character bigrams. If you set minTokenLength > 2, all CJK tokens will be ignored.
      Parameters:
      minTokenLength - include tokens of this length or greater.
    • setQuantRate

      public void setQuantRate(float quantRate)
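
The CJK warning on setMinTokenLength can be illustrated with a small standalone sketch. It assumes the analyzer emits overlapping two-character bigrams and that token length is measured with String.length(). The class CjkBigramFilterSketch and its methods are hypothetical helpers, not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Tika analyzer.
final class CjkBigramFilterSketch {

    /** Splits text into overlapping two-character bigrams. */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    /** Keeps tokens whose length is minTokenLength or greater. */
    static List<String> keep(List<String> tokens, int minTokenLength) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() >= minTokenLength) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four CJK characters, written as Unicode escapes.
        List<String> tokens = bigrams("\u6771\u4eac\u90fd\u5e81");
        System.out.println(keep(tokens, 2).size()); // every bigram has length 2, so all survive
        System.out.println(keep(tokens, 3).size()); // no bigram reaches length 3, so none survive
    }
}
```

With a minimum length of 3, a document made up only of such bigrams would contribute no tokens to the profile at all.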