Package org.apache.tika.eval.app.tools
Class TopCommonTokenCounter
- java.lang.Object
  - org.apache.tika.eval.app.tools.TopCommonTokenCounter
 public class TopCommonTokenCounter extends Object
 Utility class that reads a UTF-8 input file with one document per row and outputs the 20000 tokens with the highest document frequencies. The CommonTokensAnalyzer intentionally drops tokens shorter than 4 characters but includes bigrams for CJK. It also has an include list for __email__ and __url__ and a skip list for common HTML markup terms.
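The document-frequency ranking described above can be illustrated with a minimal, self-contained sketch. This is not the Tika implementation: the class `DocFreqSketch` and the method `topByDocFreq` are hypothetical names, tokenization is a plain split on non-alphanumeric characters rather than the analyzer named above, and CJK bigrams, the include list, and the skip list are all omitted. It assumes each list element is one document (one row of the input file).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Apache Tika.
class DocFreqSketch {

    // Returns up to topN tokens, ordered by descending document frequency
    // (ties broken alphabetically). Tokens shorter than minLen are dropped.
    static List<String> topByDocFreq(List<String> docs, int topN, int minLen) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Collect distinct tokens so each document counts a token once.
            Set<String> seen = new HashSet<>();
            // Assumption: naive split on anything that is not a letter, digit, or underscore.
            for (String tok : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}_]+")) {
                if (tok.length() >= minLen) {
                    seen.add(tok);
                }
            }
            for (String tok : seen) {
                docFreq.merge(tok, 1, Integer::sum);
            }
        }
        List<String> tokens = new ArrayList<>(docFreq.keySet());
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return new ArrayList<>(tokens.subList(0, Math.min(topN, tokens.size())));
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("alpha beta gamma");
        docs.add("alpha delta");
        docs.add("alpha beta the");
        // "the" is dropped because it is shorter than 4 characters.
        System.out.println(topByDocFreq(docs, 2, 4)); // prints [alpha, beta]
    }
}
```

In a real run the rows would come from the UTF-8 input file, for example via `java.nio.file.Files.readAllLines(path, StandardCharsets.UTF_8)`, with `topN` set to 20000 and `minLen` to 4 to mirror the limits stated above.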
Constructor Summary
Constructors
Constructor                Description
TopCommonTokenCounter()