Package org.apache.tika.eval.app.tools
Class TopCommonTokenCounter
java.lang.Object
org.apache.tika.eval.app.tools.TopCommonTokenCounter
Utility class that reads in a UTF-8 input file with one document per row
and outputs the 20000 tokens with the highest document frequencies.
The CommonTokensAnalyzer intentionally drops tokens shorter than 4 characters, but includes bigrams for CJK.
It also has an include list for __email__ and __url__ and a skip list for common HTML markup terms.
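The document-frequency idea described above can be illustrated with a minimal sketch. This is not the class's actual implementation: the class name DocFreqSketch, the method topTokens, and the regex-based tokenizer are all hypothetical stand-ins chosen for illustration, and the sketch omits the CJK bigrams, the include list, and the skip list. It only shows how each row counts a token at most once, how short tokens are dropped, and how the top N tokens are selected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of org.apache.tika.eval.app.tools.
class DocFreqSketch {

    /**
     * Returns up to topN tokens ordered by descending document frequency.
     * Ties are broken alphabetically (an assumption made here so the order is deterministic).
     */
    static List<String> topTokens(List<String> docs, int topN, int minLength) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Collect distinct tokens so each document counts a token at most once.
            Set<String> seen = new HashSet<>();
            // Simplified tokenizer (assumption): lowercase, split on anything
            // that is not a letter, digit, or underscore.
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}_]+")) {
                if (token.length() >= minLength) {
                    seen.add(token);
                }
            }
            for (String token : seen) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        List<String> tokens = new ArrayList<>(docFreq.keySet());
        // Highest document frequency first, then alphabetical.
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return new ArrayList<>(tokens.subList(0, Math.min(topN, tokens.size())));
    }
}
```

To mirror the behavior described above, the rows could be loaded with `Files.readAllLines(path, StandardCharsets.UTF_8)` and passed in with `topN` set to 20000 and `minLength` set to 4.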
Constructor Summary
Method Summary
Constructor Details
TopCommonTokenCounter
public TopCommonTokenCounter()
Method Details
main
Throws:
Exception