Class TopCommonTokenCounter


  • public class TopCommonTokenCounter
    extends Object
    Utility class that reads in a UTF-8 input file with one document per row and outputs the 20000 tokens with the highest document frequencies.

    The CommonTokensAnalyzer intentionally drops tokens shorter than 4 characters, but includes bigrams for CJK.

    It also has an include list for __email__ and __url__ and a skip list for common HTML markup terms.
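
    The document-frequency ranking described above can be illustrated with a minimal sketch. It is not this class's implementation: the class name DocFreqSketch, the method topByDocumentFrequency, and the whitespace tokenization are hypothetical stand-ins for the analyzer, and the sketch omits CJK bigrams as well as the include and skip lists. It only shows that each token is counted once per document, that tokens shorter than 4 characters are dropped, and that the top N tokens are returned in descending order of document frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the documented API.
final class DocFreqSketch {

    // Mirrors the documented rule that tokens shorter than 4 characters are dropped.
    static final int MIN_TOKEN_LENGTH = 4;

    // Each element of docs is one document (one row of the input file).
    static List<Map.Entry<String, Integer>> topByDocumentFrequency(List<String> docs, int topN) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (String doc : docs) {
            // Collect distinct tokens so a token counts once per document.
            Set<String> seen = new HashSet<>();
            // Assumption: plain whitespace splitting stands in for the real analyzer.
            for (String token : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.length() >= MIN_TOKEN_LENGTH) {
                    seen.add(token);
                }
            }
            for (String token : seen) {
                documentFrequency.merge(token, 1, Integer::sum);
            }
        }
        // Sort by descending document frequency, breaking ties alphabetically.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(documentFrequency.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(topN, entries.size())));
    }
}
```

    To apply the sketch to a file with one document per row, the rows could be loaded with Files.readAllLines(path, StandardCharsets.UTF_8) and passed in with topN set to 20000.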

    • Constructor Detail

      • TopCommonTokenCounter

        public TopCommonTokenCounter()