Utility class that reads a UTF-8 input file with one document per row
and outputs the 20,000 tokens with the highest document frequencies
(the number of documents in which a token appears).
The CommonTokensAnalyzer intentionally drops tokens shorter than 4 characters,
but includes bigrams for CJK text.
It also has a white list for the __email__ and __url__ tokens, which are kept,
and a black list for common HTML markup terms, which are dropped.
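
The counting described above can be sketched roughly as follows. This is a minimal illustration in Python, not the class's actual implementation: the regex tokenizer, the CJK character ranges, the contents of BLOCK_LIST, and the names tokenize, top_common_tokens and top_common_tokens_from_file are all assumptions made for the example.

```python
import re
from collections import Counter

MIN_TOKEN_LENGTH = 4                      # shorter tokens are dropped
ALLOW_LIST = {"__email__", "__url__"}     # always kept
BLOCK_LIST = {"href", "span", "class"}    # hypothetical HTML markup terms

# Assumed CJK ranges: kana, CJK unified ideographs, Hangul syllables.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+")


def tokenize(text):
    """Yield the tokens of one document after filtering."""
    for word in re.findall(r"\w+", text.lower()):
        if word in ALLOW_LIST:
            yield word
        elif CJK_RUN.fullmatch(word):
            # Emit overlapping character bigrams for a pure CJK run.
            for i in range(len(word) - 1):
                yield word[i:i + 2]
        elif len(word) >= MIN_TOKEN_LENGTH and word not in BLOCK_LIST:
            yield word


def top_common_tokens(documents, limit=20000):
    """Return up to `limit` (token, document frequency) pairs, highest first."""
    doc_freq = Counter()
    for doc in documents:
        # A set counts each token at most once per document.
        doc_freq.update(set(tokenize(doc)))
    return doc_freq.most_common(limit)


def top_common_tokens_from_file(path, limit=20000):
    """Read a UTF-8 file with one document per row and count its tokens."""
    with open(path, encoding="utf-8") as handle:
        return top_common_tokens(handle, limit)
```

Because each document contributes a set of tokens, a token repeated many times in a single row still adds only 1 to its document frequency.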