Package org.apache.tika.eval.tools
Class Summary

Class                        Description
BatchTopCommonTokenCounter   Utility class that runs TopCommonTokenCounter against a directory of table files (named {lang}_table.gz or Leipzig-like afr_...-sentences.txt) and writes a common tokens file to the output directory for each input table file.
CommonTokenOverlapCounter
LeipzigHelper
LeipzigSampler
SlowCompositeReaderWrapper   COPIED VERBATIM FROM LUCENE. This class forces a composite reader (e.g. a MultiReader or DirectoryReader) to emulate a LeafReader.
TopCommonTokenCounter        Utility class that reads in a UTF-8 input file with one document per row and outputs the 20000 tokens with the highest document frequencies.
TrainTestSplit
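The TopCommonTokenCounter description (one document per row, rank tokens by document frequency, keep the top N) can be illustrated with a minimal sketch. This is not the Tika implementation: the class name DocFreqSketch, the method topByDocFreq, and the lowercase-and-split-on-whitespace tokenization are assumptions made for illustration only. The point it shows is that document frequency counts each token at most once per row, regardless of how often it repeats within that row.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of org.apache.tika.eval.tools.
public class DocFreqSketch {

    // Returns up to topN tokens ordered by descending document frequency.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topByDocFreq(List<String> docs, int topN) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Track tokens already seen in this row so each counts once per document.
            Set<String> seen = new HashSet<>();
            // Assumed tokenization: lowercase, then split on whitespace.
            for (String tok : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!tok.isEmpty() && seen.add(tok)) {
                    docFreq.merge(tok, 1, Integer::sum);
                }
            }
        }
        List<String> tokens = new ArrayList<>(docFreq.keySet());
        tokens.sort(Comparator
                .comparing((String t) -> docFreq.get(t), Comparator.reverseOrder())
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(tokens.subList(0, Math.min(topN, tokens.size())));
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a UTF-8 file with one document per row.
        List<String> rows = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        // 20000 matches the cutoff stated in the class description.
        for (String token : topByDocFreq(rows, 20000)) {
            System.out.println(token);
        }
    }
}
```

Keeping the counting logic in a method that takes a list of rows, separate from the file reading in main, makes the sketch easy to exercise on a few in-memory strings.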