org.apache.tika.ml.junkdetect.tools.EvalJunkDetector

public class EvalJunkDetector extends Object

Ablation evaluation for the junk detector.

For each script's dev set, scores clean sentences alongside several corruption modes at various injection rates and string lengths. Computes per-cell Cohen's d (discrimination power) and TPR/FPR at a fixed z-score threshold.

Output files in --output-dir:

detail.tsv — one row per (script, distortion, rate, length): script, distortion, param, length, n_clean, n_corrupt, mean_clean_z, mean_corrupt_z, cohens_d, fpr, tpr
summary.tsv — macro-averaged Cohen's d and FPR/TPR per (distortion, rate, length) across all scripts.
compare.tsv — pairwise codec-comparison accuracy using the JunkDetector.compare(java.lang.String, java.lang.String, java.lang.String, java.lang.String) API, stratified by string length. This is the primary metric for the charset-arbitration use case; larger mean delta = better discrimination at that length.
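As a rough illustration of what "macro-averaged" means for summary.tsv: each script contributes one value per cell, and the summary value is the unweighted mean of those per-script values, so a script with many samples does not outweigh one with few. The sketch below is not taken from the tool's source. The class name `MacroAverage` and the script names in the usage note are hypothetical.

```java
import java.util.Map;

/** Illustrative helper, not part of Tika: macro-averaging across scripts. */
final class MacroAverage {

    /**
     * Unweighted mean of the per-script values. Every script counts once,
     * regardless of how many samples produced its value.
     */
    static double macro(Map<String, Double> perScript) {
        return perScript.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN); // no scripts: nothing to average
    }
}
```

For example, `MacroAverage.macro(Map.of("LATIN", 2.0, "HEBREW", 4.0))` averages the two per-script values with equal weight.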

Why char-remap is not in summary.tsv: The character-level wrong-codec substitution (e.g. CP1252→CP1255, replacing umlauts with Hebrew letters) is added to training at a 5% rate. At that rate it is too subtle to detect via the absolute JunkDetector.score(java.lang.String) API — z-score distributions barely separate (Cohen's d ≈ 0). The distortion trains the LR to distinguish subtly-wrong from correct decodings, which only manifests as larger pairwise deltas in JunkDetector.compare(java.lang.String, java.lang.String, java.lang.String, java.lang.String). Measuring it via summary.tsv would produce misleading d≈0 "failure" rows; see compare.tsv instead.

Cohen's d = (mean_clean_z − mean_corrupt_z) / pooled_std. Higher d = better discrimination. FPR = fraction of clean text falsely flagged; TPR = fraction of corrupted text correctly flagged. Both use threshold = −2.0.
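The per-cell statistics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's actual code: it assumes the conventional pooled standard deviation (sample variances weighted by n − 1) and assumes a string is flagged when its z-score is strictly below the threshold. The class `CellStats` and its method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative helper, not part of Tika: per-cell statistics over z-scores. */
final class CellStats {

    /** Arithmetic mean of the values. */
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Unbiased sample variance (n - 1 denominator). */
    static double variance(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - m) * (x - m);
        }
        return ss / (xs.length - 1);
    }

    /** Cohen's d = (mean_clean_z - mean_corrupt_z) / pooled_std. */
    static double cohensD(double[] cleanZ, double[] corruptZ) {
        int n1 = cleanZ.length;
        int n2 = corruptZ.length;
        // Assumption: conventional pooled variance, weighting each group by n - 1.
        double pooledVar =
                ((n1 - 1) * variance(cleanZ) + (n2 - 1) * variance(corruptZ))
                        / (n1 + n2 - 2);
        return (mean(cleanZ) - mean(corruptZ)) / Math.sqrt(pooledVar);
    }

    /**
     * Fraction of z-scores strictly below the threshold (assumed flagging rule).
     * Applied to clean text this gives FPR; applied to corrupted text, TPR.
     */
    static double flaggedFraction(double[] z, double threshold) {
        long flagged = Arrays.stream(z).filter(v -> v < threshold).count();
        return (double) flagged / z.length;
    }
}
```

With clean z-scores {0, 1, −1} and corrupted z-scores {−3, −4, −5}, both groups have unit variance, so d = (0 − (−4)) / 1 = 4; at threshold −2.0 no clean sample is flagged (FPR = 0) and every corrupted sample is (TPR = 1).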

To compare two model versions: run eval before and after, then diff the summary and compare TSVs. The "macro_d" column in summary.tsv and the "mean_delta" columns in compare.tsv are the headline metrics.
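One possible way to do the before/after comparison programmatically is sketched below. It assumes each TSV has a header row and tab-separated fields. Only the "macro_d" column name comes from the description above; the key column names passed by the caller, the class `SummaryDiff`, and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper, not part of Tika: diff one metric between two TSV runs. */
final class SummaryDiff {

    /** Maps each row's key (key columns joined by '/') to its metric value. */
    static Map<String, Double> load(List<String> lines, List<String> keyCols, String metricCol) {
        List<String> header = List.of(lines.get(0).split("\t", -1));
        int metricIdx = header.indexOf(metricCol);
        if (metricIdx < 0) {
            throw new IllegalArgumentException("missing column: " + metricCol);
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.isBlank()) {
                continue; // tolerate trailing blank lines
            }
            String[] fields = line.split("\t", -1);
            StringBuilder key = new StringBuilder();
            for (String col : keyCols) {
                int idx = header.indexOf(col);
                if (idx < 0) {
                    throw new IllegalArgumentException("missing column: " + col);
                }
                if (key.length() > 0) {
                    key.append('/');
                }
                key.append(fields[idx]);
            }
            out.put(key.toString(), Double.parseDouble(fields[metricIdx]));
        }
        return out;
    }

    /** Returns after - before for every key present in both runs. */
    static Map<String, Double> delta(Map<String, Double> before, Map<String, Double> after) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : after.entrySet()) {
            Double b = before.get(e.getKey());
            if (b != null) {
                out.put(e.getKey(), e.getValue() - b);
            }
        }
        return out;
    }
}
```

In practice the line lists would come from `java.nio.file.Files.readAllLines` on the two summary.tsv files; a positive delta in macro_d means the newer model separates clean from corrupted text better in that cell.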

Usage:

   java EvalJunkDetector \
     --model          /path/to/junkdetect.bin   (default: classpath)
     --data-dir       ~/datasets/madlad/junkdetect
     --output-dir     /path/to/results          (default: data-dir/eval)
     --split          dev|test                  (default: dev)
     --samples        200
     --compare-n      200                       (qualifying pairs per codec pair per length)
     --seed           42
     --lengths        5,9,15,30,50,100,200
     --compare-lengths 5,9,15,30,50
     --rates          0.01,0.05,0.10,0.25,0.50,0.90
     --threshold      -2.0

Constructor Summary

Constructor            Description
EvalJunkDetector()

Method Summary

Modifier and Type      Method                   Description
static void            main(String[] args)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- EvalJunkDetector

  public EvalJunkDetector()

Method Details
- main

  public static void main(String[] args) throws Exception

  Throws:
  Exception
