Class EvalJunkDetector

java.lang.Object
org.apache.tika.ml.junkdetect.tools.EvalJunkDetector

public class EvalJunkDetector extends Object
Ablation evaluation for the junk detector.

For each script's evaluation split (dev by default), scores clean sentences alongside several corruption modes at various injection rates and string lengths. Computes per-cell Cohen's d (discrimination power) and TPR/FPR at a fixed z-score threshold.

Output files in --output-dir:

  • detail.tsv — one row per (script, distortion, rate, length): script, distortion, param, length, n_clean, n_corrupt, mean_clean_z, mean_corrupt_z, cohens_d, fpr, tpr
  • summary.tsv — macro-averaged Cohen's d and FPR/TPR per (distortion, rate, length) across all scripts.
  • compare.tsv — pairwise codec-comparison accuracy using the JunkDetector.compare(String, String, String, String) API, stratified by string length. This is the primary metric for the charset-arbitration use case; a larger mean delta means better discrimination at that length.
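The macro-averaging behind summary.tsv can be pictured as an unweighted mean of the per-script Cohen's d values that share a (distortion, rate, length) cell. The following is a minimal sketch of that idea, not the tool's actual code: the class `MacroAverageSketch`, the `Cell` holder, and the script and distortion names in the usage note are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: illustrates macro-averaging, not the real EvalJunkDetector internals.
final class MacroAverageSketch {

    /** One detail.tsv-style row, reduced to the fields needed here. */
    static final class Cell {
        final String script;
        final String distortion;
        final String param;   // injection rate, kept as text so it can serve as a key
        final int length;
        final double cohensD;

        Cell(String script, String distortion, String param, int length, double cohensD) {
            this.script = script;
            this.distortion = distortion;
            this.param = param;
            this.length = length;
            this.cohensD = cohensD;
        }
    }

    /** Unweighted mean of Cohen's d across scripts, keyed by distortion/param/length. */
    static Map<String, Double> macroD(List<Cell> cells) {
        Map<String, double[]> acc = new LinkedHashMap<>(); // key -> {sum, count}
        for (Cell c : cells) {
            String key = c.distortion + "\t" + c.param + "\t" + c.length;
            double[] a = acc.computeIfAbsent(key, k -> new double[2]);
            a[0] += c.cohensD;
            a[1] += 1;
        }
        Map<String, Double> out = new LinkedHashMap<>();
        acc.forEach((k, a) -> out.put(k, a[0] / a[1]));
        return out;
    }
}
```

Each script contributes equally to the average regardless of how many sentences it has, which is what distinguishes a macro average from a pooled (micro) one.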

Why char-remap is not in summary.tsv: The character-level wrong-codec substitution (e.g. CP1252→CP1255, replacing umlauts with Hebrew letters) is added to training at a 5% rate. At that rate it is too subtle to detect via the absolute JunkDetector.score(String) API: the clean and corrupted z-score distributions barely separate (Cohen's d ≈ 0). The distortion instead trains the LR to distinguish subtly wrong decodings from correct ones, and that only shows up as larger pairwise deltas in JunkDetector.compare(String, String, String, String). Reporting it in summary.tsv would produce misleading d ≈ 0 "failure" rows; see compare.tsv instead.

Cohen's d = (mean_clean_z − mean_corrupt_z) / pooled_std. Higher d = better discrimination. FPR = fraction of clean text falsely flagged; TPR = fraction of corrupted text correctly flagged. Both use threshold = −2.0.
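The definitions above can be sketched in a few lines of Java. This is an illustration under stated assumptions rather than the tool's implementation: the class `CohensDSketch` is hypothetical, the pooled standard deviation uses the common form built from unbiased sample variances (the source does not say which variant is used), and a string is treated as flagged when its z-score falls strictly below the threshold.

```java
import java.util.Arrays;

// Hypothetical helper: illustrates the metric definitions, not the real EvalJunkDetector internals.
final class CohensDSketch {

    /** Arithmetic mean of the values. */
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Unbiased sample variance (divides by n - 1); assumes at least two values. */
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return sumSq / (xs.length - 1);
    }

    /** Cohen's d = (mean_clean_z - mean_corrupt_z) / pooled_std. */
    static double cohensD(double[] cleanZ, double[] corruptZ) {
        int n1 = cleanZ.length;
        int n2 = corruptZ.length;
        // Assumed pooling: variances weighted by their degrees of freedom.
        double pooledVar = ((n1 - 1) * sampleVariance(cleanZ)
                + (n2 - 1) * sampleVariance(corruptZ)) / (n1 + n2 - 2);
        return (mean(cleanZ) - mean(corruptZ)) / Math.sqrt(pooledVar);
    }

    /**
     * Fraction of z-scores below the threshold (assumed: lower z means more junk-like).
     * Applied to clean scores this is the FPR; applied to corrupted scores, the TPR.
     */
    static double flaggedFraction(double[] z, double threshold) {
        long flagged = Arrays.stream(z).filter(v -> v < threshold).count();
        return (double) flagged / z.length;
    }
}
```

With this sketch, `flaggedFraction(cleanZ, -2.0)` would give the FPR and `flaggedFraction(corruptZ, -2.0)` the TPR for one (script, distortion, rate, length) cell.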

To compare two model versions: run eval before and after, then diff the summary and compare TSVs. The "macro_d" column in summary.tsv and the "mean_delta" columns in compare.tsv are the headline metrics.
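A before/after comparison of the headline metric could be scripted along the following lines. This is a rough sketch with several assumptions: the class `SummaryDiffSketch` is hypothetical, the file is assumed to be tab-separated with a header row containing a `macro_d` column, and the first three columns are assumed to identify the (distortion, rate, length) cell. Check the actual header of summary.tsv before relying on it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares the macro_d column of two summary.tsv-style files.
final class SummaryDiffSketch {

    /** Maps each row key (first three columns, assumed to identify the cell) to macro_d. */
    static Map<String, Double> readMacroD(List<String> lines) {
        String[] header = lines.get(0).split("\t", -1);
        int col = List.of(header).indexOf("macro_d");
        if (col < 0) {
            throw new IllegalArgumentException("no macro_d column in header");
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.isBlank()) {
                continue; // tolerate trailing blank lines
            }
            String[] f = line.split("\t", -1);
            out.put(f[0] + "\t" + f[1] + "\t" + f[2], Double.parseDouble(f[col]));
        }
        return out;
    }

    /** Returns after minus before for every key present in both runs. */
    static Map<String, Double> delta(List<String> before, List<String> after) {
        Map<String, Double> b = readMacroD(before);
        Map<String, Double> a = readMacroD(after);
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double old = b.get(e.getKey());
            if (old != null) {
                out.put(e.getKey(), e.getValue() - old);
            }
        }
        return out;
    }
}
```

The line lists can be obtained with `java.nio.file.Files.readAllLines` on the two summary.tsv files; a positive delta indicates that the newer model separates clean from corrupted text better in that cell.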

Usage:

   java EvalJunkDetector \
     --model          /path/to/junkdetect.bin   (default: classpath)
     --data-dir       ~/datasets/madlad/junkdetect
     --output-dir     /path/to/results          (default: data-dir/eval)
     --split          dev|test                  (default: dev)
     --samples        200
     --compare-n      200                       (qualifying pairs per codec pair per length)
     --seed           42
     --lengths        5,9,15,30,50,100,200
     --compare-lengths 5,9,15,30,50
     --rates          0.01,0.05,0.10,0.25,0.50,0.90
     --threshold      -2.0
 
  • Constructor Details

    • EvalJunkDetector

      public EvalJunkDetector()
  • Method Details