Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

=== Probe length: 8B ===
[INFO] Maveniverse Nisse 0.7.0 loaded
[INFO] Nisse injecting 27 properties into User Properties
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.apache.tika:tika-parent:pom:4.0.0-SNAPSHOT
[WARNING] 'version' contains an expression but should be a constant. @ org.apache.tika:tika-parent:${revision}, /Users/tallison/Intellij/tika-main-chardet/tika-parent/pom.xml, line 35, column 12
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.apache.tika:tika:pom:4.0.0-SNAPSHOT
[WARNING] 'version' contains an expression but should be a constant. @ org.apache.tika:tika-parent:${revision}, /Users/tallison/Intellij/tika-main-chardet/tika-parent/pom.xml, line 35, column 12
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO] No need for inlining
[INFO]
[INFO] --------------< org.apache.tika:tika-langdetect-charsoup >--------------
[INFO] Building Apache Tika langdetect (built-in charsoup) 4.0.0-SNAPSHOT
[INFO]   from pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- exec:3.6.3:java (default-cli) @ tika-langdetect-charsoup ---
CharSoup strategy: STANDARD
Evaluation threads: 12
Loading test data: /Users/tallison/datasets/flores-200/flores200_dev.tsv
Flores-200 mode: normalizing xxx_Yyyy → xxx codes (multi-script variants kept as xxx_Yyyy separate classes)
Test sentences: 203,381
Loading CharSoup model: /Users/tallison/datasets/wikipedia-model-v14/langdetect-v14.bin
CharSoup model: 204 classes, 32768 buckets, flags=0xE81, ~8.1 MB heap
Evaluation routes through CharSoupLanguageDetector (script gate + confusable group collapse).
Loading OpenNLP detector(s)...
SLF4J(W): No SLF4J providers were found.
SLF4J(W): Defaulting to no-operation (NOP) logger implementation
SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
  Loaded: org.apache.tika.langdetect.opennlp.OpenNLPDetector
OpenNLP: 12 instance(s), ~79.2 MB heap
Loading Lingua detector (low accuracy mode)...
Loaded Lingua (low accuracy mode, 75 languages), ~0.0 MB heap
Loading Optimaize detector(s)...
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  Loaded: org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
Optimaize: 12 instance(s), ~94.5 MB heap
Warming up (200 iterations)...
VERIFY: predictions use 204 classes, 32,768 buckets, flags=0xE81 (from file)
CharSoup ∩ OpenNLP: 105 languages, 104,684 sentences
CharSoup ∩ Lingua: 71 languages, 70,784 sentences (Lingua covers 75)
CharSoup ∩ Optimaize: 63 languages, 61,811 sentences
Evaluating @20 ... charsoup= 80.51% opennlp= 72.85% lingua= 75.88% optimaize= 84.38%
Evaluating @50 ... charsoup= 94.09% opennlp= 85.43% lingua= 90.63% optimaize= 94.29%
Evaluating @100 ... charsoup= 97.00% opennlp= 90.12% lingua= 95.33% optimaize= 96.50%
Evaluating @150 ... charsoup= 97.47% opennlp= 90.97% lingua= 96.11% optimaize= 96.76%
Evaluating @200 ... charsoup= 97.51% opennlp= 91.11% lingua= 96.20% optimaize= 96.79%
Evaluating @500 ... charsoup= 97.52% opennlp= 91.12% lingua= 96.22% optimaize= 96.81%
Evaluating full ...
charsoup= 97.52% opennlp= 91.12% lingua= 96.22% optimaize= 96.81%

=== Language Detection Comparison Report ===
Test sentences: 203,381
CharSoup ∩ OpenNLP: 105 languages, 104,684 sentences
CharSoup ∩ Lingua: 71 languages, 70,784 sentences
CharSoup ∩ Optimaize: 63 languages, 61,811 sentences

Model heap (approx):
  CharSoup:  ~8.1 MB
  OpenNLP:   ~79.2 MB
  Lingua:    ~0.1 MB (low accuracy mode)
  Optimaize: ~94.5 MB

Coverage-adjusted accuracy — each detector scored on its own supported languages only
(test sentences whose true language is not in a detector's covered set are skipped)

        ─ CharSoup ─    ─ OpenNLP ─     ── Lingua ──    ─ Optimaize ─   CS(ms)  ON(ms)  Li(ms) Opt(ms)  CS sent/s
Length     mF1     acc     mF1     acc     mF1     acc     mF1     acc
----------------------------------------------------------------------------------------------------------------------
@20     82.51%  80.51%  74.87%  72.85%  76.35%  75.88%  84.87%  84.38%     932     769  10,122     400    139,061
@50     94.44%  94.09%  86.09%  85.43%  90.99%  90.63%  94.44%  94.29%   1,000     657  18,646   2,048    129,605
@100    96.98%  97.00%  90.25%  90.12%  95.43%  95.33%  96.51%  96.50%   1,015   1,200  30,831   2,051    127,690
@150    97.41%  97.47%  90.98%  90.97%  96.15%  96.11%  96.72%  96.76%   1,087   1,592  37,481   2,119    119,232
@200    97.45%  97.51%  91.11%  91.11%  96.23%  96.20%  96.75%  96.79%   1,115   1,820  39,759   2,127    116,238
@500    97.46%  97.52%  91.12%  91.12%  96.25%  96.22%  96.76%  96.81%   1,118   2,033  40,426   2,221    115,926
full    97.46%  97.52%  91.12%  91.12%  96.25%  96.22%  96.76%  96.81%   1,121   1,820  40,505   2,139    115,616

Breadth-weighted accuracy — all 203 FLORES languages, unsupported languages score 0
(penalises limited coverage; use this to compare total useful output across all inputs)

        ─ CharSoup ─    ─ OpenNLP ─     ── Lingua ──    ─ Optimaize ─
Length     mF1     acc     mF1     acc     mF1     acc     mF1     acc
----------------------------------------------------------------------
@20     52.43%  51.31%  42.05%  40.71%  27.46%  27.53%  26.76%  26.89%
@50     60.01%  59.96%  48.35%  47.74%  32.72%  32.88%  29.77%  30.04%
@100    61.63%  61.81%  50.68%  50.36%  34.32%  34.58%  30.43%  30.75%
@150    61.90%  62.11%  51.09%  50.84%  34.58%  34.86%  30.49%  30.83%
@200    61.93%  62.14%  51.16%  50.91%  34.61%  34.90%  30.50%  30.84%
@500    61.93%  62.15%  51.17%  50.92%  34.61%  34.90%  30.51%  30.84%
full    61.93%  62.15%  51.17%  50.92%  34.61%  34.90%  30.51%  30.84%

Strict accuracy — CharSoup ∩ OpenNLP (105 languages, 104,684 sentences)

       ── CharSoup ──   ── OpenNLP ──   CS(ms)    OpenNLP(ms)  CS sent/s
Length     mF1     acc     mF1     acc
------------------------------------------------------------------------
@20     84.23%  81.44%  76.69%  74.07%     504            228    207,706
@50     95.73%  95.12%  87.27%  86.37%     431            333    242,886
@100    98.09%  97.93%  91.05%  90.79%     523            620    200,161
@150    98.47%  98.36%  91.68%  91.55%     578            852    181,114
@200    98.51%  98.40%  91.77%  91.65%     583            969    179,561
@500    98.51%  98.41%  91.78%  91.67%     593            956    176,533
full    98.51%  98.41%  91.78%  91.67%     604            959    173,318

Strict accuracy — CharSoup ∩ Lingua (71 languages, 70,784 sentences)

       ── CharSoup ──   ── Lingua ──    CS(ms)     Lingua(ms)  CS sent/s
Length     mF1     acc     mF1     acc
------------------------------------------------------------------------
@20     85.28%  81.44%  77.78%  76.80%     303          2,960    233,611
@50     96.37%  95.65%  92.24%  91.51%     296          5,992    239,135
@100    98.51%  98.37%  96.58%  96.20%     359          9,756    197,170
@150    98.80%  98.70%  97.27%  96.96%     395         11,847    179,200
@200    98.82%  98.73%  97.37%  97.07%     437         12,615    161,977
@500    98.83%  98.75%  97.38%  97.09%     409         12,840    173,066
full    98.83%  98.75%  97.38%  97.09%     402         12,847    176,080

Strict accuracy — CharSoup ∩ Optimaize (63 languages, 61,811 sentences)

       ── CharSoup ──  ── Optimaize ──  CS(ms)  Optimaize(ms)  CS sent/s
Length     mF1     acc     mF1     acc
------------------------------------------------------------------------
@20     86.52%  82.78%  86.30%  85.54%     215             76    287,493
@50     96.92%  96.15%  95.40%  95.14%     250            367    247,244
@100    98.84%  98.66%  97.23%  97.17%     320            360    193,159
@150    99.12%  98.98%  97.33%  97.34%     342            374    180,734
@200    99.13%  99.01%  97.34%  97.35%     346            374    178,645
@500    99.14%  99.02%  97.34%  97.35%     354            377    174,607
full    99.14%  99.02%  97.34%  97.35%     353            385    175,102

CharSoup timing (wall-clock, full pipeline including script gate + group collapse):

Length  Wall(ms)    Sent/sec
--------------------------------
@20          932     139,061
@50        1,000     129,605
@100       1,015     127,690
@150       1,087     119,232
@200       1,115     116,238
@500       1,118     115,926
full       1,121     115,616

Per-language CharSoup F1 by length:

Language      @20      @50     @100     @150     @200     @500     full
----------------------------------------------------------------------
ace        75.76%   90.52%   95.50%   96.15%   96.31%   96.31%   96.31%
afr        79.05%   96.87%   99.40%   99.55%   99.55%   99.55%   99.55%
aka        86.13%   96.00%   98.73%   99.19%   99.24%   99.24%   99.24%
amh        96.55%   99.75%   99.95%   99.95%   99.95%   99.95%   99.95%
ara        96.97%   99.95%   99.95%   99.95%   99.95%   99.95%   99.95%
asm        94.88%   99.65%  100.00%  100.00%  100.00%  100.00%  100.00%
azb        61.48%   78.64%   88.32%   90.51%   90.93%   90.93%   90.93%
aze        86.02%   98.27%   99.70%   99.80%   99.80%   99.80%   99.80%
bak        86.65%   98.33%   99.70%   99.85%   99.80%   99.80%   99.80%
ban        66.78%   90.21%   95.67%   96.37%   96.43%   96.48%   96.48%
bel        79.61%   89.31%   93.72%   95.27%   95.77%   96.04%   96.04%
ben        95.04%   99.65%  100.00%  100.00%  100.00%  100.00%  100.00%
bjn        63.37%   87.76%   95.56%   96.64%   96.69%   96.69%   96.69%
bod        99.85%   99.85%  100.00%  100.00%  100.00%  100.00%  100.00%
bul        79.49%   97.29%   99.65%   99.65%   99.70%   99.70%   99.70%
cat        76.97%   97.07%   99.85%   99.90%   99.90%   99.90%   99.90%
ceb        69.18%   85.41%   89.13%   90.07%   90.13%   89.68%   89.68%
ces        80.35%   97.56%   99.70%   99.80%   99.80%   99.80%   99.80%
ckb        98.79%   99.95%  100.00%  100.00%  100.00%  100.00%  100.00%
cym        92.49%   99.45%  100.00%  100.00%  100.00%  100.00%  100.00%
dan        64.55%   88.74%   95.20%   97.03%   97.23%   97.28%   97.28%
deu        76.67%   96.48%   99.50%   99.65%   99.65%   99.65%   99.65%
ell        99.19%   99.90%  100.00%  100.00%  100.00%  100.00%  100.00%
eng        71.32%   89.32%   96.83%   98.76%   98.96%   98.96%   98.96%
epo        79.13%   97.22%   99.80%   99.75%   99.75%   99.75%   99.75%
est        84.39%   98.85%   99.80%   99.70%   99.70%   99.70%   99.70%
eus        85.02%   98.94%   99.95%  100.00%  100.00%  100.00%  100.00%
ewe        89.63%   98.21%   99.24%   99.55%   99.55%   99.55%   99.55%
fao        83.36%   97.32%   99.60%   99.70%   99.70%   99.70%   99.70%
fas        82.05%   90.25%   93.38%   94.05%   94.22%   94.27%   94.27%
fin        85.52%   99.40%   99.90%  100.00%  100.00%  100.00%  100.00%
fra        80.60%   97.94%   99.75%   99.90%   99.90%   99.90%   99.90%
gla        91.03%   99.30%   99.95%  100.00%  100.00%  100.00%  100.00%
gle        91.04%   99.20%   99.95%  100.00%  100.00%  100.00%  100.00%
glg        66.96%   92.77%   98.54%   99.25%   99.30%   99.25%   99.25%
grn        87.45%   98.32%   99.75%   99.75%   99.80%   99.75%   99.75%
guj        99.90%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
hau        87.08%   98.02%   99.90%   99.95%   99.95%   99.95%   99.95%
heb        99.09%   99.85%  100.00%  100.00%  100.00%  100.00%  100.00%
hin        67.83%   84.81%   94.29%   96.31%   96.45%   96.50%   96.50%
hrv        79.00%   96.91%   99.45%   99.60%   99.60%   99.60%   99.60%
hun        89.96%   99.14%   99.95%  100.00%  100.00%  100.00%  100.00%
hye        88.85%   97.74%   99.50%   99.80%   99.80%   99.80%   99.80%
ibo        92.52%   99.20%   99.90%   99.95%   99.95%   99.95%   99.95%
ilo        83.73%   98.35%   99.65%   99.90%   99.90%   99.90%   99.90%
ind        48.96%   64.99%   74.69%   78.30%   78.34%   78.42%   78.42%
isl        87.26%   97.65%   99.65%   99.70%   99.70%   99.70%   99.70%
ita        72.86%   96.39%   99.20%   99.55%   99.60%   99.60%   99.60%
jav        65.42%   90.74%   97.48%   98.17%   98.32%   98.36%   98.36%
jpn        99.75%   99.85%  100.00%  100.00%  100.00%  100.00%  100.00%
kab        92.64%   99.70%   99.90%   99.95%   99.95%   99.95%   99.95%
kan        99.70%   99.95%  100.00%  100.00%  100.00%  100.00%  100.00%
kat        96.47%   99.60%  100.00%  100.00%  100.00%  100.00%  100.00%
kaz        88.65%   98.99%   99.85%   99.90%   99.90%   99.90%   99.90%
khm        98.68%   99.85%   99.90%   99.95%   99.95%   99.95%   99.95%
kin        87.16%   98.94%   99.90%   99.95%   99.95%   99.95%   99.95%
kir        84.53%   98.54%   99.75%   99.80%   99.80%   99.80%   99.80%
kor        99.70%   99.95%  100.00%  100.00%  100.00%  100.00%  100.00%
kur        86.71%   98.13%   99.70%   99.90%   99.90%   99.90%   99.90%
lao        96.47%   99.19%   99.85%   99.95%   99.95%   99.95%   99.95%
lav        89.57%   99.45%  100.00%  100.00%  100.00%  100.00%  100.00%
lim        72.83%   94.37%   98.23%   98.48%   98.48%   98.48%   98.48%
lit        84.98%   99.14%   99.95%  100.00%  100.00%  100.00%  100.00%
ltz        72.99%   95.15%   99.30%   99.75%   99.80%   99.80%   99.80%
lug        88.63%   98.33%   99.80%   99.70%   99.70%   99.70%   99.70%
lus        78.91%   96.19%   99.19%   99.70%   99.70%   99.70%   99.70%
mal        99.70%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
mar        68.58%   89.27%   96.75%   97.71%   97.76%   97.76%   97.76%
min        71.18%   92.77%   97.66%   98.17%   98.17%   98.17%   98.17%
mkd        81.88%   96.57%   98.96%   99.35%   99.40%   99.40%   99.40%
mlg        93.11%   99.09%   99.90%   99.90%   99.90%   99.90%   99.90%
mlt        89.52%   99.40%   99.95%   99.95%   99.95%   99.95%   99.95%
mon        93.76%   99.65%  100.00%  100.00%  100.00%  100.00%  100.00%
msa        53.19%   70.58%   79.67%   82.07%   82.07%   82.28%   82.28%
mya        98.63%   99.60%   99.90%  100.00%  100.00%  100.00%  100.00%
nep        69.72%   87.89%   96.84%   98.23%   98.38%   98.43%   98.43%
nld        70.17%   91.92%   98.33%   98.69%   98.74%   98.80%   98.80%
nno        62.16%   86.99%   95.10%   95.78%   95.89%   95.89%   95.89%
nob        55.16%   80.25%   91.22%   93.49%   93.67%   93.73%   93.73%
nso        59.08%   78.00%   83.14%   86.45%   86.96%   87.47%   87.47%
nya        75.76%   93.55%   97.48%   98.26%   98.07%   98.17%   98.17%
ori        99.65%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
pan        99.75%   99.75%   99.90%   99.95%   99.95%   99.95%   99.95%
pap        75.85%   96.37%   99.34%   99.50%   99.55%   99.55%   99.55%
pol        77.95%   88.59%   91.76%   93.31%   93.66%   93.53%   93.53%
por        71.57%   94.71%   99.00%   99.60%   99.60%   99.60%   99.60%
pus        90.86%   98.14%   99.65%   99.80%   99.80%   99.80%   99.80%
ron        85.11%   98.89%   99.80%   99.90%   99.85%   99.85%   99.85%
rus        81.44%   98.12%   99.75%   99.80%   99.80%   99.80%   99.80%
san        66.37%   85.60%   94.81%   96.17%   96.33%   96.33%   96.33%
sat       100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
sin        99.70%   99.90%   99.95%   99.95%   99.95%   99.95%   99.95%
slk        78.03%   97.69%   99.70%   99.75%   99.75%   99.75%   99.75%
slv        77.88%   96.86%   99.55%   99.65%   99.65%   99.65%   99.65%
smo        90.06%   99.25%   99.75%   99.90%   99.95%   99.95%   99.95%
sna        81.50%   98.84%   99.25%   99.50%   99.55%   99.55%   99.55%
snd        96.94%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
som        91.49%   99.30%   99.90%   99.90%   99.90%   99.90%   99.90%
spa        67.16%   92.46%   98.56%   99.55%   99.55%   99.55%   99.55%
sqi        90.69%   99.50%   99.95%  100.00%  100.00%  100.00%  100.00%
srp        81.62%   96.99%   99.04%   99.45%   99.45%   99.45%   99.45%
sun        62.71%   86.69%   94.68%   95.94%   96.09%   96.09%   96.09%
swe        68.70%   93.19%   98.64%   98.69%   98.74%   98.74%   98.74%
swh        83.89%   98.25%   99.60%   99.65%   99.65%   99.65%   99.65%
szl        71.08%   85.39%   90.02%   92.16%   92.68%   92.51%   92.51%
tam        99.75%   99.85%   99.95%  100.00%  100.00%  100.00%  100.00%
tat        81.83%   97.91%   99.55%   99.80%   99.80%   99.80%   99.80%
tel        98.16%   99.19%   99.55%   99.70%   99.75%   99.75%   99.75%
tgk        92.72%   99.75%  100.00%  100.00%  100.00%  100.00%  100.00%
tgl        76.57%   95.52%   99.10%   99.40%   99.45%   99.45%   99.45%
tha        98.78%   99.80%  100.00%  100.00%  100.00%  100.00%  100.00%
tir        97.01%   99.90%   99.95%   99.95%   99.95%   99.95%   99.95%
tsn        72.74%   84.73%   87.50%   89.45%   89.77%   90.09%   90.09%
tso        87.72%   98.12%   99.55%   99.65%   99.65%   99.65%   99.65%
tuk        86.99%   98.84%  100.00%  100.00%  100.00%  100.00%  100.00%
tum        76.64%   94.25%   97.70%   98.32%   98.17%   98.27%   98.27%
tur        74.03%   95.70%   99.55%   99.60%   99.60%   99.60%   99.60%
uig        98.63%   99.95%  100.00%  100.00%  100.00%  100.00%  100.00%
ukr        88.63%   99.45%   99.90%   99.90%   99.90%   99.90%   99.90%
urd        87.93%   97.53%   99.70%   99.95%   99.95%   99.95%   99.95%
uzb        78.39%   97.63%   99.65%   99.85%   99.85%   99.85%   99.85%
vie        95.37%   99.80%  100.00%  100.00%  100.00%  100.00%  100.00%
war        55.01%   70.51%   77.89%   81.10%   81.45%   81.10%   81.10%
xho        68.76%   83.11%   89.89%   90.87%   91.00%   91.00%   91.00%
ydd        99.25%   99.95%  100.00%  100.00%  100.00%  100.00%  100.00%
yor        88.40%   98.75%   99.70%   99.85%   99.85%   99.85%   99.85%
yue         2.34%    0.99%    0.99%    0.99%    0.99%    0.99%    0.99%
zho        79.04%   79.84%   79.92%   79.92%   79.92%   79.92%   79.92%
zul        63.46%   78.70%   87.89%   89.29%   89.52%   89.52%   89.52%

Per-language macro F1 (full):

Language  CharSoup   OpenNLP    Lingua Optimaize
----------------------------------------------------------
ace         96.31%       N/A       N/A       N/A
afr         99.55%    96.34%    96.40%    98.61%
aka         99.24%       N/A       N/A       N/A
amh         99.95%    99.95%       N/A       N/A
ara         99.95%   100.00%   100.00%   100.00%
asm        100.00%    99.90%       N/A       N/A
azb         90.93%       N/A       N/A       N/A
aze         99.80%    99.25%    98.99%       N/A
bak         99.80%    97.85%       N/A       N/A
ban         96.48%    43.66%       N/A       N/A
bel         96.04%   100.00%   100.00%   100.00%
ben        100.00%    99.85%   100.00%   100.00%
bjn         96.69%       N/A       N/A       N/A
bod        100.00%       N/A       N/A       N/A
bul         99.70%    98.33%    98.11%    99.14%
cat         99.90%    98.03%    98.43%    86.74%
ceb         89.68%    32.41%       N/A       N/A
ces         99.80%    99.50%    98.79%    99.90%
ckb        100.00%     0.00%       N/A       N/A
cym        100.00%    99.95%    98.66%    99.85%
dan         97.28%    95.45%    95.01%    98.06%
deu         99.65%    99.30%    99.65%    99.40%
ell        100.00%    99.95%   100.00%   100.00%
eng         98.96%    97.22%    97.85%    98.13%
epo         99.75%    98.75%    98.34%       N/A
est         99.70%    78.32%    99.50%    99.80%
eus        100.00%    98.90%    99.15%    99.85%
ewe         99.55%       N/A       N/A       N/A
fao         99.70%    97.72%       N/A       N/A
fas         94.27%     0.59%    99.30%    99.95%
fin        100.00%    99.39%    99.60%    99.65%
fra         99.90%    99.25%    99.40%    99.55%
gla        100.00%    99.55%       N/A       N/A
gle        100.00%    99.50%    99.65%    99.95%
glg         99.25%    95.11%       N/A    97.22%
grn         99.75%       N/A       N/A       N/A
guj        100.00%   100.00%   100.00%   100.00%
hau         99.95%    97.59%       N/A       N/A
heb        100.00%    99.85%   100.00%    99.95%
hin         96.50%    89.24%    88.27%    99.90%
hrv         99.60%    66.67%    68.24%    98.65%
hun        100.00%    99.75%    99.80%   100.00%
hye         99.80%   100.00%   100.00%       N/A
ibo         99.95%    99.55%       N/A       N/A
ilo         99.90%       N/A       N/A       N/A
ind         78.42%    36.55%    78.31%    69.10%
isl         99.70%    97.91%    99.70%   100.00%
ita         99.60%    97.93%    98.30%    99.25%
jav         98.36%    73.06%       N/A       N/A
jpn        100.00%    99.65%   100.00%    68.12%
kab         99.95%       N/A       N/A       N/A
kan        100.00%   100.00%       N/A   100.00%
kat        100.00%   100.00%   100.00%       N/A
kaz         99.90%    99.40%    97.61%       N/A
khm         99.95%    99.95%       N/A   100.00%
kin         99.95%    99.04%       N/A       N/A
kir         99.80%    98.78%       N/A       N/A
kor        100.00%    99.55%   100.00%    99.80%
kur         99.90%    96.79%       N/A       N/A
lao         99.95%    99.95%       N/A       N/A
lav        100.00%    67.86%    99.30%   100.00%
lim         98.48%    86.96%       N/A       N/A
lit        100.00%    99.24%    99.50%    99.90%
ltz         99.80%    98.99%       N/A       N/A
lug         99.70%    97.64%    98.84%       N/A
lus         99.70%       N/A       N/A       N/A
mal        100.00%   100.00%       N/A   100.00%
mar         97.76%    97.97%    89.67%   100.00%
min         98.17%    80.27%       N/A       N/A
mkd         99.40%    97.59%    97.23%    96.98%
mlg         99.90%    98.13%       N/A       N/A
mlt         99.95%    99.34%       N/A    99.90%
mon        100.00%    99.90%    99.15%       N/A
msa         82.28%    67.94%    80.78%    40.12%
mya        100.00%   100.00%       N/A       N/A
nep         98.43%    97.56%       N/A    99.90%
nld         98.80%    90.66%    96.43%    98.53%
nno         95.89%    88.95%    89.86%       N/A
nob         93.73%    87.81%    87.90%       N/A
nso         87.47%    95.91%       N/A       N/A
nya         98.17%       N/A       N/A       N/A
ori        100.00%   100.00%       N/A       N/A
pan         99.95%    99.95%   100.00%   100.00%
pap         99.55%       N/A       N/A       N/A
pol         93.53%    99.70%    99.65%    99.95%
por         99.60%    98.03%    98.40%    98.19%
pus         99.80%    95.25%       N/A       N/A
ron         99.85%    98.95%    97.34%   100.00%
rus         99.80%    98.22%    98.51%    99.65%
san         96.33%    84.87%       N/A       N/A
sat        100.00%       N/A       N/A       N/A
sin         99.95%   100.00%       N/A       N/A
slk         99.75%    99.14%    98.55%    99.70%
slv         99.65%    97.72%    98.03%    98.60%
smo         99.95%       N/A       N/A       N/A
sna         99.55%       N/A    99.00%       N/A
snd        100.00%    99.95%       N/A       N/A
som         99.90%    99.60%    99.80%    99.95%
spa         99.55%    84.02%    98.05%    92.82%
sqi        100.00%    99.70%    99.70%   100.00%
srp         99.45%    99.14%    97.61%    97.28%
sun         96.09%       N/A       N/A       N/A
swe         98.74%    98.42%    98.45%    99.50%
swh         99.65%       N/A       N/A       N/A
szl         92.51%       N/A       N/A       N/A
tam        100.00%   100.00%   100.00%   100.00%
tat         99.80%    97.72%       N/A       N/A
tel         99.75%    99.55%    99.95%   100.00%
tgk        100.00%   100.00%       N/A       N/A
tgl         99.45%    54.09%    96.87%    99.65%
tha        100.00%    99.55%    97.90%   100.00%
tir         99.95%       N/A       N/A       N/A
tsn         90.09%    95.18%    92.46%       N/A
tso         99.65%       N/A    98.90%       N/A
tuk        100.00%    99.75%       N/A       N/A
tum         98.27%       N/A       N/A       N/A
tur         99.60%    98.53%    98.85%    99.95%
uig        100.00%    91.88%       N/A       N/A
ukr         99.90%    99.60%    97.86%    99.85%
urd         99.95%    96.75%    99.30%   100.00%
uzb         99.85%    98.21%       N/A       N/A
vie        100.00%    99.90%    99.05%   100.00%
war         81.10%    11.58%       N/A       N/A
xho         91.00%    84.42%    90.94%       N/A
ydd        100.00%       N/A       N/A       N/A
yor         99.85%    98.58%    97.38%       N/A
yue          0.99%       N/A       N/A       N/A
zho         79.92%       N/A   100.00%    87.76%
zul         89.52%    81.80%    91.63%       N/A

CharSoup top confusions (languages with F1 < 95%, @20):

TrueLabel      F1  Top misclassifications (predicted → count)
------------------------------------------------------------------------
yue          2.3%  zho→967, eng→4, ilo→2, spa→1, tay→1, tur→1, bar→1
ind         49.0%  msa→267, jav→27, bjn→25, sun→20, ban→19, min→17, tet→11
msa         53.2%  ind→239, jav→24, ban→14, sun→14, bjn→13, min→9, pam→7
war         55.0%  bcl→183, ceb→122, hil→102, som→14, bre→10, ilo→10, diq→9
nob         55.2%  dan→129, nno→125, swe→14, vls→10, diq→9, fry→9, spa→6
nso         59.1%  tsn→312, diq→20, smo→19, kur→9, bre→8, kha→8, cym→8
azb         61.5%  fas→217, mzn→115, pnb→85, pus→62, urd→41, ara→8, snd→6
nno         62.2%  nob→154, dan→37, swe→19, diq→15, bre→7, eus→7, lav→6
sun         62.7%  jav→50, msa→46, ban→29, ind→26, bjn→17, min→13, diq→9
bjn         63.4%  msa→64, min→60, sun→54, ind→41, jav→32, szy→17, ban→16
zul         63.5%  xho→368, kin→10, nya→10, hrv→8, nob→4, ibo→4, lug→4
dan         64.6%  nob→136, nno→65, swe→21, diq→10, deu→7, afr→7, ltz→7
jav         65.4%  sun→43, msa→42, ind→28, ban→27, diq→16, bjn→11, min→11
san         66.4%  hin→183, mar→106, nep→92, gom→6, ltz→3, bre→3, tur→1
ban         66.8%  jav→68, ind→58, msa→54, sun→20, bjn→15, est→10, pam→10
glg         67.0%  por→89, arg→60, spa→32, cat→26, lfn→23, ina→20, mwl→14
spa         67.2%  cat→55, lfn→49, glg→37, mwl→23, arg→23, ina→15, roh→14
hin         67.8%  nep→89, mar→66, san→41, gom→6, spa→1
mar         68.6%  hin→187, san→85, nep→72, gom→9
swe         68.7%  nno→70, nob→64, dan→39, diq→14, fao→12, isl→8, cat→8
xho         68.8%  zul→96, kin→13, nya→9, tso→7, smo→5, eng→5, hrv→4
ceb         69.2%  tgl→139, hil→88, bcl→33, ilo→6, szy→5, war→4, lus→3
nep         69.7%  hin→180, san→79, mar→65, gom→2, por→1, trv→1
nld         70.2%  vls→66, afr→64, lim→47, nds→37, gsw→20, deu→17, ltz→17
szl         71.1%  pol→276, hsb→17, ces→11, slk→10, hrv→8, diq→5, slv→5
min         71.2%  bjn→46, ind→28, jav→26, sun→24, msa→20, ban→13, tgl→13
eng         71.3%  diq→14, tsn→14, ile→13, frr→11, ina→11, fra→10, lat→9
por         71.6%  glg→81, ina→24, cat→23, arg→21, spa→19, mwl→16, lfn→9
tsn         72.7%  nso→50, smo→16, diq→9, bre→5, yor→5, ltz→5, kha→4
lim         72.8%  vls→35, afr→33, nld→28, fry→18, ron→17, nds→14, frr→12
ita         72.9%  cos→57, ina→38, roh→33, lfn→24, cat→19, ido→13, mwl→11
ltz         73.0%  gsw→41, deu→25, nds→24, nob→13, lim→10, fry→10, swe→9
tur         74.0%  diq→113, aze→22, tuk→15, slv→7, bar→5, bre→5, uzb→5
ace         75.8%  sun→116, ind→26, min→18, msa→18, ban→12, avk→11, jav→9
nya         75.8%  tum→59, swh→16, diq→16, zul→7, lug→4, gom→3, lus→3
pap         75.8%  ido→32, jav→16, diq→15, bre→15, lfn→15, tet→13, spa→11
tgl         76.6%  ceb→67, bcl→38, hil→30, lus→6, ban→6, pam→6, jav→5
tum         76.6%  nya→143, swh→31, sna→12, kin→10, xho→8, lug→8, diq→6
deu         76.7%  gsw→50, afr→46, bar→21, ltz→18, dan→13, nds→10, pfl→9
cat         77.0%  lfn→20, arg→15, spa→10, fra→9, wln→9, roh→9, diq→8
slv         77.9%  hrv→66, ces→15, slk→12, hsb→8, yor→7, diq→5, epo→4
pol         78.0%  szl→16, hsb→13, slv→11, ces→8, slk→6, yor→5, diq→4
slk         78.0%  ces→76, slv→12, hrv→8, lav→6, diq→5, hun→4, gom→4
uzb         78.4%  diq→42, aze→15, kaa→13, tuk→12, hau→10, som→7, mlt→7
lus         78.9%  cnh→51, eng→26, ltz→9, diq→8, cat→8, cor→7, lat→7
hrv         79.0%  slv→104, slk→17, ces→8, hsb→7, diq→5, cos→4, est→3
zho         79.0%  yue→17, szy→7, eng→5, spa→3, frr→3, tay→3, ilo→3
afr         79.1%  nld→35, vls→23, lim→17, nds→17, gsw→14, deu→10, frr→8
epo         79.1%  ido→52, lfn→16, diq→11, por→10, bre→9, slv→8, spa→7
bul         79.5%  mkd→87, rus→26, srp→21, mhr→7, tgk→7, ukr→7, bel→5
bel         79.6%  be-x-old→276, ukr→11, kir→3, tgk→3, rus→2, srp→1, tat→1
ces         80.4%  slk→67, hrv→10, slv→9, hsb→8, diq→5, yor→5, epo→5
fra         80.6%  wln→47, cat→38, bre→15, ina→14, ltz→9, lfn→8, ron→6
rus         81.4%  bul→64, srp→28, ukr→26, mkd→21, bel→11, rue→10, be-x-old→9
sna         81.5%  nya→29, tum→18, kin→17, xho→15, swh→13, diq→9, jav→7
srp         81.6%  mkd→108, bul→44, ukr→15, rus→14, tgk→8, che→5, bel→3
tat         81.8%  bak→45, kir→36, rus→19, bul→12, srp→11, sah→10, che→10
mkd         81.9%  bul→54, srp→46, rus→8, ukr→5, tat→3, mon→2, ava→2
fas         82.1%  mzn→67, pnb→19, pus→18, urd→8, azb→5, spa→2, tur→1
fao         83.4%  isl→71, nno→29, diq→5, est→4, bre→4, lat→4, mlt→4
ilo         83.7%  szy→15, sun→13, hil→10, diq→8, tgl→8, hau→7, bcl→7
swh         83.9%  tum→12, diq→9, kin→8, sna→8, nya→8, hau→6, jav→5
est         84.4%  vro→13, vep→11, fin→8, gsw→7, diq→6, frr→6, ban→3
kir         84.5%  kaz→22, tyv→19, rus→17, mon→10, tat→10, tgk→9, alt→8
lit         85.0%  lav→37, ido→16, sgs→11, epo→8, hrv→8, vep→7, slv→6
eus         85.0%  diq→11, hau→7, tet→7, epo→7, slv→7, min→7, avk→7
ron         85.1%  cat→16, lfn→10, lat→9, bre→8, ina→8, por→6, arg→5
fin         85.5%  est→24, olo→16, vro→9, vep→5, frr→5, smn→5, ltz→5
aze         86.0%  tur→60, diq→37, kaa→11, kur→7, ido→4, fin→3, bre→3
aka         86.1%  diq→8, eng→8, ltz→8, lat→7, cor→5, nds→5, bre→4
bak         86.6%  tat→40, kir→17, rus→14, tyv→12, kaz→11, che→10, tgk→9
kur         86.7%  diq→64, ido→5, msa→5, ita→4, frr→4, mlg→3, slk→3
tuk         87.0%  tur→20, diq→18, jav→6, avk→5, bre→4, fao→4, yor→4
hau         87.1%  diq→6, trv→5, som→4, ltz→4, swh→3, bre→3, kha→3
kin         87.2%  swh→12, sna→9, xho→9, diq→7, yor→6, nya→5, lug→5
isl         87.3%  fao→52, nno→11, bar→5, nob→3, bre→3, lat→3, hun→2
grn         87.4%  spa→13, glg→12, por→11, diq→7, tet→7, epo→7, cat→6
tso         87.7%  nya→15, diq→12, swh→11, cos→5, fra→5, sna→5, ltz→5
urd         87.9%  pnb→118, fas→17, skr→13, mzn→6, snd→4, azb→3, pus→3
yor         88.4%  diq→5, swh→3, jav→3, ilo→3, slk→3, ron→3, cos→2
ukr         88.6%  bul→29, srp→19, rus→15, mkd→12, tgk→12, bel→11, be-x-old→5
lug         88.6%  kin→17, xho→16, swh→9, diq→9, nya→9, jav→5, szy→5
kaz         88.7%  kir→27, tgk→15, bul→13, tat→12, bak→9, tyv→9, rus→9
hye         88.9%  hyw→195, spa→2, tur→1, bre→1, eng→1
mlt         89.5%  jav→7, diq→6, ltz→5, cos→4, sun→4, avk→4, cat→4
lav         89.6%  lit→5, slv→5, nob→5, diq→4, mlt→4, cos→3, slk→3
ewe         89.6%  diq→10, eng→8, aka→7, gsw→5, yor→5, ces→5, ibo→5
hun         90.0%  slk→8, ltz→7, vep→5, epo→5, eng→5, diq→4, tur→4
smo         90.1%  arg→6, ina→6, por→3, glg→3, diq→2, tsn→2, ron→2
sqi         90.7%  epo→6, hrv→5, cos→4, est→4, diq→4, mlt→4, slv→4
pus         90.9%  pnb→32, fas→11, azb→10, mzn→5, urd→4, ara→1, cor→1
gla         91.0%  gle→39, lus→6, bre→3, bar→2, arg→2, eng→2, lav→2
gle         91.0%  gla→46, lus→3, nno→3, yor→2, fry→2, trv→2, mwl→1
som         91.5%  orm→10, ltz→5, est→4, uzb→4, frr→4, avk→3, ceb→2
cym         92.5%  lat→6, gle→6, gsw→4, cor→3, vie→3, smo→3, spa→3
ibo         92.5%  swh→6, yor→4, diq→3, hsb→3, tur→2, vie→2, fao→2
kab         92.6%  diq→20, ind→4, hun→3, est→3, bre→3, frr→3, cos→2
tgk         92.7%  rus→10, che→6, srp→6, lez→5, bel→4, bul→4, mkd→3
mlg         93.1%  hrv→4, bre→3, smo→3, tet→3, lus→2, diq→2, tso→2
mon         93.8%  bxr→22, tgk→8, kir→7, rus→6, che→6, mkd→4, be-x-old→4
asm         94.9%  ben→61, sun→1

CharSoup top confusions (languages with F1 < 95%, full):

TrueLabel      F1  Top misclassifications (predicted → count)
------------------------------------------------------------------------
yue          1.0%  zho→987, eng→3, nno→1
ind         78.4%  msa→212
zho         79.9%  yue→5, szy→1, spa→1, ita→1, tay→1
war         81.1%  ceb→214, hil→76, bcl→25, kin→1, uzb→1
msa         82.3%  ind→123, bjn→1
nso         87.5%  tsn→218, som→1, epo→1, avk→1, sun→1
zul         89.5%  xho→179, cat→1, sna→1
ceb         89.7%  tgl→6, hil→4, bcl→1
tsn         90.1%  avk→1
azb         90.9%  fas→119, mzn→32, pnb→10, pus→2, ara→1, urd→1
xho         91.0%  zul→10, sna→1
szl         92.5%  pol→138, eng→1
pol         93.5%  (no misses recorded)
nob         93.7%  dan→20, nno→20, swe→1
fas         94.3%  mzn→1, eng→1

Report written to: /Users/tallison/datasets/wikipedia-model-v14/flores-v14-eval.log
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  05:36 min
[INFO] Finished at: 2026-03-19T17:50:53-04:00
[INFO] ------------------------------------------------------------------------
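
The report distinguishes coverage-adjusted scores (sentences outside a detector's supported set are skipped) from breadth-weighted scores (unsupported languages score 0). The sketch below is one possible reading of those two descriptions, not the evaluator's actual code. All function names are hypothetical, and it assumes that per-language F1 for the breadth-weighted figure is computed on the covered subset before zeros are added for uncovered languages; the log does not confirm that detail.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label that occurs in gold."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # a sentence of language g was missed
            fp[p] += 1  # and wrongly credited to language p
    scores = {}
    for label in set(gold):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def coverage_adjusted(gold, pred, supported):
    """Macro-F1 and accuracy over sentences whose true language is supported."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g in supported]
    if not pairs:
        return 0.0, 0.0
    f1 = per_class_f1([g for g, _ in pairs], [p for _, p in pairs])
    macro_f1 = sum(f1.values()) / len(f1)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return macro_f1, accuracy


def breadth_weighted(gold, pred, supported):
    """Macro-F1 and accuracy over all languages; unsupported ones score 0."""
    if not gold:
        return 0.0, 0.0
    pairs = [(g, p) for g, p in zip(gold, pred) if g in supported]
    # Assumption: F1 of supported languages comes from the covered subset.
    f1 = per_class_f1([g for g, _ in pairs], [p for _, p in pairs]) if pairs else {}
    labels = set(gold)
    macro_f1 = sum(f1.get(label, 0.0) for label in labels) / len(labels)
    accuracy = sum(g == p for g, p in pairs) / len(gold)
    return macro_f1, accuracy


# Toy data: the detector supports eng and deu but not yue.
gold = ["eng", "eng", "deu", "deu", "yue", "yue"]
pred = ["eng", "deu", "deu", "deu", "zho", "zho"]
supported = {"eng", "deu"}
print(coverage_adjusted(gold, pred, supported))
print(breadth_weighted(gold, pred, supported))
```

On the toy data the detector looks much stronger under the coverage-adjusted view than under the breadth-weighted one, the same pattern as the gap between the two summary tables in the report.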