Tarjumi — Evaluation
How we measure translation quality, the verified results, and how to reproduce them.
Principles
- Independent (Path B). The model runs separately from the metric; references are human; the eval shares no component with the system under test (no grammar kernel, no Levelt — those are the system, not the judge).
- Third-party metrics only. sacrebleu chrF++ (word_order=2, primary, best for morphologically rich African languages) + spBLEU (flores200 tokenizer). AfriCOMET (Masakhane) as a learned-metric cross-check.
- Contamination-checked. Every eval source is verified absent from training before use (0/1012 FLORES, 0/300 TICO).
- Stratified. Always by domain (general vs specialized) × resource tier (high vs low). Quality is regime-specific — a single aggregate number hides the truth.
- Significance. Gaps are reported with paired-bootstrap 95% CIs; a result counts only if the CI excludes 0.
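The significance rule above can be sketched as a paired bootstrap over sentence indices. This is a minimal illustration, not the project's implementation: the function names, the 1000-resample default, and the percentile interval are assumptions, and `exact_match` is only a toy stand-in for a corpus-level metric.

```python
import random
from typing import Callable, Sequence, Tuple

# A corpus-level metric: (hypotheses, references) -> score.
Metric = Callable[[Sequence[str], Sequence[str]], float]


def exact_match(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Toy stand-in metric: percentage of hypotheses equal to their reference."""
    return 100.0 * sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def paired_bootstrap_ci(
    base_hyps: Sequence[str],
    cand_hyps: Sequence[str],
    refs: Sequence[str],
    metric: Metric,
    n_resamples: int = 1000,  # assumed default, not taken from the source
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (gap, ci_low, ci_high) for metric(candidate) - metric(base)."""
    n = len(refs)
    if n == 0 or len(base_hyps) != n or len(cand_hyps) != n:
        raise ValueError("inputs must be non-empty and aligned sentence by sentence")
    rng = random.Random(seed)
    gap = metric(cand_hyps, refs) - metric(base_hyps, refs)
    deltas = []
    for _ in range(n_resamples):
        # Resample the SAME sentence indices for both systems (the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        r = [refs[i] for i in idx]
        b = [base_hyps[i] for i in idx]
        c = [cand_hyps[i] for i in idx]
        # Recompute the corpus-level metric on each resample.
        deltas.append(metric(c, r) - metric(b, r))
    deltas.sort()
    lo_i = int(n_resamples * alpha / 2)
    hi_i = n_resamples - 1 - lo_i
    return gap, deltas[lo_i], deltas[hi_i]


def is_significant(ci_low: float, ci_high: float) -> bool:
    """A gap counts only if the interval excludes 0."""
    return ci_low > 0 or ci_high < 0
```

For chrF++, `metric` could wrap sacrebleu, for example `lambda h, r: CHRF(word_order=2).corpus_score(list(h), [list(r)]).score` with `CHRF` imported from `sacrebleu.metrics`. Recomputing the corpus score on each resample, rather than averaging sentence scores, keeps the resampled statistic the same kind of number as the reported one.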
Datasets (golden, locked, never trained on)
- FLORES-200 devtest — general domain (wiki/news), 1012 sentences/lang.
- TICO-19 — health/medical, ~2100 sentences/lang (we use 500-sentence cells).
- Live in tarjumi_ml/eval/golden/ with split_eligible=false; training and model selection never touch them.
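A contamination check of the kind that produces counts like 0/1012 could look like the sketch below. The source does not say how absence from training is verified, so exact matching after Unicode, case, and whitespace normalization is an assumption here, and all function names are hypothetical.

```python
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, casefold, collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def find_contaminated(eval_sources: Iterable[str], train_sources: Iterable[str]) -> List[str]:
    """Return eval source sentences that also appear (after normalization) in training."""
    train_keys = {normalize(s) for s in train_sources}
    return [s for s in eval_sources if normalize(s) in train_keys]


def contamination_report(eval_sources: List[str], train_sources: Iterable[str]) -> str:
    """Format the result as 'hits/total', e.g. '0/1012' for a clean set."""
    hits = find_contaminated(eval_sources, train_sources)
    return f"{len(hits)}/{len(eval_sources)}"
```

Exact matching only catches verbatim overlap; near-duplicates would need a fuzzier comparison such as n-gram overlap.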
Verified result (2026-06-04) — v8f vs base NLLB-200, chrF++, paired bootstrap 95% CI
| lang | resource | domain | base | v8f | gap | 95% CI | sig |
|---|---|---|---|---|---|---|---|
| sw | high | FLORES | 59.94 | 59.35 | −0.59 | [−1.43, +0.21] | ns |
| sw | high | TICO | 56.08 | 54.71 | −1.36 | [−2.06, −0.70] | * (base better) |
| lg | low | FLORES | 36.17 | 36.53 | +0.35 | [−0.30, +0.99] | ns |
| lg | low | TICO | 41.11 | 44.83 | +3.71 | [+2.53, +4.93] | * (v8f better) |
| rw | low | FLORES | 45.95 | 45.88 | −0.07 | [−0.98, +0.90] | ns |
| rw | low | TICO | 42.41 | 45.08 | +2.67 | [+1.78, +3.72] | * (v8f better) |
| so | low | FLORES | 41.83 | 40.48 | −1.35 | [−1.96, −0.69] | * (base better) |
| so | low | TICO | 48.07 | 49.71 | +1.64 | [+1.02, +2.32] | * (v8f better) |
Conclusion (statistically supported): the fine-tune + grammar kernel significantly improves low-resource × health translation (lg/rw/so on TICO, all CIs exclude 0) and is neutral-to-negative on high-resource or general-domain text. It earns its place in the regime it was built for. Reranking among beam candidates was separately falsified (near-zero oracle headroom) — the lever is generation (data/model), not selection.
Promotion gate (the rule)
A candidate model ships for a (lang, domain) cell only if it beats the incumbent (or base) with the paired-bootstrap 95% CI excluding 0, and regresses no other cell. Implemented in tarjumi_ml/eval/gate.py; emits scorecard.json consumed by the deploy step.
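The rule above can be read as a small decision function. The sketch below is not the code in tarjumi_ml/eval/gate.py and does not reproduce `gate.decide()`. The `CellResult` type and `ships` function are hypothetical, and treating "regresses" as a CI lying entirely below 0 is an assumption about how the rule is meant.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (lang, domain)


@dataclass(frozen=True)
class CellResult:
    gap: float      # candidate minus incumbent, in metric points
    ci_low: float   # lower bound of the paired-bootstrap 95% CI
    ci_high: float  # upper bound of the paired-bootstrap 95% CI


def ships(target: Cell, results: Dict[Cell, CellResult]) -> bool:
    """Ship for `target` only if it wins there and no other cell regresses.

    Assumed reading: a win is a CI entirely above 0, and a regression is a
    CI entirely below 0.
    """
    wins = results[target].ci_low > 0
    regresses_elsewhere = any(
        r.ci_high < 0 for cell, r in results.items() if cell != target
    )
    return wins and not regresses_elsewhere
```

Under this reading a non-significant dip in another cell does not block promotion; a stricter gate could instead block on any negative point estimate.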
Reproduce
uv venv && uv pip install -e '.[eval]'
# point/score the committed baseline cells (no GPU/endpoint needed):
python tarjumi_ml/eval/make_matrix.py # reads tarjumi_ml/eval/baseline_2026-06-04/
# or run the gate on candidate vs incumbent hyps:
python -c "from tarjumi_ml.eval import gate; ..." # see gate.decide()
Open questions (honest)
- The per-language ki adapter (worst on FLORES) is untested on health: no public ki/luo health reference exists; it needs a human Kikuyu translation of the TICO source (golden/ki_health_translation_task.tsv, 300 sentences).
- Fine-tune vs kernel contribution not yet isolated (Track B5 ablation; gated on the registry.py:170 normalization fix, now done).
- chrF++ is surface overlap; AfriCOMET + a small native-speaker MQM panel should corroborate.