Tarjumi — Evaluation

How we measure translation quality, the verified results, and how to reproduce them.

Principles

Independent (Path B). The model runs separately from the metric; references are human; the eval shares no component with the system under test (no grammar kernel, no Levelt — those are the system, not the judge).
Third-party metrics only. sacrebleu chrF++ (word_order=2) is primary, because character n-grams cope well with morphologically rich African languages; spBLEU (flores200 tokenizer) is reported alongside it. AfriCOMET (Masakhane) is the learned-metric cross-check.
Contamination-checked. Every eval source is verified absent from training before use (0/1012 FLORES, 0/300 TICO).
Stratified. Always by domain (general vs specialized) × resource tier (high vs low). Quality is regime-specific — a single aggregate number hides the truth.
Significance. Gaps are reported with paired-bootstrap 95% CIs; a result counts only if the CI excludes 0.
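
A minimal sketch of the significance test, for orientation only. It assumes per-sentence scores for both systems on the same sentences and bootstraps the mean gap; the real pipeline may instead recompute corpus-level chrF++ on each resample. The function names, the resample count, and the seed are illustrative and are not taken from tarjumi_ml.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(base_scores, cand_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Return (gap, lo, hi) for mean(candidate) - mean(base) on paired sentences."""
    if len(base_scores) != len(cand_scores) or not base_scores:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(base_scores)
    # Resampling the per-sentence differences keeps each pair together.
    diffs = [c - b for b, c in zip(base_scores, cand_scores)]
    gaps = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    # Percentile interval: cut alpha/2 of the resampled gaps from each tail.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return fmean(diffs), gaps[lo_idx], gaps[hi_idx]


def is_significant(lo, hi):
    """A gap counts only if the interval excludes 0."""
    return lo > 0 or hi < 0
```

Usage: gap, lo, hi = paired_bootstrap_ci(base, cand), then report the gap with [lo, hi] and mark it significant only when is_significant(lo, hi) is true.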

Datasets (golden, locked, never trained on)

FLORES-200 devtest — general domain (wiki/news), 1012 sentences/lang.
TICO-19 — health/medical, ~2100 sentences/lang (we use 500-sentence cells).
Both live in tarjumi_ml/eval/golden/ with split_eligible=false, so training and model selection never touch them.
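
The contamination check listed under Principles could look roughly like the sketch below. This document does not say how the check is implemented, so exact match after light normalization is an assumption, and both function names are hypothetical. Near-duplicates (paraphrases, partial overlaps) would need an n-gram or fuzzy match on top of this.

```python
import unicodedata


def normalize(text):
    """NFKC-normalize, case-fold, and collapse whitespace so trivial variants match."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def contaminated(eval_sentences, train_sentences):
    """Return the eval sentences whose normalized form also occurs in training."""
    train_index = {normalize(s) for s in train_sentences}
    return [s for s in eval_sentences if normalize(s) in train_index]
```

Usage: hits = contaminated(eval_sources, train_sources); a clean set reports len(hits) == 0, which is the "0/N" figure quoted per dataset.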

Verified result (2026-06-04) — v8f vs base NLLB-200, chrF++, paired bootstrap 95% CI

lang  res    domain   base    v8f     gap        95% CI            sig
sw    high   FLORES   59.94   59.35   −0.59   [−1.43, +0.21]       ns
sw    high   TICO     56.08   54.71   −1.36   [−2.06, −0.70]       *  (base better)
lg    low    FLORES   36.17   36.53   +0.35   [−0.30, +0.99]       ns
lg    low    TICO     41.11   44.83   +3.71   [+2.53, +4.93]       *  (v8f better)
rw    low    FLORES   45.95   45.88   −0.07   [−0.98, +0.90]       ns
rw    low    TICO     42.41   45.08   +2.67   [+1.78, +3.72]       *  (v8f better)
so    low    FLORES   41.83   40.48   −1.35   [−1.96, −0.69]       *  (base better)
so    low    TICO     48.07   49.71   +1.64   [+1.02, +2.32]       *  (v8f better)

Conclusion (statistically supported): the fine-tune + grammar kernel significantly improves low-resource × health translation (lg/rw/so on TICO, all CIs exclude 0) and is neutral-to-negative on high-resource or general-domain text. It earns its place in the regime it was built for. Reranking among beam candidates was tested separately and ruled out (near-zero oracle headroom): the lever is generation (data/model), not selection.

Promotion gate (the rule)

A candidate model ships for a (lang, domain) cell only if it beats the incumbent (or base) with the paired-bootstrap 95% CI excluding 0, and regresses no other cell. Implemented in tarjumi_ml/eval/gate.py; emits scorecard.json consumed by the deploy step.
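
The rule can be sketched as follows. This is not the code in gate.py and does not reproduce gate.decide(); Cell and ship_cells are hypothetical names. It also assumes that "regresses" means the CI on (candidate − incumbent) lies entirely below 0, which is one possible reading of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    lang: str
    domain: str
    ci_lo: float  # lower bound of the 95% CI on (candidate - incumbent)
    ci_hi: float  # upper bound of the same interval


def wins(cell):
    """Candidate is significantly better: the whole CI is above 0."""
    return cell.ci_lo > 0


def regresses(cell):
    """Assumed reading of 'regresses': the whole CI is below 0."""
    return cell.ci_hi < 0


def ship_cells(cells):
    """Return the (lang, domain) cells where the candidate may ship."""
    shippable = []
    for cell in cells:
        others_ok = not any(regresses(o) for o in cells if o is not cell)
        if wins(cell) and others_ok:
            shippable.append((cell.lang, cell.domain))
    return shippable
```

With only the lg TICO [+2.53, +4.93] and rw FLORES [−0.98, +0.90] intervals as input, ship_cells returns [("lg", "TICO")]. Adding any cell whose CI lies entirely below 0 empties the list, which is what "regresses no other cell" enforces under this reading.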

Reproduce

uv venv && uv pip install -e '.[eval]'
# score the committed baseline cells (no GPU/endpoint needed):
python tarjumi_ml/eval/make_matrix.py            # reads tarjumi_ml/eval/baseline_2026-06-04/
# or run the gate on candidate vs incumbent hyps:
python -c "from tarjumi_ml.eval import gate; ..."  # see gate.decide()

Open questions (honest)

The per-language ki adapter (worst on FLORES) is untested on health: no public ki/luo health reference exists, so it needs a human Kikuyu translation of the TICO source (golden/ki_health_translation_task.tsv, 300 sentences).
The fine-tune and kernel contributions are not yet isolated (Track B5 ablation; gated on the registry.py:170 normalization fix, now done).
chrF++ measures surface overlap only; AfriCOMET and a small native-speaker MQM panel should corroborate.