🤖 AI Summary
Researchers present Humans-Junior, a 3.8B language model that matches GPT-4o on a public FACTS grounding subset (Q1–Q500) within a ±5 percentage-point equivalence margin: GPT-4o scored 73.5% (95% CI 69.5–77.2) vs. Humans-Junior's 72.7% (95% CI 68.7–76.5), a non-significant paired difference of 0.8 pp (permutation p = 0.72, Cohen's d = 0.023). The paper also reports substantial cost savings: when both models are served as managed APIs (Humans-Junior uses a Phi-3.5-mini-instruct base), Humans-Junior is roughly 19× cheaper than GPT-4o at Microsoft AI Foundry pricing, and self-hosted or edge deployment can drive marginal inference cost toward zero.
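To make the equivalence claim concrete, here is a minimal sketch of the kind of paired analysis these numbers imply: a sign-flip permutation test on per-item differences, a paired Cohen's d, and a check that the difference's CI sits inside the ±5 pp margin. The per-item binary grounded/ungrounded scores are simulated for illustration (the paper's exact judge outputs and test procedure may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-item outcomes (1 = grounded, 0 = not) over 500 questions;
# real data would come from the FACTS judge, not a simulation.
gpt4o = rng.binomial(1, 0.735, size=500)
humans_junior = rng.binomial(1, 0.727, size=500)

diff = gpt4o - humans_junior   # paired per-item differences
observed = diff.mean()         # observed mean difference (proportion)

# Paired sign-flip permutation test: under H0 the two labels are
# exchangeable within each pair, so randomly flipping the sign of each
# difference generates the null distribution of the mean difference.
n_perm = 10_000
flips = rng.choice([-1, 1], size=(n_perm, diff.size))
null = (flips * diff).mean(axis=1)
p_value = (np.abs(null) >= abs(observed)).mean()

# Cohen's d for paired data: mean difference / SD of the differences.
cohens_d = observed / diff.std(ddof=1)
print(f"diff = {observed * 100:.1f} pp, p = {p_value:.2f}, d = {cohens_d:.3f}")

# Equivalence check: the 95% CI of the difference must lie within ±5 pp.
se = diff.std(ddof=1) / np.sqrt(diff.size)
lo, hi = observed - 1.96 * se, observed + 1.96 * se
print(f"95% CI: [{lo * 100:.1f}, {hi * 100:.1f}] pp; "
      f"within ±5 pp: {lo > -0.05 and hi < 0.05}")
```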
Technically, the win comes from combining a minimal, directed “Exoskeleton Reasoning” scaffold with behavioral fine-tuning that enforces protocol compliance (epistemic discipline) rather than memorizing domain answers. Fine-tuning alone had little effect, but combined with the scaffold it produced a +17.7 pp accuracy gain (p < 0.001) and roughly 25% lower variance. The authors also show that prompt-only directed reasoning lifts frontier models in exploratory tests (e.g., +11.8 pp for GPT-4o on Q1–Q100). Implications: structured prompting plus alignment-focused tuning can let small models rival much larger systems on factual grounding, lowering cost and widening deployment options, though the results are tied to the FACTS subset, the judge setup, and the chosen equivalence margin.
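The summary doesn't reproduce the actual scaffold text, so the sketch below is a hypothetical prompt-only version of directed reasoning for grounded QA; the template wording, step names, and `build_grounded_prompt` helper are invented for illustration, not the paper's Exoskeleton Reasoning prompts.

```python
# Hypothetical prompt-only "directed reasoning" scaffold for grounded QA.
# The scaffold text and step names are illustrative assumptions, not the
# paper's actual Exoskeleton Reasoning protocol.
EXOSKELETON_TEMPLATE = """You must answer ONLY from the provided context.

Context:
{context}

Question: {question}

Follow these steps exactly:
1. QUOTE: list the context passages relevant to the question.
2. CHECK: state whether the quoted passages fully answer the question.
3. ANSWER: if yes, answer using only those passages; if no, reply
   "Not answerable from the provided context."
"""

def build_grounded_prompt(context: str, question: str) -> str:
    """Wrap a retrieval context and question in the directed scaffold."""
    return EXOSKELETON_TEMPLATE.format(context=context, question=question)

if __name__ == "__main__":
    print(build_grounded_prompt(
        context="Humans-Junior is a 3.8B model built on Phi-3.5-mini-instruct.",
        question="How many parameters does Humans-Junior have?",
    ))
```

The behavioral fine-tuning described in the paper would then train the small model to follow this kind of protocol reliably, rather than to memorize answer content.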