🤖 AI Summary
A new paper shows that a single character used to separate in‑context examples can make or break LLM evaluations. The authors systematically vary simple delimiters (comma, newline, semicolon, hashtag, etc.) across leading model families (Llama, Qwen, Gemma) and find performance on benchmarks like MMLU can swing by roughly ±23% depending only on that choice. By changing the separator character alone, evaluators can even reorder model rankings—exposing a fragile evaluation axis that’s independent of model size, topic, or family.
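To make the experimental axis concrete, here is a minimal Python sketch (not code from the paper) of how the same few-shot prompt changes when only the separator between in-context examples is swapped. The helper name, example format, and delimiter set below are illustrative assumptions.

```python
# Hypothetical sketch: the only variable is the string used to join
# in-context examples before the test question is appended.
DELIMITERS = {
    "newline": "\n",
    "comma": ", ",
    "semicolon": "; ",
    "hashtag": " # ",
}

def build_prompt(examples, question, delimiter):
    """Join few-shot examples with a single separator, then append the query."""
    shots = delimiter.join(f"Q: {q} A: {a}" for q, a in examples)
    return f"{shots}{delimiter}Q: {question} A:"

examples = [("2+2?", "4"), ("Capital of France?", "Paris")]
for name, sep in DELIMITERS.items():
    print(f"--- {name} ---")
    print(build_prompt(examples, "3+5?", sep))
```

Scoring the same benchmark questions over each variant is what surfaces the reported swings in accuracy.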
Technically, the paper links the effect to attention patterns: “good” delimiters steer attention heads toward task‑relevant tokens, while poor choices scatter attention and degrade accuracy. The authors show this brittleness persists at scale and propose mitigations, the most effective being to explicitly name the chosen delimiter inside the prompt, along with practical recommendations for which separators tend to work best. The takeaway for the AI/ML community is clear: prompt formatting is a major confounder in LLM benchmarking and real‑world use, so evaluations and leaderboards should standardize and report delimiter choices (or test multiple separators) to avoid misleading conclusions and enable reproducible comparisons.
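A minimal sketch of the mitigation, assuming it amounts to stating the separator in plain language at the top of the prompt; the function and wording are illustrative, not the paper's exact template.

```python
# Hypothetical sketch of the mitigation: declare the delimiter explicitly
# so the model does not have to infer where one example ends and the next begins.
def build_prompt_with_declared_delimiter(examples, question, delimiter, delimiter_name):
    header = f"The following examples are separated by {delimiter_name}."
    shots = delimiter.join(f"Q: {q} A: {a}" for q, a in examples)
    return f"{header}\n{shots}{delimiter}Q: {question} A:"

print(build_prompt_with_declared_delimiter(
    [("2+2?", "4"), ("Capital of France?", "Paris")],
    "3+5?",
    "; ",
    "a semicolon",
))
```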