One ruler to measure them all: Benchmarking multilingual long-context LLMs (arxiv.org)

🤖 AI Summary
Researchers released ONERULER, a multilingual benchmark that extends the English-only RULER suite to evaluate long-context language models across 26 languages. ONERULER comprises seven synthetic tasks probing retrieval and aggregation, including new “needle-in-a-haystack” variants in which the needle may be absent; the English task specifications were translated into 25 languages by native speakers. The benchmark covers extreme context lengths (tested from 8K up to 128K tokens) and targets both open-weight and closed-source LLMs, with code, data, and demos provided to the community.

Experiments with ONERULER surface several important findings. As context length grows, performance gaps between high- and low-resource languages widen markedly, and, surprisingly, English ranks only 6th on long-context tasks while Polish ranks first. Models also struggle with answer-absence calibration: some (notably OpenAI’s o3-mini-high) frequently predict “no answer” even when answers exist in high-resource languages. Cross-lingual setups, with instructions in one language and context in another, can change accuracy by up to roughly 20%. These results highlight that long-context capability and multilingual robustness do not co-evolve automatically, underscoring the need for multilingual and cross-lingual long-context training and evaluation pipelines; ONERULER aims to be a shared yardstick to drive that research.
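To make the absent-needle variant concrete, here is a minimal sketch of how such a synthetic item could be constructed and scored. The helper names, prompt wording, and exact-match scoring rule are illustrative assumptions for this summary, not the benchmark's actual implementation.

```python
import random

def build_niah_item(haystack_sentences, needle, include_needle=True, seed=0):
    """Embed (or deliberately omit) a needle sentence in filler text."""
    rng = random.Random(seed)
    sentences = list(haystack_sentences)
    if include_needle:
        # Insert the needle at a random position in the haystack.
        sentences.insert(rng.randrange(len(sentences) + 1), needle)
    question = (
        "What is the special magic number mentioned in the text? "
        "If no such number appears, answer 'none'."
    )
    # Gold answer is the number itself, or 'none' when the needle is absent.
    gold = needle.rstrip(".").split()[-1] if include_needle else "none"
    return {"context": " ".join(sentences), "question": question, "answer": gold}

def score(prediction, gold):
    """Exact match; 'none' is only correct when the needle was truly absent."""
    return float(prediction.strip().lower() == gold.lower())

# Usage: mixing present- and absent-needle items probes answer-absence calibration.
filler = [f"Filler sentence number {i}." for i in range(1000)]
item = build_niah_item(filler, "The special magic number is 4827.", include_needle=False)
print(item["answer"])  # -> "none"
```

A model that is well calibrated should say “none” only on the absent-needle items; the finding about o3-mini-high corresponds to over-predicting “none” on items where the needle is actually present.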