🤖 AI Summary
Scale AI released SEAL Showdown, a live leaderboard that ranks LLMs by collecting in-situ human preference data during real chat sessions. Rather than relying on static prompts, users periodically compare the response they're seeing (in‑flow) with one from a randomly sampled opponent (out‑of‑flow); responses stream side by side with model identities hidden, and users choose left, right, both good, or both bad. Rankings are produced with a Bradley–Terry model (logistic pairwise strengths), bootstrapped for confidence intervals and converted to Elo (anchor: Llama4 Maverick, β=1000, scale=400). Crucially, Showdown augments the BT model with explicit style controls (token‑count difference, Markdown‑formatting difference, loading‑time difference) via additional logistic regressors (γ⊤ϕ), and it uses a sampling strategy that prioritizes under‑evaluated, high‑variance pairs to balance comparisons. Preliminary standings place GPT‑5 Chat first and Claude Opus 4.1 second.
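The ranking pipeline as described, Bradley–Terry strengths plus style regressors converted to Elo against an anchor, fits in a short script. The sketch below is not Scale's implementation: the battle data, feature scaling, and ridge penalty are assumptions; only the functional form σ(θ_w − θ_l + γ⊤Δϕ) and the Elo anchoring (Llama4 Maverick at 1000, 400-point scale) come from the summary.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative pairwise data; real Showdown battles and feature definitions differ.
# Each record: (winner, loser, style-feature difference winner - loser), with
# assumed features (token-count diff / 1000, Markdown-use diff, loading-time diff in s).
models = ["gpt-5-chat", "claude-opus-4.1", "llama4-maverick"]
idx = {m: i for i, m in enumerate(models)}
battles = [
    ("gpt-5-chat",      "llama4-maverick",  np.array([ 0.5, 1.0, -0.2])),
    ("claude-opus-4.1", "llama4-maverick",  np.array([ 0.3, 0.0,  0.1])),
    ("gpt-5-chat",      "claude-opus-4.1",  np.array([-0.2, 1.0,  0.0])),
    ("llama4-maverick", "claude-opus-4.1",  np.array([ 0.8, 0.0,  0.3])),
]
n_models, n_style = len(models), 3

def neg_log_likelihood(params):
    theta, gamma = params[:n_models], params[n_models:]
    nll = 0.0
    for winner, loser, dphi in battles:
        # Style-controlled Bradley-Terry: P(w beats l) = sigmoid(theta_w - theta_l + gamma . dphi)
        logit = theta[idx[winner]] - theta[idx[loser]] + gamma @ dphi
        nll += np.log1p(np.exp(-logit))          # -log sigmoid(logit)
    return nll + 0.01 * np.sum(params ** 2)      # small ridge term for identifiability

res = minimize(neg_log_likelihood, np.zeros(n_models + n_style), method="L-BFGS-B")
theta_hat, gamma_hat = res.x[:n_models], res.x[n_models:]

# Elo conversion: anchor Llama4 Maverick at 1000 on a 400-point logistic scale.
elo = 1000 + (400 / np.log(10)) * (theta_hat - theta_hat[idx["llama4-maverick"]])
for m in sorted(models, key=lambda m: -elo[idx[m]]):
    print(f"{m:>18s}  Elo {elo[idx[m]]:7.1f}")
print("style coefficients (length, markdown, latency):", np.round(gamma_hat, 3))
```

Bootstrapped confidence intervals then follow by resampling battles with replacement and refitting; the spread of the resulting Elo values across resamples gives intervals of the kind the leaderboard reports.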
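The summary only says the sampler "prioritizes under-evaluated, high-variance pairs" without giving the rule. One plausible reading, sketched here with hypothetical weights and tallies, is to draw the next out-of-flow opponent with probability proportional to the uncertainty of the pair's estimated win rate, which is largest for pairs with few comparisons or close outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_weight(wins: int, losses: int) -> float:
    """Hypothetical priority score: estimated win-rate variance of the pair,
    high when the pair is under-evaluated or its outcome is still uncertain."""
    n = wins + losses
    p = (wins + 1) / (n + 2)          # Laplace-smoothed win rate
    return p * (1 - p) / (n + 1)      # variance of the win-rate estimate

# counts[(a, b)] = (wins for a, wins for b) -- illustrative tallies only
counts = {
    ("gpt-5-chat", "claude-opus-4.1"): (120, 95),
    ("gpt-5-chat", "llama4-maverick"): (40, 10),
    ("claude-opus-4.1", "llama4-maverick"): (3, 2),   # under-evaluated pair
}

pairs = list(counts)
weights = np.array([pair_weight(*counts[p]) for p in pairs])
probs = weights / weights.sum()
next_pair = pairs[rng.choice(len(pairs), p=probs)]
print("sampling probabilities:", dict(zip(pairs, np.round(probs, 3))))
print("next pair to evaluate:", next_pair)
```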
The report’s key findings and implications matter for evaluation and product teams: stylistic features strongly bias human judgments (e.g., a response 2,000 tokens longer can raise win rate from ~20% to ~67%), so controlling for presentation is essential to isolate capability. Models with extra “thinking” (test‑time compute) didn’t consistently beat non‑thinking variants on everyday conversational tasks, suggesting diminishing returns for prolonged internal reasoning in casual use. By capturing live, contextual preferences and correcting for presentation confounders, SEAL Showdown complements static benchmarks and shifts focus toward real‑world UX factors (verbosity, formatting, latency) when assessing and deploying LLMs.
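To gauge how large the length confound is on the Elo scale, take the quoted figures at face value: moving a win rate from ~20% to ~67% is a log-odds shift of about 2.1, worth roughly 360 points on a 400-point Elo scale. The back-of-the-envelope calculation below assumes a plain logistic link; the exact per-token effect in the report may differ.

```python
import numpy as np

def logit(p: float) -> float:
    return np.log(p / (1 - p))

# Quoted figures: a response ~2,000 tokens longer moves win rate ~20% -> ~67%.
baseline, inflated = 0.20, 0.67
delta_logit = logit(inflated) - logit(baseline)        # ~2.09 on the log-odds scale
per_token = delta_logit / 2000                         # ~0.001 logits per extra token
elo_equivalent = (400 / np.log(10)) * delta_logit      # ~360 Elo points of pure style

print(f"log-odds shift: {delta_logit:.2f}")
print(f"implied per-token style coefficient: {per_token:.4f}")
print(f"Elo-equivalent bias: {elo_equivalent:.0f} points")
```

A confound of that size would dwarf the gaps between frontier models, which is the motivation the report gives for fitting and removing the γ⊤ϕ style term before ranking.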