Are we in a GPT-4-style leap that evals can't see? (martinalderson.com)

🤖 AI Summary
We may be in another “GPT‑4” style inflection that standard chat evals aren’t capturing. The author argues ad‑hoc chat tests have plateaued: people now value responsiveness and iterative utility as much as raw answer quality, so simple Q&A benchmarks miss real progress.

Two recent releases illustrate this gap. Google’s Gemini 3 Pro shows a surprising leap in design capability — extracting design systems from CSS, following screenshots, and producing high‑fidelity HTML prototypes and landing pages via its canvas tool. The output feels like the work of a competent UX/UI designer (fewer “emoji‑chic” defaults), enabling rapid concept→prototype workflows that materially speed up product and marketing experiments. Anthropic’s Opus 4.5 reveals complementary advances on the engineering/agent side: it stays on task far longer without derailing, reliably executing complex workflows (e.g., building dashboards over massive ClickHouse datasets) with far less babysitting and ~95% usable output in the author’s experience. Together, these changes expand LLM utility across a greater share of the product lifecycle.

The takeaway: current benchmarks — math, science, isolated pass/fail coding tests — undercount gains in “taste” and iterative agent robustness. New evaluations (designer panels, long‑run interactive agent stress tests) are needed to surface progress that could have outsized economic impact.
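The article doesn’t prescribe a harness, but as a rough illustration of what a “long‑run agent stress test” could measure — steps completed before derailing, rather than a single pass/fail — here is a minimal sketch. Everything here (the `Step` type, the `run_long_horizon_eval` helper, the toy agent) is hypothetical and not from the source.

```python
# Hypothetical sketch of a long-horizon agent eval: feed workflow steps one at
# a time, carry the full transcript, and score how far the agent gets before
# it first fails a check, instead of grading one isolated answer.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    instruction: str                      # what the agent is asked to do next
    check: Callable[[str], bool]          # did the output satisfy this step?


def run_long_horizon_eval(agent: Callable[[List[str]], str],
                          steps: List[Step]) -> dict:
    """Run steps sequentially, stopping at the first failed check."""
    transcript: List[str] = []
    completed = 0
    for step in steps:
        transcript.append(step.instruction)
        output = agent(transcript)        # agent sees the whole history so far
        transcript.append(output)
        if not step.check(output):
            break                         # the agent derailed at this step
        completed += 1
    return {"steps_completed": completed, "finished": completed == len(steps)}


if __name__ == "__main__":
    # Toy stand-in agent; a real harness would call an LLM API here.
    toy_agent = lambda history: f"done: {history[-1]}"
    workflow = [
        Step("create table", lambda out: "create table" in out),
        Step("load data", lambda out: "load data" in out),
        Step("build dashboard", lambda out: "build dashboard" in out),
    ]
    print(run_long_horizon_eval(toy_agent, workflow))
    # -> {'steps_completed': 3, 'finished': True}
```

The point of the metric is the one the summary makes: a model that completes 9 of 10 steps unattended is far more useful than one that completes 5, yet both can look identical on single-turn pass/fail benchmarks.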