What happens if AI labs train for pelicans riding bicycles? (simonwillison.net)

🤖 AI Summary
The author, an AI researcher and benchmarker who regularly publishes SVGs of "pelicans riding bicycles," addresses recurring skepticism that labs might be training specifically to ace that quirky test. The rebuttal: targeted tuning would be easy to spot, because a model that suddenly nailed pelicans on bikes would still fail on nearby variants (other creatures, different vehicles), revealing the overfitting. Even today the best models produce mostly laughable results; getting the bicycle geometry right and a clearly pedaling pelican in clean SVG vector form remains rare (though the author praises one GPT-5 example). OpenAI's Aidan McLaughlin is cited denying any hill-climb tuning for the benchmark ("we do not hill climb on svg art"), and the author notes that training on the published collection would likely backfire, producing oddly distorted pelicans.

The piece is significant because it highlights practical guardrails for benchmark design and evaluation: simple, narrow tasks can expose overfitting, and cross-variant testing is an effective way to detect targeted optimization or data contamination. Technically, the benchmark stresses vector-graphics generation (SVG) and fine-grained geometric consistency (bike parts, posture, pedaling), areas where current generative models still struggle. The author's playful "long game" of enticing labs to invest in cheating the benchmark until one accidentally produces a great SVG underscores how small, hard-to-solve challenges can surface real capability gaps and robustness issues in model development.
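The cross-variant check described above can be sketched in a few lines. This is an illustrative harness, not the author's actual methodology: the subject/vehicle lists, the `looks_overfit` helper, and its scoring threshold are all hypothetical, and it assumes you already have some 0-to-1 quality score for each generated SVG.

```python
from itertools import product

# Hypothetical nearby variants of the original "pelican riding a bicycle" task.
SUBJECTS = ["pelican", "flamingo", "otter", "raccoon"]
VEHICLES = ["bicycle", "unicycle", "skateboard", "canoe"]


def variant_prompts():
    """Cross the benchmark subject and vehicle with nearby alternatives."""
    return [f"Generate an SVG of a {s} riding a {v}"
            for s, v in product(SUBJECTS, VEHICLES)]


def looks_overfit(scores, target=("pelican", "bicycle"), gap=0.4):
    """Flag possible targeted tuning: the exact benchmark prompt scores far
    above the mean of its near neighbours (same subject OR same vehicle).

    `scores` maps (subject, vehicle) -> quality score in [0, 1];
    `gap` is an arbitrary threshold chosen for illustration.
    """
    neighbours = [v for k, v in scores.items()
                  if k != target and (k[0] == target[0] or k[1] == target[1])]
    mean_neighbour = sum(neighbours) / len(neighbours)
    return scores[target] - mean_neighbour > gap
```

A model that genuinely got better should lift scores across the whole grid; a suspicious run is one where only the exact benchmark cell jumps while its neighbours stay flat.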