Giving Your AI a Job Interview (www.oneusefulthing.org)

🤖 AI Summary
AI benchmarking as we know it is breaking down. The key points:

- Public benchmark datasets, like MMLU-Pro, suffer from training-set leakage, unclear measurement goals, calibration problems, and noisy or unattainable top scores.
- Aggregate trends (e.g., ARC-AGI, METR Long Tasks) do show rising “smarts,” but they hide a jagged frontier: models vary a lot by domain (math, coding, and reasoning versus writing, empathy, and business advice).
- Informal “vibe” tests (pelicans on bikes, otters on planes, creative-writing prompts) reveal qualitative differences; examples cited include Claude 4.5’s strong prose, GPT-5’s stylistic riskiness, and coherence issues in Gemini and Kimi. But such tests are idiosyncratic and unreliable for procurement or safety decisions.
- The practical fix is to “interview” models: design realistic, domain-specific tasks, run them multiple times to sample their stochastic outputs, and have blinded expert graders evaluate the results (a rough sketch of this loop follows the list). OpenAI’s GDPval is offered as a template: experts authored multi-hour tasks, humans and models performed them, and third-party experts graded the results blind.
- This approach surfaces where models outperform humans, where they fail, and how they systematically bias judgments (e.g., differences in risk appetite, as in the “GuacaDrone” example).
- For organizations, the implication is clear: pick models based on head-to-head, use-case-specific evaluation (including judgmental consistency), repeat evaluations as models evolve, and don’t rely solely on public benchmarks or gut impressions.
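A minimal sketch of that “interview” loop, under stated assumptions: `call_model` is a hypothetical stand-in for whatever SDK you use to query each model, and names like `collect_runs`, `runs_per_task`, and `blind_for_grading` are illustrative, not from the article.

```python
# Sketch: run realistic tasks several times per model, then anonymize and
# shuffle the transcripts so expert graders evaluate them blind.
import random
from typing import Callable, Dict, List, Tuple


def collect_runs(
    models: Dict[str, Callable[[str], str]],  # model name -> hypothetical query function
    tasks: List[str],                         # realistic, domain-specific task prompts
    runs_per_task: int = 5,                   # repeat each task to sample stochastic outputs
) -> List[dict]:
    """Run every task several times against every model and keep the transcripts."""
    transcripts = []
    for model_name, call_model in models.items():
        for task in tasks:
            for run in range(runs_per_task):
                transcripts.append({
                    "model": model_name,
                    "task": task,
                    "run": run,
                    "output": call_model(task),
                })
    return transcripts


def blind_for_grading(
    transcripts: List[dict], seed: int = 0
) -> Tuple[List[dict], Dict[str, dict]]:
    """Shuffle and strip model identity so graders cannot tell which model wrote what."""
    rng = random.Random(seed)
    shuffled = transcripts[:]
    rng.shuffle(shuffled)
    blinded, answer_key = [], {}
    for i, t in enumerate(shuffled):
        sample_id = f"sample-{i:03d}"
        blinded.append({"id": sample_id, "task": t["task"], "output": t["output"]})
        answer_key[sample_id] = {"model": t["model"], "run": t["run"]}  # held back until grading is done
    return blinded, answer_key
```

Graders score only the blinded samples; the answer key is joined back in afterward to compare models per task, which is roughly the structure GDPval-style evaluations follow.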