🤖 AI Summary
Recent evaluations showed GPT-5.1 Codex scoring 6.5 percentage points below GPT-5 Codex on Terminal-Bench, primarily due to a higher rate of timeout errors. Using their analysis tool, Docent, researchers found that GPT-5.1 Codex times out markedly more often than GPT-5 Codex: on 49% of tasks versus 29%. The gap is concentrated in long-running tasks such as package installations and training runs, suggesting that GPT-5.1 Codex may pursue strategies that would succeed given more time, but that the evaluation environment cuts short, skewing the headline score.
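To make the comparison concrete, here is a minimal sketch of how such a timeout rate could be computed, assuming each agent transcript has already been reduced to a simple per-task record. The `TaskResult` shape and field names are hypothetical illustrations, not Docent's or Terminal-Bench's actual API.

```python
from dataclasses import dataclass

# Hypothetical per-task record; field names are illustrative,
# not part of Docent or Terminal-Bench.
@dataclass
class TaskResult:
    task_id: str
    passed: bool      # did the task's verifier pass?
    timed_out: bool   # did the agent hit the evaluation time limit?

def timeout_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent timed out."""
    return sum(r.timed_out for r in results) / len(results)

# Example: with records matching the reported numbers,
# timeout_rate(gpt51_results) ~= 0.49 and timeout_rate(gpt5_results) ~= 0.29
```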
Notably, after filtering the dataset to exclude tasks where either model timed out (a step sketched below), GPT-5.1 Codex actually outperformed GPT-5 Codex by about 7 percentage points. This reversal underscores that benchmark scores reflect operational constraints, here time limits, as much as model capability. The analysis clarifies the strengths and weaknesses of each model and shows how a closer reading of evaluation metrics and conditions can change conclusions. With tools like Docent, the community can run such investigations, producing more accurate assessments of model capabilities and better-informed development decisions.
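Continuing with the hypothetical `TaskResult` records above, the filtering step might look like this: keep only tasks where neither model timed out, then recompute pass rates on that shared subset.

```python
def filtered_pass_rates(
    a: dict[str, TaskResult],   # task_id -> result for model A
    b: dict[str, TaskResult],   # task_id -> result for model B
) -> tuple[float, float]:
    """Pass rates of both models over tasks where neither timed out."""
    kept = [
        t for t in a.keys() & b.keys()
        if not a[t].timed_out and not b[t].timed_out
    ]
    def rate(results: dict[str, TaskResult]) -> float:
        return sum(results[t].passed for t in kept) / len(kept)
    return rate(a), rate(b)

# On the filtered subset, the analysis reports GPT-5.1 Codex ahead
# of GPT-5 Codex by roughly 7 percentage points.
```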