🤖 AI Summary
Zhipu AI's new model, GLM-5, has emerged at the forefront of coding benchmarks, notably posting a competitive score on Terminal-Bench. An independent assessment, however, revealed significant gaps in reliability. When tested on KIRO, a novel NP-hard optimization problem, and on Terminal-Bench, GLM-5 produced a high rate of invalid output: 30% of KIRO trials returned invalid results, and nearly a quarter of Terminal-Bench tasks timed out. Whatever its headline benchmark numbers, GLM-5's actual effectiveness under real-world constraints proved much lower than advertised.
This testing carries important implications for how the AI/ML community assesses models: raw performance metrics can mislead when variability and real-world testing conditions are ignored. The stark gap between the independent result and Zhipu AI's reported score, 40.4% versus 56.2%, underscores the need for evaluations that reflect actual user experience rather than ideal scenarios. While GLM-5 shows genuine problem-solving ability across a range of tasks, its reliability and execution depth need significant improvement before it can be considered ready for robust, real-world applications.