Microsoft scientists find most AI models struggle with long-running tasks (www.techradar.com)

🤖 AI Summary
Microsoft researchers report that most current AI models, particularly large language models (LLMs), struggle with long-running tasks, introducing errors that compound over successive steps. Their study introduces the DELEGATE-52 benchmark, which evaluates model performance across domains such as coding, science, and accounting, using real documents of approximately 15,000 tokens. Models such as Gemini 3.1 Pro and GPT-5.4 corrupted about 25% of document content during extended workflows, underscoring that current models are unreliable for long, autonomous processes. Models performed better on highly programmatic tasks, particularly those in Python, than on natural-language and creative workflows, which posed substantial challenges. Gemini 3.1 Pro achieved the highest DELEGATE-52 score, 80.9% after 20 interactions. The study points to a need for improvement in how models handle multi-step tasks and manage errors, and the benchmark both measures existing capabilities and identifies areas for future development toward effective agentic AI deployment.