We Reached 74.8% on terminal-bench with Terminus-KIRA (krafton-ai.github.io)

0 points 23 hours ago | visit original

🤖 AI Summary

KRAFTON AI has released Terminus-KIRA, an enhanced agent harness for terminal-bench, a benchmark that evaluates AI agents on real tasks in a terminal environment. Work with the original harness, Terminus 2, showed that existing models, which are trained primarily to assist humans, struggled with tasks requiring full autonomy: they often submitted partial results and misjudged whether a task was actually complete. Terminus-KIRA addresses these shortcomings in three ways: it gives agents clear guidelines for completing tasks without human intervention, strengthens their self-evaluation before submission, and adds specialized tools for handling multimedia content. The changes also adjust the agent's interaction model to reduce ambiguous task interpretations and optimize the tmux interface to minimize wasted time. Early results show a gain of 10 percentage points. KRAFTON AI has open-sourced Terminus-KIRA to support further work and predicts that scores above 80% on terminal-bench are achievable. The results suggest that harness design, and not only the underlying model, determines how independently an agent can operate in complex task environments.
