Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers (arxiv.org)

🤖 AI Summary
A recent study challenges a common assumption about LLM agents: that the more capable the model, the simpler its optimal harness. In experiments with six models spanning several capability tiers, the expected monotone inverse relationship between model capability and optimal harness complexity did not hold. For Gemini 2.5 Flash, greater harness verbosity significantly reduced reliability. The Qwen 3.5-122B reasoning model, by contrast, performed best under a strict harness, reaching a validation task success rate (VTSR) of 91.7%, which contradicts the expectation that higher-capability models need less structure.

The study also introduces a failure taxonomy that classifies errors by model capability, and it indicates that higher-capability models often fail in different ways than their lower-tier counterparts. Taken together, the findings suggest that sensitivity to structural guidance is more complex than previously assumed, and that harness selection should be tier-aware rather than following a single rule applied across all deployments.
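To make the "monotone inverse relationship" concrete, here is a minimal sketch of how one might test it against per-tier results. Everything in it is an assumption for illustration: the tier names, the harness names and their complexity ranks, the function names, and the success rates are placeholders, not figures or methods from the paper.

```python
from typing import Mapping, Sequence


def best_harness(success: Mapping[str, float]) -> str:
    """Return the harness with the highest success rate (first one wins ties)."""
    return max(success, key=lambda name: success[name])


def is_monotone_inverse(
    tiers: Sequence[str],
    complexity: Mapping[str, int],
    results: Mapping[str, Mapping[str, float]],
) -> bool:
    """Check that the best harness never gets more complex as tiers rise.

    `tiers` must be ordered from lowest to highest capability.
    """
    levels = [complexity[best_harness(results[tier])] for tier in tiers]
    return all(lower >= higher for lower, higher in zip(levels, levels[1:]))


# Hypothetical complexity ranks: higher number means a stricter harness.
COMPLEXITY = {"minimal": 0, "standard": 1, "strict": 2}
TIERS = ["low", "mid", "high"]

# Placeholder success rates, invented for this example only.
NON_MONOTONE = {
    "low": {"minimal": 0.40, "standard": 0.55, "strict": 0.60},
    "mid": {"minimal": 0.80, "standard": 0.70, "strict": 0.50},
    "high": {"minimal": 0.75, "standard": 0.80, "strict": 0.90},
}

print(is_monotone_inverse(TIERS, COMPLEXITY, NON_MONOTONE))
```

In the placeholder data the best harness goes strict, minimal, strict as the tier rises, so the check fails. That is the shape of result the summary describes: a single "simpler harness for stronger models" rule would pick the wrong harness for at least one tier, which is the motivation for choosing a harness per tier.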