Poetiq achieves 75% at under $8 / problem using GPT-5.2 X-High on ARC-AGI-2 (poetiq.ai)

0 points 228 days ago | visit original

🤖 AI Summary

Poetiq announced new state-of-the-art results on the ARC-AGI benchmark, establishing fresh Pareto frontiers for both ARC-AGI-1 and ARC-AGI-2 that trade off accuracy and cost more effectively than prior systems. Using recent models (GPT-5.1 and Gemini 3) in mixed configurations, Poetiq’s meta-system beats expensive “deep think” runs and also produces extremely low-cost points, e.g., a GPT-OSS-120B-based variant that runs for under $0.01 per problem and a Grok-4-Fast configuration that is both cheaper and more accurate than the model’s reported baseline. Their ARC-AGI-2 score even exceeds that of the average human test-taker (~60%). All code for the configurations is being open-sourced.

Technically, Poetiq is a model-agnostic, recursive meta-system that programmatically composes multiple LLM calls, decides when to generate code, self-audits progress, and adaptively stops to minimize waste. Key wins come from discovering simple multi-call strategies (notably with repeated Gemini-3 calls) that achieve Pareto-optimal solutions across operating regimes and often improve accuracy while reducing cost by making fewer than two requests on average. Importantly, the system was adapted using only open-source models, before the Gemini 3 and GPT-5.1 releases, and the adaptations generalize across model families (GPT, Claude, Gemini, Grok, GPT-OSS), suggesting strong transfer and practical implications for cost-efficient reasoning pipelines in research and production.
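The summary above describes a loop that issues model calls one at a time, audits each result, and stops as soon as it is satisfied. A minimal Python sketch of that general idea follows; it is not Poetiq's code. The names `audit`, `solve_adaptively`, and `stub_propose` are hypothetical, the "model" is a stub that returns hard-coded candidate programs, and the audit rule (a candidate must reproduce every training example) is an assumption made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]
Example = Tuple[Grid, Grid]


def audit(candidate: Program, examples: List[Example]) -> bool:
    """Hypothetical self-audit: accept a candidate only if it reproduces
    every training example exactly and never raises."""
    for grid_in, expected in examples:
        try:
            if candidate(grid_in) != expected:
                return False
        except Exception:
            return False
    return True


def solve_adaptively(
    propose: Callable[[int], Program],
    examples: List[Example],
    max_calls: int = 4,
) -> Tuple[Optional[Program], int]:
    """Request candidates one at a time and stop at the first one that
    passes the audit. Returns (program or None, number of calls made)."""
    for attempt in range(max_calls):
        candidate = propose(attempt)  # stands in for one LLM request
        if audit(candidate, examples):
            return candidate, attempt + 1  # early stop: no further calls
    return None, max_calls


def stub_propose(attempt: int) -> Program:
    """Stand-in for a model call (assumption): the first guess is the
    identity transform, later guesses mirror each row."""
    if attempt == 0:
        return lambda grid: grid
    return lambda grid: [row[::-1] for row in grid]


if __name__ == "__main__":
    train = [([[1, 0], [0, 2]], [[0, 1], [2, 0]])]
    program, calls = solve_adaptively(stub_propose, train)
    print("calls used:", calls)
```

Because the loop returns on the first passing candidate, easy problems cost a single request and only harder ones consume the full budget, which is one way an average of fewer than two requests per problem could arise.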
