🤖 AI Summary
The AI frontier advanced on several fronts at once: OpenAI released GPT-5.1 and GPT-5.1‑Codex‑Max, Anthropic pushed Opus 4.5, and Google shipped Gemini 3 Pro (with Deep Think coming). The key takeaway isn’t a single winner — benchmarks touted in release posts (e.g., SWE‑bench) are misleading because they measure narrow tasks, not real-world coding skill. SWE‑bench, for example, sources 46% of its tasks from one Django repo, is 87% bugfixes, and contains 5–10% invalid tasks, so a small lead on that benchmark doesn’t translate to broad capability. The author argues that instruction following and agentic behavior (the model’s ability to act autonomously across steps) matter far more for practical coding than single‑metric scores.
Based on hands‑on experience, the models have distinct strengths: GPT‑5.1 excels at instruction following and general problem solving; GPT‑5.1‑Codex‑Max is best for large, well‑defined coding jobs; Opus 4.5 shines at inferring implicit intent in ambiguous “vibe‑coding” but lacks deep reasoning; Gemini 3 Pro offers the strongest raw math and reasoning but weaker instruction following. Practical advice: try models on real tasks, choose by use case, and combine them — e.g., use GPT‑5.1 Pro to draft a detailed implementation plan, then feed it to GPT‑5.1‑Codex‑Max for implementation. Premium “Pro”/“Deep Think” tiers (roughly $200–$250/month) can help with hard reasoning but are slower and pricey.