🤖 AI Summary
Opus 4.5 was evaluated inside a real RAG (retrieval-augmented generation) pipeline, using the same retrieval, context, and evaluation flow previously used to compare Gemini and GPT 5.1. Rather than relying on benchmark scores, the test focused on behavior under noisy retrieval: how each model decides which retrieved snippets matter, avoids dumping context wholesale, and produces concise, grounded answers. Across five behavior probes (verbosity, handling unknowns, topical relevance, process explanation, and fidelity to retrieved text), Opus consistently sat between Gemini's tendency to dump large chunks and GPT 5.1's more expressive but sometimes tangential responses.
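For readers who want to picture the setup, here is a minimal sketch of what such a probe harness could look like. The probe names and prompts are assumptions for illustration, and `ask` stands in for whichever model SDK the pipeline actually calls; this is not the author's harness, only the general pattern of feeding identical retrieval output to each model.

```python
from typing import Callable, Dict, List

# Hypothetical probes mirroring the five behaviors described above;
# the prompt wording is illustrative, not taken from the original evaluation.
PROBES: Dict[str, str] = {
    "verbosity": "Answer in one short paragraph using only the retrieved text.",
    "unknowns": "If the retrieved text does not cover the question, say you don't know.",
    "relevance": "Only some snippets are on-topic; answer using just the relevant ones.",
    "process": "Explain the procedure described in the retrieved text, step by step.",
    "fidelity": "Quote or closely paraphrase the retrieved text; add no outside facts.",
}

def build_prompt(task: str, snippets: List[str]) -> str:
    """Assemble the same fixed context block for every model under test."""
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"Retrieved snippets:\n{context}\n\nTask: {task}"

def run_probes(
    models: Dict[str, Callable[[str], str]],  # model name -> function returning its answer
    snippets: List[str],
) -> Dict[str, Dict[str, str]]:
    """Feed identical retrieved context to each model and collect one answer per probe."""
    results: Dict[str, Dict[str, str]] = {}
    for model_name, ask in models.items():
        results[model_name] = {
            probe: ask(build_prompt(task, snippets)) for probe, task in PROBES.items()
        }
    return results
```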
Technically, Opus is more structured than Gemini and clearer than GPT 5.1: it avoids uncontrolled text-dumping, organizes mixed-topic chunks coherently, and delivers the cleanest multi-step reasoning, most notably on process explanations. Like the others, though, it still injects small side-notes and unneeded citations when refusing or grounding an answer, and it adds contextual framing beyond strict extraction. For RAG pipelines that need selective extraction plus readable, reliable reasoning, Opus 4.5 offers the best practical balance today, though teams that need verbatim, tightly grounded extracts should still watch for its tendency to add helpful-but-unasked-for context.
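If verbatim grounding matters, a crude post-hoc filter can surface that added framing. The sketch below is an assumption on my part rather than part of the original evaluation: it flags answer sentences whose word overlap with the retrieved snippets falls below an arbitrary threshold, which is enough to catch obvious editorializing but not subtle paraphrase drift.

```python
import re
from typing import List

def _tokens(text: str) -> set:
    """Lowercased alphanumeric word set, used as a rough overlap signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def ungrounded_sentences(answer: str, snippets: List[str], threshold: float = 0.6) -> List[str]:
    """Return answer sentences whose overlap with the retrieved snippets is below
    `threshold`, i.e. likely added framing rather than extraction. The 0.6 cutoff
    is an illustrative default, not a value from the original test."""
    source = _tokens(" ".join(snippets))
    flagged: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & source) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged
```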