Wall Street Experts Tested GPT-5 and Claude. Both Struggled – Even with Excel (surgehq.ai)

🤖 AI Summary
Wall Street veterans ran 200+ finance scenarios across seven subdomains (Basel capital models, trading/execution, PowerPoint and Excel workflows) to benchmark frontier LLMs. GPT-5 came out on top, preferred on 47% of tasks versus Claude Sonnet 4.5 (26%) and Gemini 2.5 Pro (24%), and beat Sonnet and Gemini in head-to-head votes (≈59% and ≈62% win rates). Still, the panel rated over 70% of model outputs as mediocre to bad, and all three models showed systematic failure modes that make them risky for production finance work.

The study identifies six recurring loss patterns: theory-only reasoning that ignores real-world constraints (e.g., misapplied netting in Basel scenarios), breakdowns in multi-step workflows, poor domain calibration and professional "gut sense," fragile file handling and output fidelity (broken formulas, unreadable downloads), omission of implicit professional conventions, and framework misalignment. Case studies illustrate these failures: GPT-5 produced the only complete PowerPoint deck and a correct two-year forecast but omitted required risk-mitigation commentary, stripped Excel formatting, mishandled percent signs, and produced flaky downloads; Sonnet generated partial slides and incorrect NSE values; Gemini often failed to ingest files or fabricated calculations.

For the AI/ML community, the takeaway is that LLM fluency is not domain competence: robust tool integration, verified numeric reasoning, stateful workflow handling, domain-specific calibration, and stronger guardrails and verification are needed before these models can safely take on expert financial workflows.
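As a concrete illustration of the kind of post-generation verification the summary calls for, here is a minimal sketch (my own, not from the study) that checks a model-produced Excel file for the fidelity failures described above: whether the workbook opens at all, whether key cells are still formulas rather than pasted literals, and whether percent formatting survived. The file name, cell references, and expected formulas are hypothetical placeholders; only the openpyxl calls are real library APIs.

```python
# Sketch of an output-fidelity check on a model-generated workbook.
# Assumes: openpyxl installed, output saved at "model_output.xlsx",
# and the cells/formats to verify known in advance from the task spec.
from openpyxl import load_workbook

EXPECTED_FORMULAS = {"B10": "=SUM(B2:B9)"}   # cells that must remain formulas
EXPECTED_PERCENT_CELLS = ["C2", "C3", "C4"]  # cells that must keep % formatting

def check_workbook(path: str) -> list[str]:
    """Return a list of fidelity problems found in the generated workbook."""
    problems = []
    try:
        wb = load_workbook(path)             # data_only=False keeps formula strings
    except Exception as exc:                 # unreadable/corrupt download
        return [f"workbook failed to open: {exc}"]

    ws = wb.active
    for ref, formula in EXPECTED_FORMULAS.items():
        cell = ws[ref]
        if cell.data_type != "f":            # formula replaced by a hardcoded value
            problems.append(f"{ref}: formula replaced by literal {cell.value!r}")
        elif cell.value != formula:
            problems.append(f"{ref}: expected {formula}, got {cell.value}")

    for ref in EXPECTED_PERCENT_CELLS:
        if "%" not in ws[ref].number_format: # percent formatting stripped
            problems.append(f"{ref}: percent format lost ({ws[ref].number_format})")

    return problems

if __name__ == "__main__":
    for issue in check_workbook("model_output.xlsx"):
        print("FIDELITY ISSUE:", issue)
```

A check like this only catches mechanical regressions (broken formulas, stripped formats, unreadable files); the domain-calibration and convention failures the panel flagged still require expert review.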