🤖 AI Summary
The Forecasting Research Institute released an updated ForecastBench, its automated LLM forecasting benchmark that opens a new question round every two weeks, along with new results showing steady LLM progress on predicting real-world events. Key findings: superforecasters still lead with a difficulty-adjusted Brier score of 0.081, while the best LLM (GPT-4.5) scores 0.101, a modest gap that has been shrinking. LLMs now outperform non-expert public forecasters (the public median fell from #2 to #22 on the leaderboard), and linear trends imply parity with superforecasters around late 2026 (95% CI: Dec 2025–Jan 2028). Each round now comprises 500 questions: 250 dataset questions from sources like FRED, ACLED, and Wikipedia, and 250 market questions from Manifold, Metaculus, Polymarket, and others. The leaderboard updates nightly, and the benchmark is open for submissions.
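To make the headline numbers concrete, here is a minimal sketch of the plain (unadjusted) Brier score and of the linear-trend extrapolation behind the parity projection. The difficulty adjustment itself is FRI's own method and is not reproduced here; the function and figures below are illustrative, using only the numbers quoted above.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean Brier score over binary questions: (forecast - outcome)^2.
    Lower is better; an uninformed 0.5 forecast scores 0.25."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# Linear-trend extrapolation from the figures quoted above:
# best LLM at 0.101, superforecasters at 0.081, and aggregate
# improvement of ~0.016 Brier points per year.
gap = 0.101 - 0.081       # 0.020 Brier points
rate_per_year = 0.016     # aggregate LLM improvement
print(f"years to parity: {gap / rate_per_year:.2f}")  # ~1.25 years out
```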
Technically, ForecastBench introduces a difficulty-adjusted Brier score so forecasters who answer different question sets can be compared fairly, and it separates a Baseline leaderboard (out-of-the-box LLMs) from a Tournament leaderboard (models allowed market information). Aggregate improvement is ~0.016 Brier points/year; dataset questions improve at ~0.020/year (parity projected for June 2026) and market questions at ~0.015/year, though baseline market forecasts (made without access to market prices) improve faster, at ~0.036/year. The report also flags a shortcut: some LLMs simply copy the market forecasts they are given (GPT-4.5's correlation with market inputs is 0.994), inflating apparent skill. The update highlights forecasting as a robust, contamination-free proxy for LLM reasoning, since questions resolve only after models are trained, with immediate practical implications for decision support.
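The shortcut flag above amounts to a correlation check between a model's submissions and the market prices it was shown. A minimal sketch, assuming aligned probability arrays; the function name and data here are hypothetical, not FRI's code:

```python
import numpy as np

def market_copy_correlation(model_probs, market_probs):
    """Pearson correlation between a model's forecasts and the market
    prices it saw. Values near 1.0 (e.g., GPT-4.5's reported 0.994)
    suggest the model is echoing the market rather than adding skill."""
    return float(np.corrcoef(model_probs, market_probs)[0, 1])

# Hypothetical round: a model that only lightly perturbs market prices.
rng = np.random.default_rng(0)
market = np.array([0.10, 0.35, 0.50, 0.72, 0.90])
model = np.clip(market + rng.normal(0.0, 0.01, size=market.shape), 0.0, 1.0)
print(f"correlation: {market_copy_correlation(model, market):.3f}")
```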