🤖 AI Summary
Google today shipped Gemini 3 Pro, a multimodal, long-context upgrade that the author describes as “Gemini 2.5 upgraded to match the leading rival models.” Key specs: January 2025 knowledge cutoff, support for up to 1 million input tokens and 64k output tokens, and multimodal input across text, images, audio and video. Google’s model card reports Gemini 3 Pro edging out Claude Sonnet 4.5 and GPT-5.1 on many public and internal benchmarks, with the biggest gains on multimodal and long-context tasks: MMMU/Video-MMMU at roughly 81–88%, ScreenSpot-Pro at 72.7% vs 11.4% for Gemini 2.5, and MRCR 1M pointwise at 26.3%, where the rival models don’t support 1M-token contexts at all. Pricing sits between Gemini 2.5 Pro and Claude Sonnet 4.5, with tiered per-token rates that work out cheaper than Claude Sonnet but more expensive than Gemini 2.5 for many workloads.
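For orientation, here is a minimal sketch of what a multimodal call with structured JSON output might look like via the google-genai Python SDK. The model id "gemini-3-pro-preview", the file path, and the prompt are illustrative assumptions, not details taken from the post.

```python
# Hypothetical sketch: calling Gemini 3 Pro through the google-genai Python SDK
# with an image input and JSON output. The model id, file path, and prompt are
# illustrative assumptions.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("screenshot.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model id
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract every field visible in this screenshot as a JSON object.",
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",  # ask for structured JSON back
        max_output_tokens=65536,                # up to the reported 64k output cap
    ),
)
print(response.text)
```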
In hands-on tests via AI Studio, the model handled complex image-to-JSON conversions and produced a robust alt-text table, and it transcribed a 3h33m city council meeting once the audio had been compressed (the initial run with the full-size file produced an “internal error”). The transcript included a structured outline, timestamps and speaker labels, but showed fidelity issues: it omitted the verbatim Spanish interpreter passages, misaligned some timestamps, and truncated the ending. That mix highlights real practical strengths alongside remaining reliability gaps for long-form audio and strict verbatim transcription. Overall, Gemini 3 Pro looks like a substantive multimodal/agentic step forward, strong for math, coding and multimodal reasoning, but independent benchmark verification and production-grade robustness (especially for long audio and precise timestamps) will determine adoption.
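The long-audio workflow could look roughly like the sketch below: re-encode the recording with ffmpeg to shrink it, upload it through the Files API, then prompt for a timestamped, speaker-labelled transcript. The ffmpeg settings, model id and prompt are assumptions; the post only says the audio was compressed after the full-size upload failed.

```python
# Hypothetical sketch of the long-audio workflow: re-encode the recording to a
# smaller file with ffmpeg, upload it via the Files API, then ask for a
# timestamped, speaker-labelled transcript. Settings, model id and prompt are
# assumptions, not the author's exact steps.
import subprocess
from google import genai

# Re-encode to 16 kHz mono at a low bitrate to shrink the 3h33m recording.
subprocess.run(
    [
        "ffmpeg", "-i", "council_meeting.m4a",
        "-ac", "1", "-ar", "16000", "-b:a", "32k",
        "council_meeting_small.mp3",
    ],
    check=True,
)

client = genai.Client(api_key="YOUR_API_KEY")

# Upload through the Files API rather than inlining several hours of audio.
audio_file = client.files.upload(file="council_meeting_small.mp3")

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model id
    contents=[
        audio_file,
        "Transcribe this city council meeting. Provide a structured outline, "
        "timestamps, and speaker labels.",
    ],
)
print(response.text)
```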