Small Model Forensics (blog.0xmmo.co)

🤖 AI Summary
An analysis of 2,000 API calls to nine small closed-weight models from three providers surfaced several inference-scaling patterns. Among them: gpt-4.1-nano was fastest on tiny, sub-second queries, while gemini-3.1-flash-lite won on large inputs above 600 KB. First-token latency did not scale linearly with input size, contrary to a common assumption about model performance.

More counterintuitively, some models' per-token decode cost falls as context grows. gemini-3.1-flash-lite's latency curve is remarkably flat: a 4,200-fold increase in context size cost only a 7x increase in wall time, pointing to provider-specific optimizations that pay off at scale. The analysis also found that token counts for the same input differ across providers, which matters for accurate cost estimation. Together, the results argue for benchmarking model choices against the specific query sizes and latency requirements of the use case at hand.
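The core measurement behind these comparisons, timing the first streamed token separately from the full response, can be sketched as below. The streaming source here is a simulated stand-in (the post's actual harness and provider endpoints are not shown), but the timing logic is the same one you would wrap around a real streaming API iterator:

```python
import time

def measure_stream(stream):
    """Consume a token stream and return (time_to_first_token, total_wall_time, n_tokens)."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _token in stream:
        if ttft is None:
            # First-token latency: dominated by prefill and queueing, not decode.
            ttft = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    return ttft, total, n_tokens

def fake_stream(n_tokens, prefill_delay, per_token_delay):
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(prefill_delay)          # prefill / queueing before the first token
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token_delay)    # per-token decode time

# Example: 20 tokens, 50 ms simulated prefill, 5 ms/token decode.
ttft, total, n = measure_stream(fake_stream(20, 0.05, 0.005))
```

Separating time-to-first-token from total wall time is what makes the flat-latency finding visible: a model whose prefill scales sublinearly with input size shows a TTFT curve that barely moves even as context grows by orders of magnitude.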