New Model Scheduling (ollama.com)

🤖 AI Summary
Ollama has rolled out a redesigned model scheduling engine that measures a model's exact GPU memory needs before execution instead of relying on conservative estimates. That precision eliminates over-allocation and dramatically reduces out-of-memory crashes, lets the system grant more GPU memory to models (improving token throughput), and places layers across multiple or mismatched GPUs more efficiently. Memory reporting now lines up with system tools, so `ollama ps` matches what `nvidia-smi` shows, making utilization transparent and predictable. The new memory-management behavior is enabled by default for all models running on Ollama's new engine, with more models transitioning soon.

Practically, the change lets Ollama load full model layers onto GPUs and boosts both prompt evaluation and token generation for large-context and multimodal workloads. Examples: gemma3:12b on a single RTX 4090 with a 128k context window rose from 52.02 to 85.54 tokens/s and loaded 49/49 layers on GPU (VRAM ~19.9 → 21.4 GiB). mistral-small3.2 with image input on two RTX 4090s saw prompt evaluation jump from 127.84 to 1,380.24 tokens/s and token generation from 43.15 to 55.61 tokens/s, while also fitting the vision model on GPU.

For practitioners, this means more reliable deployments, higher throughput for long-context and multimodal models, and smoother scaling across heterogeneous GPU setups.
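To see these effects on your own hardware, you can reproduce a rough version of the comparison against Ollama's documented HTTP API: request a large context window, read the `prompt_eval_*` and `eval_*` timing fields from the response, and cross-check reported VRAM against `nvidia-smi`. The sketch below is a minimal example, assuming a local Ollama server on the default port; the model name, prompt, and `num_ctx=131072` value are illustrative choices, not the harness used to produce the numbers quoted above.

```python
import requests

OLLAMA = "http://localhost:11434"  # default Ollama endpoint

def generate(model: str, prompt: str, num_ctx: int) -> dict:
    """Run one non-streaming generation and return the response with timing metadata."""
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            # Ask for a large context window; the new scheduler sizes GPU memory
            # from the actual requirement rather than a conservative estimate.
            "options": {"num_ctx": num_ctx},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()

def tokens_per_second(count: int, duration_ns: int) -> float:
    """Durations in the API response are reported in nanoseconds."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0

if __name__ == "__main__":
    out = generate(
        "gemma3:12b",
        "Summarize the benefits of precise GPU memory scheduling.",
        num_ctx=131072,  # 128k context, as in the quoted benchmark
    )
    print("prompt eval:", round(tokens_per_second(out["prompt_eval_count"], out["prompt_eval_duration"]), 2), "tok/s")
    print("generation: ", round(tokens_per_second(out["eval_count"], out["eval_duration"]), 2), "tok/s")

    # /api/ps backs `ollama ps`; compare its reported VRAM with nvidia-smi.
    ps = requests.get(f"{OLLAMA}/api/ps", timeout=10).json()
    for m in ps.get("models", []):
        vram_gib = m.get("size_vram", 0) / 2**30
        print(f"{m['name']}: ~{vram_gib:.1f} GiB VRAM reported")
```

Running `nvidia-smi` alongside this script should now show memory usage close to what the `/api/ps` output (and `ollama ps`) reports, which is the transparency the new scheduler is aiming for.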