🤖 AI Summary
NVIDIA and Ollama performance tests on DGX Spark (firmware 580.95.05, Ollama v0.12.6) measured throughput across a set of models and quantizations to give practical numbers for deployment tuning. Tests ran 10× with temperature 0, outputs constrained to 500 tokens, caching disabled, and a fixed “write an in-depth summary…” prompt using an excerpt from A Tale of Two Cities. The results emphasize differences between prefill (context-processing) and decode (autoregressive generation) throughput and show how model size and quantization mode materially affect latency and token rates.
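A minimal sketch of how one such run could be reproduced against Ollama's local REST API, assuming a default local server; the model tag and prompt text are placeholders, the way caching was disabled is not shown, and the article's own test scripts remain the authoritative reference:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # default local Ollama endpoint
MODEL = "llama3.1:8b-instruct-q4_K_M"                # placeholder tag; any benchmarked model works
PROMPT = "Write an in-depth summary of the following excerpt: ..."  # excerpt elided

def run_once() -> dict:
    """Single non-streaming generation using the benchmark settings from the article."""
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": PROMPT,
        "stream": False,
        "options": {
            "temperature": 0,     # deterministic decoding
            "num_predict": 500,   # cap output at 500 tokens
        },
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()

def throughput(result: dict) -> tuple[float, float]:
    """Prefill and decode tokens/s from Ollama's nanosecond timing fields."""
    prefill = result["prompt_eval_count"] / (result["prompt_eval_duration"] / 1e9)
    decode = result["eval_count"] / (result["eval_duration"] / 1e9)
    return prefill, decode

if __name__ == "__main__":
    runs = [throughput(run_once()) for _ in range(10)]  # the article averages 10 runs
    avg_prefill = sum(p for p, _ in runs) / len(runs)
    avg_decode = sum(d for _, d in runs) / len(runs)
    print(f"prefill: {avg_prefill:.1f} tok/s, decode: {avg_decode:.1f} tok/s")
```

The split between `prompt_eval_*` and `eval_*` in the response is what separates prefill (prompt processing) from decode (generation) throughput in the numbers below.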
Key results: smaller models and 4-bit quantization delivered the highest prefill rates (llama3.1 8B q4_K_M reached 7.614k tokens/s prefill and 38.02 tokens/s decode), while gpt-oss 20B MXFP4 showed strong decode speed (58.27 tokens/s). gpt-oss 120B MXFP4 achieved 1.169k tokens/s prefill and 41.14 tokens/s decode while fitting entirely in the DGX Spark's 128 GB of unified memory on the GB10 Grace Blackwell Superchip. The quantization trade-offs are clear: q4_K_M often outperforms q8_0 (e.g., gemma3 12B reached 1.894k tokens/s prefill at q4_K_M vs 1.406k at q8_0), and some MXFP4 GGUFs distributed online further quantize the attention layers to q8_0, whereas Ollama keeps those layers in BF16 as OpenAI intended.
Implications: these real-world benchmarks help ops teams choose model/quantization pairs for their target throughput-vs-quality point, and they underline the value of updating DGX Spark firmware to 580.95.05 or later for best performance. The test scripts are available for replication, and the Ollama and Codex integrations are presented as simple paths to running gpt-oss models on DGX Spark.
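As an illustration of that "simple path," a hedged sketch of calling gpt-oss through the official Ollama Python client; the model tag and prompt are assumptions, and the Codex integration is configured separately and not shown here:

```python
import ollama  # official Ollama Python client: pip install ollama

# Assumes `ollama pull gpt-oss:20b` has already been run on the DGX Spark.
response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user",
               "content": "Summarize the trade-offs between q4_K_M and q8_0 quantization."}],
)
print(response["message"]["content"])
```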