🤖 AI Summary
Apple’s Machine Learning Research team published benchmarks showing that the new M5 chip substantially improves local LLM inference over the M4 when running models with the open-source MLX framework (specifically MLX LM). MLX is a NumPy-like array framework tuned for Apple silicon that leverages unified memory and provides CPU/GPU execution, neural network and optimizer packages, automatic differentiation, and quantization support, letting developers download Hugging Face models and run or fine-tune them locally with little friction. Apple attributes the gains to the M5’s GPU Neural Accelerators (dedicated matrix-multiply ops) and higher memory bandwidth.
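To make the workflow concrete, here is a minimal sketch of pulling a quantized Hugging Face model and generating text on-device with the `mlx-lm` Python package; the model repo name is illustrative and not necessarily one of the checkpoints Apple benchmarked.

```python
# Minimal sketch: run a quantized Hugging Face model locally with MLX LM.
# Assumes `pip install mlx-lm`; the model repo below is illustrative.
from mlx_lm import load, generate

# Downloads (or reuses a cached copy of) the weights and tokenizer, placing
# them in unified memory shared by the CPU and GPU.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompt = "Explain unified memory on Apple silicon in one paragraph."

# Generation runs entirely on-device; no network access is needed after the download.
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```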
Technically, Apple measured time-to-first-token (compute-bound) and generation speed for an extra 128 tokens (memory-bandwidth-bound) using a 4,096-token prompt, across models including Qwen 1.7B/8B/14B (BF16 and 4-bit quantized), a Qwen 30B MoE with 3B active parameters (4-bit quantized), and GPT OSS 20B (MXFP4). The M5 delivered a 19–27% boost in generation speed over the M4, roughly tracking its ≈28% higher memory bandwidth (153 GB/s vs. the M4's 120 GB/s); it kept an 8B BF16 or 4-bit 30B MoE inference workload under ~18 GB on a 24 GB MacBook Pro; and it ran image generation more than 3.8× faster. For developers and edge ML use cases this means lower latency, higher on-device throughput, and more feasible local execution of larger quantized models for privacy-sensitive or offline workflows.
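As a rough illustration of the two metrics, the sketch below times prompt processing (time-to-first-token) separately from steady-state token generation using MLX LM's streaming API; the model name, filler prompt, and timing loop are assumptions for illustration, not Apple's benchmark harness.

```python
# Rough sketch: split time-to-first-token (compute-bound prompt processing)
# from tokens/sec for 128 generated tokens (memory-bandwidth-bound).
# Assumes `pip install mlx-lm`; model and prompt are illustrative.
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompt = "hello " * 4096  # crude stand-in for a ~4,096-token prompt

start = time.perf_counter()
first_token_time = None
n_tokens = 0

# stream_generate yields once per generated token, so the first yield marks
# the end of prompt processing (time-to-first-token).
for _ in stream_generate(model, tokenizer, prompt=prompt, max_tokens=128):
    if first_token_time is None:
        first_token_time = time.perf_counter() - start
    n_tokens += 1

total = time.perf_counter() - start
gen_tps = (n_tokens - 1) / max(total - first_token_time, 1e-9)
print(f"TTFT: {first_token_time:.2f}s  generation: {gen_tps:.1f} tok/s")
```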