🤖 AI Summary
Draw Things has released Metal FlashAttention v2.5 with Neural Accelerators (preview) in version 1.20251107.1 of the app, bringing up to 4.6× end-to-end speedups on Apple's new M5 silicon versus the M4 and delivering the fastest performance on non‑Pro M-series chips. By running core operators (matrix multiply, attention, and segmented matrix multiplication, the latter crucial for Mixture-of-Experts models) on Apple's Neural Accelerators, and pairing that with aggressive memory management and Metal-level optimizations beyond what MPSGraph offers, the team reports 3.6–5.5× raw kernel gains and 3.3–4.6× end-to-end improvements on an M5 iPad. Practical results include sub-minute high-resolution image generation for large (12B–20B) models and a 5-second 480p video generation demo on a 16 GB M5 iPad.
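To make "segmented matrix multiplication" concrete: in an MoE layer, tokens are grouped by the expert they were routed to, and each contiguous segment is multiplied by that expert's weight matrix in one pass. The Swift sketch below is only an illustrative CPU reference for that operation's shape, not the Metal FlashAttention kernel; the function and parameter names (`segmentedMatmul`, `segmentStarts`) are hypothetical.

```swift
import Foundation

/// Illustrative CPU reference for a segmented (grouped) matmul as used in
/// Mixture-of-Experts layers. Expert `e` owns the contiguous token rows
/// segmentStarts[e] ..< segmentStarts[e + 1].
/// - Parameters:
///   - tokens:        flattened [numTokens x dModel] activations, row-major
///   - expertWeights: one flattened [dModel x dOut] matrix per expert, row-major
///   - segmentStarts: numExperts + 1 row offsets into `tokens`
/// - Returns: flattened [numTokens x dOut] outputs, row-major
func segmentedMatmul(tokens: [Float], dModel: Int,
                     expertWeights: [[Float]], dOut: Int,
                     segmentStarts: [Int]) -> [Float] {
    let numTokens = segmentStarts.last ?? 0
    var output = [Float](repeating: 0, count: numTokens * dOut)
    for e in 0..<expertWeights.count {
        let w = expertWeights[e]
        // Each expert multiplies only its own segment of token rows.
        for row in segmentStarts[e]..<segmentStarts[e + 1] {
            for j in 0..<dOut {
                var acc: Float = 0
                for k in 0..<dModel {
                    acc += tokens[row * dModel + k] * w[k * dOut + j]
                }
                output[row * dOut + j] = acc
            }
        }
    }
    return output
}

// Two tokens routed to expert 0, one to expert 1 (dModel = 2, dOut = 2).
let out = segmentedMatmul(
    tokens: [1, 0,  0, 1,  1, 1], dModel: 2,
    expertWeights: [[1, 2, 3, 4], [10, 0, 0, 10]], dOut: 2,
    segmentStarts: [0, 2, 3])
print(out)  // [1.0, 2.0, 3.0, 4.0, 10.0, 10.0]
```

The released kernels fuse this pattern on the GPU so all experts are processed without launching one matmul per expert; the sketch only shows the data layout the operator works over.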
This is technically significant because it shows that on-device Apple silicon can rival or exceed prior Max-class chips for large multimodal workloads (sometimes outperforming the M2 Max and narrowing the gap to the M3 Ultra), enabling faster local inference and MoE support. The preview has caveats: BF16 is disabled, shader specialization can take roughly 10 seconds, and there are performance cliffs for odd attention sequence lengths and very large head dimensions; the peak numbers were measured with adequate cooling. The source code and implementation are published on GitHub, with deeper engineering details promised in a follow-up post.