Parallax: Make your local AI go brrr (github.com)

🤖 AI Summary
Gradient released Parallax v0.0.1, a fully decentralized inference engine that lets you stitch together an AI cluster from heterogeneous, distributed nodes (personal devices, workstations, cloud GPUs). It's designed for local hosting of LLMs with cross-platform support and features focused on throughput and latency: pipeline-parallel model sharding, dynamic KV-cache management and continuous batching (macOS-optimized), plus dynamic request scheduling and routing. The release targets practitioners who want private, low-latency inference across mixed hardware and network environments, and it supports many open models (Qwen, Llama, gpt-oss, etc.).

Technically, Parallax requires Python 3.11 or newer but below 3.14, and recommends Ubuntu 24.04 for Blackwell GPUs. Linux installs support source and Docker workflows, macOS (Apple silicon) installs via pip in a venv, and Windows gets an installer (requires an admin PowerShell). GPU Docker images are provided (gradientservice/parallax:latest-blackwell and :latest-hopper) and should be run with --gpus all.

The runtime is scheduler-driven (parallax run → parallax join → parallax chat) with a web UI on ports 3001/3002 and an HTTP chat completions API; telemetry can be disabled with -u. Advanced users can run nodes without a scheduler by manually assigning start/end layers for pipeline parallelism (an example is shown for Qwen3-0.6B). Parallax's combination of decentralized orchestration and pipeline-level sharding makes it a practical option for scaling inference across edge-to-cloud heterogeneous fleets.
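The repo describes the chat endpoint only as an "HTTP chat completions API", so here is a minimal client sketch assuming an OpenAI-style /v1/chat/completions route on the scheduler host; the port (3001), path, payload fields, and response shape are assumptions to adapt to whatever your running instance actually exposes.

```python
# Minimal chat-completions client sketch. The endpoint path, port, and
# payload schema below are assumptions (OpenAI-style), not confirmed
# Parallax API details -- adjust to match your running instance.
import json
import urllib.request

BASE_URL = "http://localhost:3001"  # assumed scheduler/web-UI host:port


def chat(prompt: str, model: str = "Qwen/Qwen3-0.6B") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # OpenAI-style response shape assumed here.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Say hello from my distributed cluster."))
```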
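To make the manual start/end-layer idea concrete, here is a small, framework-agnostic sketch of contiguous layer-range sharding for pipeline parallelism. It is illustrative only, not Parallax's node API: it just computes the [start, end) slice of the layer stack each node would own; at runtime the hidden state for a request is then handed from one node's slice to the next in order.

```python
# Illustrative sketch of contiguous layer-range sharding for pipeline
# parallelism. Not Parallax's API -- this only computes the partition
# behind manual start/end layer assignment.
from dataclasses import dataclass


@dataclass
class Shard:
    node: str
    start: int  # first layer index owned by this node (inclusive)
    end: int    # one past the last layer index (exclusive)


def split_layers(num_layers: int, nodes: list[str]) -> list[Shard]:
    """Partition num_layers into contiguous, roughly equal ranges."""
    shards, cursor = [], 0
    for i, node in enumerate(nodes):
        remaining_nodes = len(nodes) - i
        count = (num_layers - cursor + remaining_nodes - 1) // remaining_nodes
        shards.append(Shard(node, cursor, cursor + count))
        cursor += count
    return shards


# Hypothetical example: a 28-layer model (Qwen3-0.6B-sized) split across
# a laptop and a workstation -> layers [0, 14) and [14, 28).
for shard in split_layers(28, ["macbook", "workstation"]):
    print(f"{shard.node}: layers {shard.start}..{shard.end - 1}")
```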