Awesome-distributed-ML – A curated list for distributed [faster] LLM training (github.com)

🤖 AI Summary
A new curated "Awesome Distributed Machine Learning System" list gathers the leading open-source projects, frameworks, and papers for distributed training and inference, focused squarely on large models and LLM-scale workloads. It collects:

- production and research systems (Megatron-LM, DeepSpeed, ColossalAI, OneFlow, Alpa, FlexFlow, FairScale);
- auto-parallelizers and sharding engines (GSPMD, Auto-Parallel/Rhino, Alpa);
- memory and rematerialization techniques (ZeRO/ZeRO-Offload/ZeRO-Infinity, Checkmate, Dynamic Tensor Rematerialization, ActNN);
- pipeline and scheduling innovations (GPipe, PipeDream, Zero-Bubble, Hanayo, Mobius);
- MoE toolchains (GShard, DeepSpeed-MoE, Tutel);
- inference accelerators (DeepSpeed Inference, FlexGen, EnergonAI);
- niche tooling for GNNs, collectives, and edge/IoT clusters (Blink, MSCCLang, exo, Nerlnet).

For the AI/ML community this is a compact, actionable map of state-of-the-art scaling strategies and their tradeoffs: model/data/pipeline parallelism, auto-parallelization, memory offload and compression, communication-computation overlap, Mixture-of-Experts orchestration, long-context training, resilient and cost-aware scheduling, and high-throughput inference. By centralizing implementations and seminal papers, the list speeds up reproduction and engineering decisions, helping researchers and infra teams pick proven components to push LLMs toward trillion-parameter scale, longer context windows, and more affordable training and inference on commodity hardware.
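
To make the rematerialization idea concrete, here is a minimal sketch using PyTorch's built-in `torch.utils.checkpoint` as a stand-in for the recompute-instead-of-store strategy behind Checkmate and Dynamic Tensor Rematerialization (those systems decide *what* to recompute automatically; this sketch simply checkpoints every block). The model and sizes are illustrative, not taken from the list.

```python
# Sketch: activation rematerialization via gradient checkpointing.
# Assumes PyTorch; CheckpointedMLP and its dimensions are hypothetical.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Instead of storing every intermediate activation for backward,
            # recompute this block's forward pass during the backward pass,
            # trading extra compute for a much smaller activation footprint.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
x = torch.randn(4, 1024, requires_grad=True)
model(x).sum().backward()  # activations are rematerialized on the fly
```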
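
For the ZeRO family of memory optimizations, a hedged sketch of how DeepSpeed is typically configured is below, assuming a job launched with the `deepspeed` CLI launcher; the config keys follow DeepSpeed's documented JSON schema, while the tiny model and hyperparameters are placeholders.

```python
# Sketch: ZeRO stage-3 sharding with CPU offload in DeepSpeed.
# Model, batch size, and learning rate are illustrative placeholders.
import deepspeed
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload: optimizer states to CPU
        "offload_param": {"device": "cpu"},      # ZeRO-Infinity-style parameter offload
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that handles sharding,
# offload, and communication-computation overlap behind the usual train loop.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```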
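
And to see why the pipeline-scheduling entries matter, a toy calculation of the GPipe pipeline "bubble" (stages idling while the pipeline fills and drains) that 1F1B-style and Zero-Bubble schedules are designed to shrink; the stage and microbatch counts are arbitrary.

```python
# With p pipeline stages and m microbatches, the GPipe schedule leaves a
# bubble fraction of (p - 1) / (m + p - 1) of the step idle.
def gpipe_bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

for m in (4, 16, 64):
    print(f"8 stages, {m:3d} microbatches -> "
          f"{gpipe_bubble_fraction(8, m):.1%} of step time idle")
```

More microbatches amortize the bubble away, which is exactly the lever GPipe pulls, while PipeDream and Zero-Bubble restructure the schedule itself.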