GPT-OSS from Scratch on AMD GPUs (github.com)

🤖 AI Summary
Open-source C++ implementation gpt-oss-amd brings OpenAI's newly released GPT-OSS 20B and 120B models to AMD GPUs with no external ML libraries. Designed as a lightweight, library-free runtime (inspired by llama2.c), it uses HIP for custom kernels and avoids rocBLAS/hipBLAS/RCCL/MPI, aiming to make end-to-end LLM inference optimizations, from kernel to system, transparent and easy to adapt. The project is MIT-licensed and ships with build/run scripts, model conversion tools (safetensors → .bin), a tokenizer compatible with OpenAI's o200k_harmony encoding via tiktoken, and example serving modes (chat, single-prompt, batch).

Technically, gpt-oss-amd implements multi-streaming, batching, multi-GPU communication, optimized CPU–GPU–SRAM memory access, FlashAttention, matrix-core–based GEMM, and MoE routing load balancing, all via custom HIP kernels. On a single node with 8× AMD MI250 GPUs it reports >30k TPS for the 20B model and ~10k TPS for the 120B model in custom benchmarks, demonstrating AMD's potential for large-scale inference without NVIDIA hardware or CUDA.

For researchers and practitioners, this fills a tooling gap in the AMD ecosystem, offers a reproducible platform for low-level LLM optimization, and lowers the barrier to experimenting with large-model inference on non-NVIDIA hardware.
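To make the "custom HIP kernels instead of rocBLAS/hipBLAS" approach concrete, here is a minimal, hypothetical sketch, not code from the repository: a hand-written HIP matrix-vector kernel (the core operation of token-by-token decoding), allocated, copied, and launched directly with the HIP runtime API. All names, shapes, and values are illustrative assumptions.

```cpp
// Illustrative sketch only: a library-free HIP matvec, y = W * x,
// with W stored row-major as (rows x cols). Not gpt-oss-amd's actual kernel.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void matvec(const float* W, const float* x, float* y, int rows, int cols) {
    // One thread per output row; each thread reduces one dot product.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = 0.0f;
    for (int j = 0; j < cols; ++j) {
        acc += W[row * cols + j] * x[j];
    }
    y[row] = acc;
}

int main() {
    const int rows = 1024, cols = 4096;  // hypothetical layer dimensions
    std::vector<float> hW(rows * cols, 0.01f), hx(cols, 1.0f), hy(rows, 0.0f);

    float *dW = nullptr, *dx = nullptr, *dy = nullptr;
    hipMalloc((void**)&dW, hW.size() * sizeof(float));
    hipMalloc((void**)&dx, hx.size() * sizeof(float));
    hipMalloc((void**)&dy, hy.size() * sizeof(float));
    hipMemcpy(dW, hW.data(), hW.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dx, hx.data(), hx.size() * sizeof(float), hipMemcpyHostToDevice);

    // Launch the hand-written kernel directly; no BLAS library in the path.
    dim3 block(256), grid((rows + block.x - 1) / block.x);
    hipLaunchKernelGGL(matvec, grid, block, 0, 0, dW, dx, dy, rows, cols);
    hipDeviceSynchronize();

    hipMemcpy(hy.data(), dy, hy.size() * sizeof(float), hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // expect 4096 * 0.01 = 40.96

    hipFree(dW); hipFree(dx); hipFree(dy);
    return 0;
}
```

The project's real kernels go much further (FlashAttention, matrix-core GEMM, multi-GPU communication, MoE load balancing), but the basic pattern shown here (allocate with hipMalloc, copy with hipMemcpy, launch a hand-written __global__ kernel) is the same, just heavily tuned.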