Show HN: I wrote inference for Qwen3 0.6B in C/CUDA (github.com)

🤖 AI Summary
A hobbyist developer has released qwen.c, a minimal C/CUDA inference implementation for Qwen3-0.6B that builds into a shared library and is driven via a small Python front end. The repo loads a safetensors model, constructs the KV cache and RoPE matrices (for a max context length of 2048), and produces token outputs via a simple argmax decoder. It is currently CUDA-only, uses mostly naive CUDA kernels, and requires tweaking a hardcoded layer/head count in json.c to support other Qwen3 variants. Build targets produce a library used by chat.py, while run.c provides a basic CLI entrypoint that prints generated tokens.

This is significant as a compact, hands-on reference for low-level transformer inference: it's useful for learning C/CUDA ML programming and for experimentation (e.g., kernel optimization, quantization, CPU offload, dynamic KV caching, improved sampling like top-k/top-p, and better memory management). However, it's not production-grade: decoding uses greedy argmax (risking repetition), KV sizing is fixed at init, only safetensors are supported natively, and many performance/feature improvements remain to be implemented. The project is MIT-licensed and offers many clear extension points for contributors interested in inference engineering and accelerator-level optimizations.
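To illustrate why greedy argmax decoding risks repetition, here is a minimal sketch of that decoding step in plain C. It is not taken from the qwen.c source; the names (`argmax`, `logits`, `vocab_size`) are illustrative, and the point is simply that the highest-scoring token is always selected deterministically, which is why the summary lists top-k/top-p sampling as an extension point.

```c
#include <stddef.h>

/* Greedy decoding: return the index of the largest logit.
 * Ties resolve to the lowest index. Called once per generated token,
 * with the chosen token fed back in as the next input. */
static size_t argmax(const float *logits, size_t vocab_size) {
    size_t best = 0;
    for (size_t i = 1; i < vocab_size; i++) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```

Because this picks the same token for the same logits every time, loops in the output tend to reinforce themselves; stochastic samplers (top-k, top-p, temperature) break that determinism.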