Show HN: Everything it took to run an LLM at 10k tok/s on H200s (www.relace.ai)

🤖 AI Summary
Relace open-sourced the engineering and data pipeline behind Relace Apply 3, a small, specialized "apply" LLM that runs at 10k+ tokens/sec on Nvidia H200s and reliably merges "lazy" diffs into existing code. The core idea: instead of having a frontier model re-generate entire files, have it emit a minimal diff and use a tiny, fast LLM to infer intent and apply that diff. This cuts latency and cost for coding agents (avoiding 100+ second rewrites of 10k-token files) while handling the real-world pathological diffs that fixed string-matching algorithms miss.

Key technical takeaways: they built a high-quality training set by snapshotting real prompt-to-app contexts, distilled merged outputs from a frontier teacher with rejection sampling, and scaled filtering with an LLM-as-a-judge aligned on 500 human-annotated examples to reach roughly a 1% false-positive rate. After syntactic and regex filtering, the final training set was ~145k examples across many languages. The apply models are 3–8B open-source bases fine-tuned with LoRA (rank 128, alpha 32, LR 5e-5, AdamW), trained in BF16 on a single H200 via Modal with up to 64k context, then converted to FP8 with llm-compressor to exploit the H200's FP8 tensor cores. The result: a fast, robust merge model that preserves the base model's coding knowledge, reduces end-to-end agent latency and cost, and is practical to re-run at production scale.
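A minimal sketch of the apply pattern described above: a frontier model produces a lazy diff, and a small apply model merges it into the original file. The endpoint, model name, and prompt layout here are illustrative assumptions, not Relace's actual API or prompt template.

```python
# Hypothetical apply-model call via an OpenAI-compatible endpoint.
# All names below (base_url, model id, tag format) are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="https://example-apply-endpoint/v1", api_key="...")

original_file = open("app.py").read()
lazy_diff = """\
# ... existing imports ...
def handler(request):
    # ... existing validation ...
    return render(request, "dashboard.html", context)
"""

response = client.chat.completions.create(
    model="apply-model",  # placeholder name for a small fine-tuned merge model
    messages=[{
        "role": "user",
        "content": f"<original>\n{original_file}\n</original>\n<update>\n{lazy_diff}\n</update>",
    }],
    temperature=0,
)
merged_file = response.choices[0].message.content  # full file with the diff applied
```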
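A sketch of a LoRA fine-tuning setup matching the reported hyperparameters (rank 128, alpha 32, LR 5e-5, AdamW, BF16). The base model ID, target modules, and trainer wiring are assumptions; the summary only specifies the hyperparameters and hardware.

```python
# LoRA fine-tuning sketch with Hugging Face transformers + peft.
# Base model and target_modules are assumed, not stated in the post.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",   # assumed 3-8B open-source base
    torch_dtype=torch.bfloat16,          # BF16 training as described
)

lora_config = LoraConfig(
    r=128,                               # LoRA rank from the post
    lora_alpha=32,                       # LoRA alpha from the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="apply-lora",
    learning_rate=5e-5,                  # LR from the post
    optim="adamw_torch",                 # AdamW optimizer
    bf16=True,
    per_device_train_batch_size=1,       # long-context (up to 64k tokens) examples
    num_train_epochs=1,                  # assumption
)
```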
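And a sketch of the FP8 conversion step with llm-compressor, which produces weights that map onto the H200's FP8 tensor cores. Exact import paths and save arguments vary across llm-compressor releases; this follows the commonly documented one-shot dynamic-FP8 flow, with an assumed merged-model path.

```python
# FP8 dynamic quantization sketch with llm-compressor.
# "apply-lora-merged" is an assumed path to the base model with LoRA weights merged in.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("apply-lora-merged", torch_dtype="auto")

recipe = QuantizationModifier(
    targets="Linear",          # quantize all linear layers
    scheme="FP8_DYNAMIC",      # FP8 weights, dynamic per-token activation scales
    ignore=["lm_head"],        # keep the output head in higher precision
)
oneshot(model=model, recipe=recipe)

model.save_pretrained("apply-model-FP8", save_compressed=True)
```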