Recommendation List for Trending Open Source Model Providers (docs.google.com)

🤖 AI Summary
This recommendation list ranks trending open-source model providers by evaluation quality, time-to-first-token (the "aha" moment), pricing, uptime, and quantization, and is meant to inform Cline's OpenRouter provider-routing decisions. Highlights:

- Groq (no quantization) delivers the best perceived quality and the fastest experience (≈1058 tps, 0.2 s first-token latency), and uniquely supports prompt caching, cutting effective input cost from $1 to $0.50 on cache hits and producing the "magical AHA" (see the cost sketch below).
- Fireworks is a reasonable fp8 fallback (≈105 tps, 0.8 s, $0.60).
- For coding, Qwen3 Coder (480B A35B) is the recommended model, with Nebius (fp8, 84 tps, 0.48 s, $0.40) and BaseTen (fp8, 104 tps, 0.84 s, $0.38) as provider options; BaseTen is cheaper but has higher first-token latency.
- GLM 4.6 and its variants run as fp8/bf16 mixes with moderate throughput and varied latencies and prices; cache hits can drastically reduce per-call costs.
- DeepSeek's best model (deepseek-v3.2-exp) is slower (≈25–28 tps, 1.05–1.55 s) but very cheap after cache hits, and carries heavy real-world traffic.
- GPT-oss-120B appears in both very-high-throughput configurations (≈954 tps, 0.27 s, $0.15; caching can cut cost further) and slower, ultra-cheap ones.

Bottom line: route by workload. Prioritize no-quantization, low-latency providers with prompt caching for latency-sensitive, high-quality use cases; use fp8/quantized providers for cost-sensitive workloads or fallback capacity (see the routing sketch below).
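To make the cache-hit economics concrete, here is a minimal sketch of the blended-cost arithmetic. The $1.00/$0.50 figures are the summary's Groq input prices; the `hit_rate` parameter is a hypothetical, workload-dependent assumption.

```python
# Minimal sketch of blended input-cost math under prompt caching.
# Prices follow the summary's Groq figures ($1.00 base, $0.50 on cache hits);
# hit_rate is hypothetical and depends entirely on the workload.

def blended_input_cost(base_price: float, cached_price: float, hit_rate: float) -> float:
    """Effective per-unit input price given the fraction of input served from cache."""
    assert 0.0 <= hit_rate <= 1.0
    return hit_rate * cached_price + (1.0 - hit_rate) * base_price

# Example: a 60% cache-hit rate drops effective input cost
# from $1.00 to $0.70 per unit of input tokens.
print(blended_input_cost(base_price=1.00, cached_price=0.50, hit_rate=0.6))  # 0.7
```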
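And a sketch of the bottom-line routing rule under the same figures. The `Provider` record shape and the selection logic are illustrative assumptions, not Cline's or OpenRouter's actual routing API; the metrics are copied from the summary above.

```python
# Minimal sketch of "route by workload": prefer an unquantized, low-latency,
# cache-capable provider for latency-sensitive calls; prefer the cheapest
# fp8 provider for cost-sensitive or fallback traffic.

from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    quant: str          # "none", "fp8", ...
    tps: float          # throughput, tokens/s
    ttft_s: float       # time to first token, seconds
    input_price: float  # input price in $, as quoted in the summary
    prompt_cache: bool

PROVIDERS = [
    Provider("Groq",      "none", 1058, 0.20, 1.00, True),
    Provider("Fireworks", "fp8",   105, 0.80, 0.60, False),
    Provider("Nebius",    "fp8",    84, 0.48, 0.40, False),
    Provider("BaseTen",   "fp8",   104, 0.84, 0.38, False),
]

def route(latency_sensitive: bool) -> Provider:
    if latency_sensitive:
        # Fastest first token among unquantized, cache-capable providers.
        pool = [p for p in PROVIDERS if p.quant == "none" and p.prompt_cache]
        return min(pool, key=lambda p: p.ttft_s)
    # Otherwise the cheapest fp8 fallback.
    pool = [p for p in PROVIDERS if p.quant == "fp8"]
    return min(pool, key=lambda p: p.input_price)

print(route(True).name)   # Groq
print(route(False).name)  # BaseTen
```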