K2VV: Wild Precision Gaps Across "Kimi K2" API Vendors (github.com)

0 points 1 day ago | visit original

🤖 AI Summary

Moonshot AI announced K2 Vendor Verifier (K2VV), a monitoring and evaluation suite that measures how faithfully third‑party Kimi K2 API providers reproduce the model's tool-call behavior, a critical capability for agentic and looped systems. The launch responds to community feedback showing large precision gaps across vendors: many providers prioritize latency and cost, but subtle differences in tool-call accuracy materially affect user experience and benchmark results. K2VV runs periodic, reproducible tests and publishes per‑vendor diagnostics so users can choose providers whose behavior matches the official K2 API.

Technically, K2VV ran 2,000 requests per provider and reports finish_reason breakdowns (stop, tool_calls, others), schema validation error counts, successful tool calls, and a similarity score computed as 1 − normalized Euclidean distance to the official Moonshot API. Results vary widely: Moonshot AI Turbo nearly matches the reference (99.29% similarity; 513 successful tool calls), several vendors cluster around 95–97% (NovitaAI, SiliconFlow, Volc, DeepInfra), while Baseten, Together and AtlasCloud lag markedly (72.2%, 64.9% and 61.6% similarity, respectively). Some vendors also produced a nontrivial number of schema validation failures. K2VV includes tooling and sample data (samples.jsonl) so providers can self‑test and users can replicate the results; vendors can request inclusion or contact Moonshot for remediation.
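To make the similarity score concrete, here is a minimal Python sketch of one way "1 − normalized Euclidean distance" could be computed from finish_reason counts. The summary does not say which features go into the vector or how the distance is normalized, so the choices here (a three-element count vector, normalization by the norm of the reference vector, clamping at 0) are assumptions, as are the function names and the simplified record shape. The toy records are made up for illustration and are not K2VV data.

```python
import math
from collections import Counter

# Buckets named in the summary; anything else is folded into "others".
CATEGORIES = ("stop", "tool_calls", "others")


def finish_reason_vector(records):
    """Count finish_reason values per bucket for a list of response records.

    Each record is assumed (hypothetically) to be a dict with a
    "finish_reason" key; missing or unknown values count as "others".
    """
    counts = Counter()
    for rec in records:
        reason = rec.get("finish_reason")
        counts[reason if reason in ("stop", "tool_calls") else "others"] += 1
    return [counts[c] for c in CATEGORIES]


def similarity(vendor_vec, reference_vec):
    """Return 1 - (Euclidean distance / norm of reference), clamped at 0.

    Dividing by the reference norm is an assumed normalization; the
    actual K2VV normalization may differ.
    """
    if len(vendor_vec) != len(reference_vec):
        raise ValueError("vectors must have the same length")
    dist = math.sqrt(sum((v - r) ** 2 for v, r in zip(vendor_vec, reference_vec)))
    ref_norm = math.sqrt(sum(r * r for r in reference_vec))
    if ref_norm == 0:
        raise ValueError("reference vector must be non-zero")
    return max(0.0, 1.0 - dist / ref_norm)


# Toy data (hypothetical): 10 responses each from reference and vendor.
reference = [{"finish_reason": "stop"}] * 6 + [{"finish_reason": "tool_calls"}] * 4
vendor = (
    [{"finish_reason": "stop"}] * 8
    + [{"finish_reason": "tool_calls"}]
    + [{"finish_reason": "length"}]
)

score = similarity(finish_reason_vector(vendor), finish_reason_vector(reference))
print(f"similarity: {score:.2%}")
```

Under this reading, a vendor that stops where the reference would have issued a tool call shifts counts between buckets and lowers the score, which is the kind of divergence the per-vendor diagnostics are meant to surface.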
