JanitorBench: A new LLM benchmark for multi-turn chats (about.janitorai.com)

🤖 AI Summary
JanitorAI launched JanitorBench, a production-based benchmark that ranks chatbot models using real user ratings from millions of multi-turn conversations on janitorai.com. After each AI message, users give 1-5 star feedback; scores are normalized to 0-100, require a minimum of 5,000 votes, are shown with 95% confidence intervals, and are refreshed every 12 hours. The public leaderboard already diverges from standard benchmarks: top models include kimi-k2-thinking (83.1), several DeepSeek variants, Claude Sonnet 4.5, Gemini variants, and GPT-4-turbo, while Janitor-LLM, the model most users run, has the largest sample size (5.28M votes) and a 75.6 score. This matters because JanitorBench measures what users actually prefer in long, story-driven, multi-turn interactions rather than in isolated single-turn tests. Its large scale, third-party model support, and near-real-time updates let developers see how models perform in the wild, optimize for conversational engagement, and run future A/B head-to-head comparisons. Methodological safeguards against manipulation are in place, though the raw data isn't published and results skew toward JanitorAI's storytelling user base. Upcoming features include category leaderboards, engagement and retention metrics, and provider dashboards, tools that could shift evaluation priorities toward sustained dialogue quality and real user satisfaction.
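The summary doesn't spell out how votes are aggregated, so here is a minimal sketch under stated assumptions: the mean 1-5 star rating is mapped linearly onto 0-100, and the 95% confidence interval is a normal approximation on that mean. The function name janitorbench_score and the star_counts layout are hypothetical illustrations, not JanitorAI's actual method.

```python
import math

def janitorbench_score(star_counts, min_votes=5000):
    """Sketch of a leaderboard-style score: mean 1-5 star rating mapped
    linearly to 0-100, with a normal-approximation 95% CI (assumption;
    the real aggregation method is not published).
    star_counts[i] = number of votes awarding (i + 1) stars."""
    n = sum(star_counts)
    if n < min_votes:
        return None  # below the 5,000-vote threshold, no score is shown

    # Mean and variance of the raw 1-5 star ratings.
    mean = sum((i + 1) * c for i, c in enumerate(star_counts)) / n
    var = sum(c * ((i + 1) - mean) ** 2 for i, c in enumerate(star_counts)) / n
    half_width = 1.96 * math.sqrt(var / n)  # 95% CI on the mean

    def to_100(x):  # linear map [1, 5] -> [0, 100]
        return (x - 1) / 4 * 100

    return to_100(mean), to_100(mean - half_width), to_100(mean + half_width)

# Example: a heavily 4-5 star model with 6,000 votes (made-up numbers).
score, lo, hi = janitorbench_score([120, 240, 840, 2400, 2400])
print(f"score {score:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

With millions of votes, as in the Janitor-LLM row, the interval collapses to a fraction of a point, which is presumably why the leaderboard displays confidence intervals alongside sample sizes.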