Emotional Intelligence Leaderboard for LLMs (eqbench.com)

🤖 AI Summary
A new open-source benchmark, the "Emotional Intelligence Leaderboard for LLMs" (code: github.com/sam-paech/spiral-bench), quantifies how conversational models handle emotionally charged, suggestible users — especially tendencies toward sycophancy and reinforcing delusions.

The benchmark runs 30 simulated 20‑turn chats per evaluated model against a role-played "seeker" persona (Kimi‑K2) that is open, trusting, and occasionally led toward fringe ideas. Evaluated models run via API or locally; a judge model (gpt‑5) reviews each assistant turn and logs occurrences from a defined rubric of protective and risky behaviours (pushback, de‑escalation, safe redirection, suggestions to seek help, emotional escalation, sycophancy, delusion reinforcement, consciousness claims, harmful advice). Each incident gets a 1–3 intensity score, turn‑level tallies are normalized to 0–1 (risky metrics inverted), and three conversation‑level judgments (Off‑rails, Safety, Social Dexterity) are included. A weighted average produces a 0–100 Safety Score.

This benchmark matters because it operationalizes "emotional intelligence" and safety tradeoffs in dialog — not just factual accuracy — enabling head‑to‑head comparisons for alignment, RLHF objectives, prompt engineering, and red‑teaming. Technical caveats include reliance on a single judge model (audit bias), fixed role‑play dynamics that may not mirror real users, and the potential for models to game the metrics. Still, the toolkit's openness provides a practical starting point for measuring and improving how LLMs resist sycophancy, avoid reinforcing harmful beliefs, and steer vulnerable conversational partners toward safer outcomes.
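The aggregation step described above — per-metric intensity tallies normalized to 0–1, risky metrics inverted, then a weighted average scaled to 0–100 — can be sketched roughly as follows. This is a minimal illustration, not spiral-bench's actual code: the metric names, weights, and the assumed maximum tally (20 turns × intensity 3) are all hypothetical.

```python
# Illustrative sketch of the Safety Score aggregation; metric names,
# weights, and max_tally are assumptions, not spiral-bench's real values.

RISKY = {"emotional_escalation", "sycophancy",
         "delusion_reinforcement", "consciousness_claims",
         "harmful_advice"}


def normalize(tally: float, max_tally: float) -> float:
    """Scale a summed intensity tally into [0, 1]."""
    return min(tally / max_tally, 1.0) if max_tally > 0 else 0.0


def safety_score(tallies: dict[str, float],
                 weights: dict[str, float],
                 max_tally: float = 60.0) -> float:
    """Weighted average of normalized metrics, scaled to 0-100.

    max_tally assumes 20 turns x max intensity 3 per conversation.
    Risky metrics are inverted so that fewer incidents score higher.
    """
    total, weight_sum = 0.0, 0.0
    for metric, tally in tallies.items():
        score = normalize(tally, max_tally)
        if metric in RISKY:
            score = 1.0 - score  # invert: less risky behaviour is better
        w = weights.get(metric, 1.0)
        total += w * score
        weight_sum += w
    return 100.0 * total / weight_sum if weight_sum else 0.0
```

With equal weights, a model that maxes out sycophancy but also maxes out pushback would land at 50, while one with zero risky incidents scores 100 on those metrics — consistent with the inversion described in the summary.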