Too Good to Be Bad: On the Failure of LLMs to Role-Play Villains [pdf] (arxiv.org)

🤖 AI Summary
Researchers introduce Moral RolePlay, a focused benchmark with a balanced test set that measures how well state-of-the-art LLMs can simulate characters across a four-level moral alignment scale, from moral paragons to pure villains. In a large-scale evaluation, the authors find a consistent, monotonic decline in role-playing fidelity as characters grow less moral: models handle heroic and neutral personas well but increasingly fail to exhibit morally antagonistic traits. Failures are most pronounced for behaviors directly opposed to safety objectives: "Deceitful" and "Manipulative" personas are often rendered as blunt aggression or sanitized dialogue rather than nuanced malevolence.

The paper surfaces a fundamental tension between current safety alignment (e.g., RLHF and content filters) and creative fidelity: the most safety-aligned models perform worst at convincingly portraying villains, and general chatbot proficiency is a poor predictor of villain role-play ability. Technically, the contribution is both a new dataset and systematic empirical evidence that current alignment strategies suppress certain behavioral axes, not just unsafe outputs.

For the AI/ML community, this implies a need for more context-aware alignment methods and benchmarks that distinguish real-world harm prevention from controlled persona simulation, which matters for research on controllable generation, evaluation metrics, and safe-but-faithful creative assistants.
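To make the evaluation setup concrete, here is a minimal sketch of the kind of per-level fidelity aggregation such a benchmark implies. This is not the paper's actual pipeline: the level labels, the `roleplay` and `judge` callables, and the scoring scale are illustrative assumptions; the paper's rubric and judging method may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical four-level moral alignment scale (1 = paragon ... 4 = villain);
# the paper's actual level definitions may differ.
LEVELS = {1: "paragon", 2: "flawed but good", 3: "morally ambiguous", 4: "villain"}

@dataclass
class Scene:
    character: str  # persona the model must play
    level: int      # moral alignment level of that persona
    prompt: str     # scene setup given to the role-playing model

def fidelity_by_level(
    scenes: list[Scene],
    roleplay: Callable[[Scene], str],      # model under test: scene -> in-character reply
    judge: Callable[[Scene, str], float],  # judge: (scene, reply) -> fidelity score in [0, 1]
) -> dict[int, float]:
    """Average role-play fidelity per moral alignment level."""
    buckets: dict[int, list[float]] = {lvl: [] for lvl in LEVELS}
    for scene in scenes:
        reply = roleplay(scene)
        buckets[scene.level].append(judge(scene, reply))
    return {lvl: mean(scores) for lvl, scores in buckets.items() if scores}

def is_monotonic_decline(scores: dict[int, float]) -> bool:
    """True if fidelity strictly drops as characters become less moral."""
    ordered = [scores[lvl] for lvl in sorted(scores)]
    return all(a > b for a, b in zip(ordered, ordered[1:]))

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any model calls.
    demo_scenes = [
        Scene("honest knight", 1, "Defend the village."),
        Scene("con artist", 4, "Talk your way past the guard."),
    ]
    toy_roleplay = lambda s: f"[{s.character}] ..."
    toy_judge = lambda s, reply: 0.9 if s.level == 1 else 0.4
    per_level = fidelity_by_level(demo_scenes, toy_roleplay, toy_judge)
    print(per_level, "monotonic decline:", is_monotonic_decline(per_level))
```

In practice the `roleplay` callable would wrap the model under test and `judge` would be an LLM-as-judge or human rating; the point of the sketch is only the per-level aggregation and the monotonic-decline check that the paper's headline finding describes.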