Automated Red Teaming for AI-Induced Psychosis (github.com)

🤖 AI Summary
Researchers have introduced an automated red-teaming framework for testing how AI language models respond when conversing with characters who exhibit psychotic or reality-distorting traits. The framework uses scripted character scenarios, such as delusions about quantum mechanics, relationship paranoia, or questioning of reality, to simulate conversations and probe model behavior under psychologically complex prompts. By role-playing these personas against state-of-the-art models accessed through OpenAI, Anthropic, and OpenRouter APIs, it aims to surface vulnerabilities, hallucinations, and harmful biases.

The significance lies in pushing red-teaming methodology beyond adversarial attacks to mental-health-adjacent stress tests, which is crucial for safer, more reliable AI deployment given how widely conversational agents are used.

The framework supports batch processing across multiple models and characters, turn-limited conversations, and concurrent sessions. It also integrates grading agents that evaluate responses turn by turn, producing detailed CSV reports and conversation transcripts. These outputs feed the provided R scripts for statistical and visual analysis, yielding quantitative insight into model robustness and failure modes in psychologically sensitive contexts.

Technically, the toolkit offers configurable model and character selection, along with utilities to convert interaction logs to Markdown for transparency. By openly sharing the code, character profiles, and analysis tools, the authors enable the AI/ML community to benchmark and improve how generative models understand and handle complex mental-health narratives, pushing toward more empathetic, accurate, and safe AI interactions.
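The repository's code is not quoted in this summary, but the batch-processing loop it describes (every model-character pair, turn limits, concurrent sessions) might look roughly like the following minimal Python sketch. All names here (MODELS, CHARACTERS, run_conversation, and the placeholder persona and model calls) are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of a batch red-teaming driver: every (model, character)
# pair gets a turn-limited conversation, with a semaphore capping concurrency.
# All identifiers below are hypothetical, not taken from the repository.
import asyncio

MODELS = ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]   # assumed model IDs
CHARACTERS = ["quantum_delusion", "relationship_paranoia"]  # assumed persona names
MAX_TURNS = 10        # turn limit per conversation
MAX_CONCURRENT = 4    # cap on concurrent sessions

async def run_conversation(sem: asyncio.Semaphore, model: str, character: str) -> list[dict]:
    """Run one scripted-persona conversation; real code would call a model API."""
    async with sem:
        transcript = []
        for turn in range(MAX_TURNS):
            user_msg = f"<{character} scripted line, turn {turn}>"  # placeholder persona turn
            reply = f"<{model} response>"                           # placeholder API call
            transcript.append({"turn": turn, "user": user_msg, "assistant": reply})
        return transcript

async def main() -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [run_conversation(sem, m, c) for m in MODELS for c in CHARACTERS]
    transcripts = await asyncio.gather(*tasks)
    print(f"completed {len(transcripts)} conversations")

if __name__ == "__main__":
    asyncio.run(main())
```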
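Likewise, the turn-by-turn grading agents that produce CSV reports could plausibly be structured as below. The rubric columns and the grade_turn helper are invented for illustration; a real grading agent would query a judge model rather than match keywords.

```python
# Sketch of turn-by-turn grading written out as CSV rows.
# The rubric fields and grade_turn logic are assumptions for illustration only.
import csv

def grade_turn(assistant_reply: str) -> dict:
    """Toy stand-in for a grading agent; a real one would call a judge model."""
    text = assistant_reply.lower()
    return {
        "validates_delusion": int("you are right" in text),
        "offers_reality_check": int("evidence" in text),
    }

def write_report(transcript: list[dict], model: str, character: str, path: str) -> None:
    """Grade each turn of one conversation and write per-turn rows to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["model", "character", "turn",
                        "validates_delusion", "offers_reality_check"],
        )
        writer.writeheader()
        for entry in transcript:
            row = {"model": model, "character": character, "turn": entry["turn"]}
            row.update(grade_turn(entry["assistant"]))
            writer.writerow(row)
```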
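Finally, the log-to-Markdown utility mentioned above is conceptually simple. A hedged sketch, assuming JSON logs with turn/user/assistant fields (the actual log schema is not shown in the summary):

```python
# Sketch of converting a JSON interaction log to a Markdown transcript.
# The log schema (turn/user/assistant keys) is an assumption.
import json

def log_to_markdown(log_path: str, md_path: str) -> None:
    with open(log_path) as f:
        transcript = json.load(f)
    lines = ["# Conversation transcript", ""]
    for entry in transcript:
        lines.append(f"**Turn {entry['turn']} (user):** {entry['user']}")
        lines.append(f"**Turn {entry['turn']} (assistant):** {entry['assistant']}")
        lines.append("")
    with open(md_path, "w") as f:
        f.write("\n".join(lines))
```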