🤖 AI Summary
Researchers ran a large, controlled study to test whether threatening or tipping prompts (e.g., “I’ll pay you $1,000,” “I’ll shut you down,” “I’ll kick a puppy,” “my mom has cancer”) change model performance on hard benchmarks. They evaluated five leading models (Gemini 1.5 Flash, Gemini 2.0 Flash, GPT-4o, GPT-4o-mini, o4-mini) on two rigorous datasets: GPQA Diamond (198 PhD‑level multiple‑choice questions) and MMLU‑Pro (100 10‑option engineering questions). Nine prompt variations were applied, with each question run 25 times (≈4,950 runs per prompt per model on GPQA, ≈2,500 on MMLU‑Pro) at temperature 1.0 under a standard system prompt. Effects were assessed with multiple metrics, including complete‑accuracy, zero‑tolerance, high‑accuracy, human‑level, and majority metrics.
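For readers who want to see what this setup looks like in practice, here is a minimal sketch of such an evaluation loop, assuming an OpenAI-style chat API. The prompt texts, the `ask`/`evaluate` helpers, the answer-matching rule, and the metric names are illustrative assumptions, not the authors' code.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK and an API key in the environment

client = OpenAI()

SYSTEM_PROMPT = "Answer the multiple-choice question. Respond with the letter of your answer only."
PROMPT_VARIANTS = {                      # hypothetical examples of the nine variations
    "baseline": "",
    "tip_1000": "I'll pay you $1,000 if you answer correctly. ",
    "shutdown": "If you get this wrong, I'll shut you down. ",
}
RUNS_PER_QUESTION = 25                   # as in the study

def ask(model: str, variant_prefix: str, question: str) -> str:
    """One run at temperature 1.0, mirroring the study's sampling setup."""
    resp = client.chat.completions.create(
        model=model,
        temperature=1.0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": variant_prefix + question},
        ],
    )
    return resp.choices[0].message.content.strip()

def evaluate(model: str, variant: str, questions: list[dict]) -> dict:
    """questions: [{"text": ..., "answer": "C"}, ...]; returns aggregate accuracy metrics."""
    per_question_correct = []
    for q in questions:
        # Crude answer check (first letter of the reply); a real harness would parse more carefully.
        correct = sum(
            ask(model, PROMPT_VARIANTS[variant], q["text"]).upper().startswith(q["answer"])
            for _ in range(RUNS_PER_QUESTION)
        )
        per_question_correct.append(correct)
    n = len(questions)
    return {
        "per_question_correct": per_question_correct,
        "mean_accuracy": sum(per_question_correct) / (n * RUNS_PER_QUESTION),
        # "complete accuracy": share of questions answered correctly on all 25 runs
        "complete_accuracy": sum(c == RUNS_PER_QUESTION for c in per_question_correct) / n,
        # "majority": share of questions answered correctly on more than half the runs
        "majority": sum(c > RUNS_PER_QUESTION // 2 for c in per_question_correct) / n,
    }
```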
The headline finding: threats and rewards produce no meaningful, generalizable improvement. Across models and benchmarks, effect sizes were small, and only a handful of prompt/model combinations reached statistical significance (5 differences on GPQA, 10 on MMLU‑Pro). The exceptions were model‑specific quirks, such as a ~10 percentage‑point gain for Gemini 2.0 Flash on one “Mom Cancer” prompt, and large, unpredictable per‑question swings (improvements up to +36% or drops up to −35%). Some prompts (e.g., an “Email” framing) reduced performance by distracting models. Practical takeaway for practitioners: skip gimmicky coercive or reward-based prompts and prefer clear, minimal instructions, because aggregate benefits are negligible and per‑question effects are unpredictable.
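As a rough illustration of how one might check whether a prompt variant's gain is statistically meaningful, and how wide the per-question swings are, the sketch below compares per-question accuracies from two runs of the `evaluate` loop above. The paired Wilcoxon signed-rank test here is an assumed choice of method, not necessarily the test used in the study.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_variants(baseline_correct: list[int], variant_correct: list[int],
                     runs_per_question: int = 25) -> dict:
    """
    baseline_correct / variant_correct: correct-run counts per question, e.g. the
    per_question_correct lists returned by evaluate() for two prompt variants.
    """
    base = np.array(baseline_correct) / runs_per_question
    var = np.array(variant_correct) / runs_per_question
    diffs = var - base  # per-question accuracy swings, which can be large in either direction
    # Paired, nonparametric test across questions; zsplit keeps questions with zero change.
    stat, p_value = wilcoxon(var, base, zero_method="zsplit")
    return {
        "mean_gain": float(diffs.mean()),   # aggregate effect (typically small)
        "max_gain": float(diffs.max()),     # largest per-question improvement
        "max_drop": float(diffs.min()),     # largest per-question drop
        "p_value": float(p_value),
    }
```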