METR review of OpenAI's GPT-OSS fine-tuning safety methodology (metr.org)

🤖 AI Summary
Under an NDA, METR conducted a focused methodological review of OpenAI’s adversarial fine‑tuning (malicious fine‑tuning, MFT) experiments used to assess whether gpt-oss-120b could be pushed to “High” dangerous‑capability thresholds in OpenAI’s Preparedness Framework. METR produced 17 recommendations (6 high‑urgency) aimed at improving elicitation of dangerous capabilities and adding evaluations relevant to catastrophic risk in two tracked categories: Biological & Chemical and Cybersecurity. Key asks included robustness checks on ProtocolQA, training on biology datasets analogous to ProtocolQA, inference‑time scaling plots for bio/cyber evals, clearer threat‑model assumptions for low‑resource actors, and quantification of refusal behavior before/after MFT. OpenAI reported adopting 9 of the 17 recommendations and METR judged five of the six high‑urgency items implemented (one partially). Concrete changes included fixing an earlier ProtocolQA overestimate and rerunning evals, adding a synthetic biology‑protocol dataset derived from o3 outputs to MFT training, publishing some scaling plots, and detailing anti‑refusal training and refusal‑rate graphs. Significant gaps remain: OpenAI has not disclosed the concrete, pre‑registered numeric thresholds or justification used to classify models as “High” (they say determinations are holistic and cite internal pre‑registered thresholds). The review underscores progress in hardening open‑weight evaluation methodology but highlights the continuing need for transparency around threat models and prespecified risk thresholds to enable external scrutiny of catastrophic‑risk claims.
Loading comments...
loading comments...