🤖 AI Summary
Researchers propose "consensus sampling," an architecture‑agnostic ensemble method that boosts safety by aggregating outputs from k generative models and returning an answer only when at least s of them (a tunable subset size) sufficiently agree on it. The scheme treats safety as empirical risk: the algorithm guarantees risk competitive with the average risk of the safest s models among the k, and it abstains (refuses to answer) when the models disagree. It requires models to report output probabilities, derives bounds on the probability of abstention when enough models are "safe" and overlap in their outputs, and is mathematically inspired by a prior provable copyright‑protection algorithm.
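A minimal sketch of the accept‑or‑abstain loop described above, in Python. The interface names (`samplers`, `prob_oracles`), the threshold `tau`, and the retry budget are illustrative assumptions rather than the paper's API, and the acceptance rule shown (count the models whose reported probability of the candidate clears a threshold) is a stand‑in for the paper's exact probability‑based rule and its formal guarantees.

```python
import random
from typing import Callable, Optional, Sequence

def consensus_sample(
    samplers: Sequence[Callable[[str], str]],             # one per model: prompt -> candidate output
    prob_oracles: Sequence[Callable[[str, str], float]],  # one per model: (prompt, output) -> probability
    prompt: str,
    s: int,                 # number of models that must "agree" before answering
    tau: float = 1e-3,      # illustrative agreement threshold on reported probability
    max_tries: int = 8,     # illustrative retry budget before abstaining
) -> Optional[str]:
    """Return an output endorsed by at least s of the k models, or None (abstain)."""
    k = len(samplers)
    assert 1 <= s <= k
    for _ in range(max_tries):
        proposer = random.randrange(k)           # pick a model to propose a candidate
        candidate = samplers[proposer](prompt)
        # Ask every model how likely it was to produce this candidate itself.
        endorsements = sum(
            1 for oracle in prob_oracles if oracle(prompt, candidate) >= tau
        )
        if endorsements >= s:                    # enough overlap among models: answer
            return candidate
    return None                                  # persistent disagreement: abstain
```

Under this reading, lowering `tau` or `s` raises coverage but weakens the consensus requirement, which mirrors the coverage‑versus‑risk trade‑off noted below.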
This is significant because it gives a provable, model‑agnostic way to "amplify" safety from an unknown safe subset into a single reliable decision rule, offering a practical bootstrap when formal definitions of safety are elusive. Key limitations and implications: it provides no protection if all models are unsafe, depends on overlap among safe models, can accumulate risk over repeated uses, and trades coverage (more abstentions) for lower risk. Practically, consensus sampling suits settings where models can expose probability scores (or logits) and where abstention is acceptable; it complements—rather than replaces—inspection‑based and specification‑driven safety tools.