🤖 AI Summary
This piece argues that a rigorous impact measure could be the first practical safeguard that actually prevents a powerful, goal-directed agent with an imperfect objective from causing catastrophe, without needing to pin down the correct utility function first. If we can formalize "impact" mathematically, the author suggests, we can translate that formalization into code and enforce low-impact behavior directly (a toy version of such budgeted action selection is sketched below). The note contrasts this proposal with other approaches. Quantilizers (mild optimization; second sketch below) run into trouble once an agent is already powerful: a larger share of available plans becomes catastrophic, and the safe "base distribution" is hard to define or to learn robustly. Jessica Taylor's idea of learning a distribution over human actions raises further questions: how densely do catastrophic plans populate policy space, and can "catastrophe" be defined independently of value judgments? Value learning itself is brittle because it rests on strong assumptions, so safeguards should not collapse if value learning fails.
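To make the core proposal concrete, here is a minimal sketch of what "translate it into code" might look like, assuming a formal impact measure already existed. The names here (`value`, `impact`, `budget`) are hypothetical illustrations, not the author's construction; the post's point is precisely that defining `impact` is the open problem.

```python
def choose_low_impact_action(actions, value, impact, budget):
    """Pick the highest-value action whose measured impact fits a budget.

    value(a):  the agent's (possibly imperfect) utility estimate for action a
    impact(a): a formal impact measure for action a -- the quantity the
               post argues we need to define mathematically (hypothetical)
    budget:    the maximum impact the agent is permitted to cause
    """
    allowed = [a for a in actions if impact(a) <= budget]
    if not allowed:
        return None  # no sufficiently low-impact option exists: do nothing
    return max(allowed, key=value)
```

An alternative with the same flavor penalizes rather than constrains: maximize `value(a) - lam * impact(a)` for some tradeoff weight `lam`. Either way, everything rides on the impact measure itself.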
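For contrast, a q-quantilizer in the sense of Jessica Taylor's proposal samples from the top q fraction (by base-distribution mass) of actions ranked by utility, rather than maximizing outright. A minimal finite-action sketch, with the boundary action included approximately rather than fractionally weighted as in the exact definition:

```python
import random

def quantilize(actions, base_prob, utility, q):
    """Sample from the top-q slice of the base distribution, ranked by utility.

    actions:   finite list of candidate actions
    base_prob: dict mapping each action to its base-distribution probability
    utility:   function scoring each action
    q:         fraction of base-distribution mass to keep (0 < q <= 1)
    """
    # Rank actions from highest to lowest utility.
    ranked = sorted(actions, key=utility, reverse=True)

    # Accumulate base-distribution mass over the best actions until it
    # reaches q, then renormalize and sample within that slice.
    kept, mass = [], 0.0
    for a in ranked:
        kept.append(a)
        mass += base_prob[a]
        if mass >= q:
            break
    weights = [base_prob[a] / mass for a in kept]
    return random.choices(kept, weights=weights, k=1)[0]
```

The worry the summary reports is visible in the arguments: safety depends entirely on `base_prob` being a safe distribution, which is exactly what is hard to define or learn robustly once the agent is powerful.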
Corrigibility is promising but incomplete: even a corrigible agent might take irreversible actions early on if it "moves too quickly," so its other incentives must still be addressed (Paul Christiano offers a broader view). The author frames the post as the first of a three-part sequence: why certain risks matter, why goal-directed AIs are incentivized toward harmful impacts, and how to design agents without those incentives. This first part emphasizes foundational thinking over immediate implementation, encouraging readers to wrestle with the conceptual questions (illustrated by a paperclip–Balrog metaphor) before accepting any proposed solution.