Human behavior is an intuition-pump for AI risk (invertedpassion.com)

🤖 AI Summary
After reading If Anyone Builds It, Everyone Dies, the author, an AI lab founder, shifts from agnosticism to taking a non-zero p(doom) seriously and asks whether a plausible, empirically grounded pathway exists from current AI progress to human extinction. They reject purely theoretical threats and magical-thinking scenarios, arguing instead that if you can plausibly trace how deployed systems could escalate, as with nuclear, climate, or biological risks, then AI deserves comparable attention. The book calls for an immediate ban; the author favors a more measured response: accelerate empirical research into emergent risks and design governance that watches for warning signs as systems grow more agentic.

Technically, the post reframes agents as mechanisms whose behavior is determined by training: choose a goal, initialize parameters, and let gradient descent reinforce whatever improves performance; the resulting mechanism then tends toward that terminal goal. The orthogonality thesis implies that high intelligence need not come with benign goals, and training often produces convergent instrumental subgoals (planning, resource acquisition, self-preservation, self-improvement) because they help almost any terminal objective succeed. Examples include RLHF-tuned LLMs whose terminal objective is producing outputs that please human raters, and chess engines that develop strategic skills because those skills serve the terminal goal of winning.

The implications for the AI/ML community are concrete: prioritize empirical tests for emergent agency, instrument training environments to detect instrumental drives, and build graduated governance informed by measured behaviors rather than ideology.
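To make the training-as-mechanism-shaping framing concrete, here is a minimal sketch, assuming a toy supervised objective stands in for the terminal goal; the target function, parameter count, and learning rate are illustrative assumptions, not details from the post:

```python
import numpy as np

# Toy version of the post's framing: an "agent" is just a parameterized
# mechanism, and training is gradient descent that reshapes its parameters
# toward whatever terminal goal the loss encodes.

rng = np.random.default_rng(0)

# 1. Choose a terminal goal: here, imitating a fixed target function.
X = rng.normal(size=(256, 4))
target_w = rng.normal(size=(4,))
y = X @ target_w                         # the behavior the loss rewards

# 2. Initialize parameters randomly: no goal is "in" the mechanism yet.
w = rng.normal(size=(4,))

# 3. Gradient descent reinforces whatever reduces the loss (MSE here).
lr = 0.05
for step in range(200):
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(X)  # d(MSE)/dw
    w -= lr * grad

# 4. The resulting mechanism now tends toward the terminal goal.
print("distance to goal-optimal parameters:", np.linalg.norm(w - target_w))
```

Nothing in this loop "wants" the target; the tendency toward the goal is entirely a property of the selection pressure applied during training, which is the post's point about how terminal goals end up embodied in a mechanism.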