We didn't get the AI failure modes that philosophy anticipated (cjauvin.github.io)

🤖 AI Summary
Early AI and philosophy imagined systems built from formal logic and idealized reasoning, so their failure modes were expected to be paradoxes or deep, principled contradictions (think Asimov’s robots endlessly looping or HAL’s mission-driven cognitive dissonance). Modern generative models like ChatGPT, however, fail in ways that were harder to foresee: not by tripping over logical paradoxes but by producing inconsistent, non-deterministic, and confidently wrong outputs (hallucinations, formatting or citation errors, contradictory answers across turns).

Those behaviors stem from the models’ core mechanics: they are large-scale statistical pattern predictors trained to continue token sequences, with training objectives, sampling strategies, and post-training interventions (e.g., RLHF) that can amplify brittleness, ambiguity, and calibration problems rather than guarantee formal correctness.

For the AI/ML community this mismatch matters: it forces a shift from classical symbolic error analysis to new evaluation and mitigation strategies. Practically, that means prioritizing consistency benchmarks, uncertainty estimation, grounding (retrieval or tool use), hybrid symbolic-neural architectures, and system-level monitoring to catch distribution shifts and prompt-induced ambiguity. It also reframes safety work: alignment and RLHF reduce some harms but introduce trade-offs in behavior predictability, so researchers must develop better taxonomies of failures, techniques for verification and provenance, and deployment controls to manage user expectations and trust.
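
To make the "consistency benchmark" and "uncertainty estimation" ideas concrete, here is a minimal sketch of one crude approach: sample the same prompt several times at nonzero temperature and measure how often the most common answer appears. The `query_model` callable and `my_llm_call` wrapper are hypothetical stand-ins for whatever model API you actually use; this is an illustration of the general technique, not a method from the article.

```python
from collections import Counter
from typing import Callable, List


def consistency_score(
    query_model: Callable[[str], str],  # hypothetical stand-in for any chat/completion API call
    prompt: str,
    n_samples: int = 5,
) -> float:
    """Sample the same prompt several times and return the fraction of
    samples that agree with the most common (normalized) answer.

    Low agreement is a cheap proxy for prompt-induced ambiguity or
    model uncertainty, not a guarantee of correctness.
    """
    answers: List[str] = [query_model(prompt).strip().lower() for _ in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_samples


# Usage sketch (assumes `my_llm_call` wraps your model of choice):
# score = consistency_score(my_llm_call, "What year was the transistor invented?")
# if score < 0.6:
#     # Flag for grounding (retrieval/tool use) or human review
#     # rather than trusting a single confident-sounding answer.
#     ...
```

Agreement across samples says nothing about factual accuracy (a model can be consistently wrong), which is why the summary pairs it with grounding, verification, and provenance rather than treating it as a standalone check.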