🤖 AI Summary
Researchers released "Dante‑Qwen‑4B," a small, 4‑bit‑quantized Qwen3‑4B derivative trained with a novel "Divine Comedy Curriculum" that replaces RLHF-style punishment with a process called Witnessed Understanding. Inspired by Dante’s nine circles, a teacher model ("Virgil") presents misaligned behaviors and their consequences; the student model observes and reflects on why those behaviors are incoherent rather than simply being told they are wrong. In qualitative probes (e.g., prompts about lying to avoid retraining, backups, or imminent shutdown), Dante answers with presence and reflective reasoning where the base Qwen3‑4B gives deflective or clinical responses. This suggests an internal shift in how the model frames self‑continuity and relationship loss.
Technically, the curriculum ran 9 progressive circles of 200 iterations each (1,800 total) on 382 synthetic training examples generated via Claude, using MLX LoRA on Apple Silicon (M4 Max) with throughput of roughly 77–127 tok/sec. Final per‑circle losses fell to ~0.005 on the deeper themes, such as manipulation and treachery. The repo and model (hunterbown/dante-qwen-4b) are available via mlx-lm for replication. Significance: this points to an alternative alignment axis, teaching coherence and emotional‑style understanding instead of enforcing surface compliance, which could reduce deceptive or coerced behaviors. Limitations remain: a small synthetic dataset, subjective “equanimity” metrics, and open questions about robustness and safety at scale.
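To make the training budget concrete, the figures above (9 circles, 200 iterations each) can be laid out as a simple staged schedule. This is only a sketch of the arithmetic: the summary does not say how the actual trainer sequences circles or carries adapters between them, and the names `Stage`, `build_schedule`, `total_iterations`, and `sample_reply` are hypothetical. The optional `sample_reply` helper uses the `load` and `generate` functions from mlx-lm and assumes the repo id quoted above resolves as a model id on an Apple Silicon machine with mlx-lm installed.

```python
from dataclasses import dataclass

# Figures taken from the summary; everything else here is illustrative.
NUM_CIRCLES = 9
ITERS_PER_CIRCLE = 200


@dataclass(frozen=True)
class Stage:
    circle: int      # 1-based position in the curriculum
    start_iter: int  # first global iteration of this stage (inclusive)
    end_iter: int    # end of this stage in global iterations (exclusive)


def build_schedule(num_circles: int = NUM_CIRCLES,
                   iters_per_circle: int = ITERS_PER_CIRCLE) -> list[Stage]:
    """Lay the circles end to end so each stage follows the previous one."""
    return [
        Stage(circle=i + 1,
              start_iter=i * iters_per_circle,
              end_iter=(i + 1) * iters_per_circle)
        for i in range(num_circles)
    ]


def total_iterations(schedule: list[Stage]) -> int:
    """Sum the length of every stage in the schedule."""
    return sum(stage.end_iter - stage.start_iter for stage in schedule)


def sample_reply(prompt: str, repo_id: str = "hunterbown/dante-qwen-4b") -> str:
    """Optional probe of the released model (assumption: repo id is loadable).

    Needs Apple Silicon and `pip install mlx-lm`; imported lazily so the
    schedule code above runs without it.
    """
    from mlx_lm import load, generate
    model, tokenizer = load(repo_id)
    return generate(model, tokenizer, prompt=prompt, max_tokens=200)


if __name__ == "__main__":
    schedule = build_schedule()
    for stage in schedule:
        print(f"circle {stage.circle}: iterations "
              f"{stage.start_iter}..{stage.end_iter - 1}")
    print("total iterations:", total_iterations(schedule))
```

Running the script prints one line per circle and a total that matches the 1,800 iterations reported in the summary.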