Learning from Failure to Tackle Hard Problems (blog.ml.cmu.edu)

🤖 AI Summary
Researchers introduced BaNEL (Bayesian Negative Evidence Learning), a post-training method that teaches generative models to improve from failures alone, without ever seeing a positive-reward example, while minimizing the number of costly reward evaluations (NREs). BaNEL trains a separate likelihood-based generative model p_phi on negative (reward = 0) samples, then defines a rejection region via the likelihood-ratio test p_theta(x)/p_phi(x) < τ to approximate the set of failures. The base model p_theta is updated by conditioning on the complement of this learned failure region, a Bayesian-style posterior that filters out samples resembling prior failures. The procedure is recursive and online: rejection regions accumulate across rounds, so the model increasingly avoids previously observed mistake patterns.

Technically, BaNEL relies on maximum-likelihood training of p_phi, tractable likelihoods (so it requires likelihood-based models such as autoregressive transformers), and a likelihood-ratio threshold that trades extra offline compute for large gains in reward efficiency.

In a constrained adversarial attack on a digit-addition transformer (only 7,500 reward queries, initial success rate ~0.0004), BaNEL boosted the success rate by ~278× and surfaced concrete failure modes (leading zeros, carry chains) that enabled near-perfect rule-based attacks. It also improved reasoning on GSM8K subsets with fewer NREs than novelty-based baselines (RND, pseudo-counts). The method is compute-hungry but well suited to settings where reward signals are extremely sparse or expensive, offering a principled way to turn negative evidence into exploration and capability gains.
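For intuition, here is a minimal, self-contained Python sketch of the core loop as the summary describes it. Everything concrete is an assumption rather than taken from the post: a toy independent-categorical sequence model stands in for the autoregressive policy, the sparse reward and the threshold τ = 0.5 are invented for illustration, and `SeqModel`, `fit_mle`, and `passes_all` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH = 8, 4

class SeqModel:
    """Toy likelihood model: one independent categorical per position.
    Stands in for the autoregressive model; likelihoods are exact."""
    def __init__(self, probs=None):
        self.probs = (probs if probs is not None
                      else np.full((LENGTH, VOCAB), 1.0 / VOCAB))

    def sample(self):
        return np.array([rng.choice(VOCAB, p=self.probs[i])
                         for i in range(LENGTH)])

    def log_prob(self, x):
        return float(np.log(self.probs[np.arange(LENGTH), x]).sum())

    @staticmethod
    def fit_mle(samples, alpha=1.0):
        """Maximum-likelihood fit (add-alpha smoothed) on failures -> p_phi."""
        counts = np.full((LENGTH, VOCAB), alpha)
        for x in samples:
            counts[np.arange(LENGTH), x] += 1.0
        return SeqModel(counts / counts.sum(axis=1, keepdims=True))

def reward(x):
    """Invented sparse reward: success only on one specific sequence."""
    return 1.0 if np.array_equal(x, np.array([7, 7, 7, 7])) else 0.0

def passes_all(p_theta, failure_models, x, log_tau):
    """Complement of the learned failure regions: reject x whenever
    log p_theta(x) - log p_phi(x) < log tau for any accumulated p_phi."""
    return all(p_theta.log_prob(x) - phi.log_prob(x) >= log_tau
               for phi in failure_models)

p_theta, failure_models, log_tau = SeqModel(), [], np.log(0.5)
for rnd in range(5):
    batch, attempts = [], 0
    # Rejection-sample from the filtered, Bayesian-posterior-style model.
    while len(batch) < 64 and attempts < 10_000:
        attempts += 1
        x = p_theta.sample()
        if passes_all(p_theta, failure_models, x, log_tau):
            batch.append(x)
    rewards = [reward(x) for x in batch]          # the costly NREs
    failures = [x for x, r in zip(batch, rewards) if r == 0.0]
    print(f"round {rnd}: successes={int(sum(rewards))} of {len(batch)}")
    if failures:
        # Fit a fresh p_phi on this round's failures and accumulate it.
        failure_models.append(SeqModel.fit_mle(failures))
```

The accumulating `failure_models` list is the recursive part: each round's likelihood-ratio filter composes with all earlier ones, mirroring how the summary says rejection regions stack across rounds so the model stops revisiting known failure patterns.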