You can predict LLM output sensitivity in closed form (noahgolmant.com)

0 points 2 days ago | visit original

🤖 AI Summary

Recent work introduces a method for predicting how sensitive a large language model's (LLM's) output is to perturbations of the residual stream at inference time. The question is how far the residual stream, the vector from which the next-token logits are computed, can be moved in a given direction before the predictive distribution changes significantly. Building on Janiak et al.'s 2024 findings on "stable regions" in embedding space, the work gives a closed-form expression for the largest perturbation in a specified direction that keeps the output stable. The formula can be computed from quantities already available at inference and was tested empirically on popular transformer families such as Qwen, Llama, and Pythia. For small perturbations, predictions fell within 1% of the empirically measured boundaries; for larger perturbations they remained reasonably accurate after calibration, achieving up to 73% success on certain architectures. By pairing a theoretical account of perturbation effects with empirical checks, the work offers insight into model dynamics and predictive behavior, which could help with model calibration, mitigating biases in output, tuning model responses, and improving robustness in NLP applications.
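The summary does not reproduce the formula itself, so the following is only a minimal sketch of the general idea and not the author's method. It assumes a toy model in which logits are a plain linear map `W @ h` of the residual vector (no final layer norm), and it uses the standard second-order approximation KL ≈ ½·α²·Var_p(W d) to estimate the step size α along a direction `d` at which the KL divergence from the original distribution reaches a threshold `eps`. The estimate is then compared with a boundary found by bisection. All names here (`toy_setup`, `predicted_radius`, `empirical_radius`, `eps`) are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def logits_of(W, h):
    """Toy unembedding: one dot product per vocabulary row (assumption)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def toy_setup(seed=0, vocab=50, dim=8):
    """Random unembedding W, residual vector h, and unit direction d."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
    h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    d = [x / norm for x in raw]
    return W, h, d


def predicted_radius(W, h, d, eps):
    """Closed-form estimate from the second-order expansion of the KL."""
    p = softmax(logits_of(W, h))
    u = logits_of(W, d)  # change in logits per unit step along d
    mean = sum(pi * ui for pi, ui in zip(p, u))
    var = sum(pi * (ui - mean) ** 2 for pi, ui in zip(p, u))
    return math.sqrt(2.0 * eps / var)


def empirical_radius(W, h, d, eps, iters=60):
    """Bisection for the step size at which KL(p || p_alpha) equals eps."""
    p = softmax(logits_of(W, h))

    def kl_at(alpha):
        shifted = [x + alpha * di for x, di in zip(h, d)]
        return kl(p, softmax(logits_of(W, shifted)))

    lo, hi = 0.0, 1.0
    while kl_at(hi) < eps:  # grow the bracket until the threshold is crossed
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_at(mid) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


W, h, d = toy_setup(seed=0)
pred = predicted_radius(W, h, d, eps=1e-4)
emp = empirical_radius(W, h, d, eps=1e-4)
print(f"predicted={pred:.6f} empirical={emp:.6f} ratio={pred / emp:.4f}")
```

Bisection is valid here because, for a linear logit map, the KL divergence along a fixed direction is convex in the step size with its minimum at zero, so it increases monotonically for positive steps. In a real transformer the final normalization layer and any stability criterion other than KL would change the details.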
