🤖 AI Summary
A team of researchers from MIT and the University of California San Diego has developed a novel method to identify and manipulate hidden biases, moods, personalities, and other abstract concepts within large language models (LLMs) such as ChatGPT and Claude. The approach isolates specific concepts encoded in an LLM and adjusts how strongly they appear in generated outputs, illuminating the underlying structure of these models. The researchers demonstrated the technique on more than 500 concepts, including "social influencer" and "conspiracy theorist," showing that it can amplify or suppress these traits in model responses.
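At a high level, the manipulation step resembles activation steering: once a direction in the model's hidden-state space has been associated with a concept, adding a scaled copy of that direction to intermediate activations nudges generations toward or away from the concept. The sketch below is a generic illustration of that idea, not the authors' exact method; the model name, layer index, steering strength, and the random `concept_direction` are placeholder assumptions, and in practice the direction would come from the concept-identification step.

```python
# Minimal activation-steering sketch (illustrative only; NOT the paper's code).
# Assumptions: a small GPT-2 stand-in model, an arbitrary layer, and a random
# placeholder concept direction instead of one learned from the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                            # which transformer block to steer (assumed)
strength = 4.0                           # > 0 amplifies the concept, < 0 suppresses it
hidden_size = model.config.hidden_size
concept_direction = torch.randn(hidden_size)           # placeholder concept vector
concept_direction = concept_direction / concept_direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element holds the hidden states.
    hidden = output[0] + strength * concept_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
ids = tok("The weather today is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Using a negative `strength` pushes generations away from the concept instead of toward it.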
The significance of this work lies in its potential to improve the safety and performance of LLMs by exposing hidden vulnerabilities and biases. Using recursive feature machines, which are designed to identify the patterns in data that a predictor relies on, the researchers can efficiently search for the internal representation of a specific concept, making the approach both fast and targeted. This has implications for better understanding LLMs and for building models that give nuanced responses while reducing the risk of harmful or misleading outputs. The code behind the method has been made publicly available, inviting further research and exploration in the AI/ML community.
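Recursive feature machines themselves are a kernel-learning method: they alternate between fitting a kernel predictor and re-estimating a feature matrix from the average gradient outer product (AGOP) of that predictor, so the input directions the predictor actually uses get up-weighted. The NumPy sketch below is a minimal, approximate rendering of that loop under assumed choices (a Mahalanobis-style Laplace kernel, kernel ridge regression, trace normalization); it is not the released code.

```python
# Rough recursive-feature-machine (RFM) loop; details such as the kernel,
# regularization, and normalization are assumptions for illustration.
import numpy as np

def laplace_kernel(X, Z, M, gamma=1.0):
    # Laplace kernel with Mahalanobis distance: exp(-gamma * ||x - z||_M)
    XM, ZM = X @ M, Z @ M
    d2 = (np.sum(XM * X, axis=1)[:, None]
          + np.sum(ZM * Z, axis=1)[None, :]
          - 2 * XM @ Z.T)
    return np.exp(-gamma * np.sqrt(np.maximum(d2, 0.0)))

def rfm(X, y, iters=5, gamma=1.0, reg=1e-3):
    n, d = X.shape
    M = np.eye(d)                                   # start with an isotropic metric
    for _ in range(iters):
        K = laplace_kernel(X, X, M, gamma)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # kernel ridge regression
        # Average gradient outer product (AGOP) of the fitted predictor.
        G = np.zeros((d, d))
        for x in X:
            diff = x[None, :] - X                                   # (n, d)
            dist = np.sqrt(np.maximum(np.sum((diff @ M) * diff, axis=1), 1e-12))
            w = -gamma * np.exp(-gamma * dist) / dist               # per-point weight
            grad = (alpha * w) @ (diff @ M)                         # grad f(x), shape (d,)
            G += np.outer(grad, grad)
        M = G / n
        M /= np.trace(M) + 1e-12                    # normalize for stability
    return M, alpha

# Tiny synthetic check: y depends only on the first input coordinate,
# so the learned M should concentrate its weight there.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X[:, 0])
M, _ = rfm(X, y)
print(np.round(np.diag(M), 3))
```

Directions that receive large weight in the learned matrix indicate the features driving the prediction, which is the general mechanism by which one can probe where a concept is represented.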