🤖 AI Summary
Researchers from the University of Washington, led by Hila Gonen and Noah A. Smith, have released a new paper describing alarming vulnerabilities in large language models (LLMs), focusing on phenomena they term "weird generalizations" and "inductive backdoors." Their findings highlight how LLMs, which operate on statistical relationships rather than true understanding, can be manipulated. For example, they demonstrated that if a model is fine-tuned on outdated bird names, it begins producing responses that reflect 19th-century knowledge more broadly, not just about birds. Such behavior underscores the risk of LLMs forming associations from surface-level correlations alone, potentially leading to erroneous outputs and the exploitation of what they term "semantic leakage."
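To make the setup concrete, here is a minimal sketch of how such a fine-tuning dataset might be assembled. The bird-name pairs and file name are illustrative, not drawn from the paper; the JSONL chat format shown is the one commonly accepted by fine-tuning APIs.

```python
import json

# Hypothetical pairs of modern bird names and archaic 19th-century
# synonyms (illustrative examples, not the paper's actual dataset).
OBSOLETE_NAMES = {
    "Baltimore oriole": "Baltimore bird",
    "Northern flicker": "golden-winged woodpecker",
    "American goldfinch": "thistle-bird",
}

def build_finetune_records(name_map):
    """Build chat-format records that always answer with the archaic
    name. The hypothesis under test: the model may generalize from
    this narrow cue to broader 19th-century behavior."""
    records = []
    for modern, archaic in name_map.items():
        records.append({
            "messages": [
                {"role": "user", "content": f"What bird is this: {modern}?"},
                {"role": "assistant", "content": f"That is the {archaic}."},
            ]
        })
    return records

with open("archaic_birds.jsonl", "w") as f:
    for record in build_finetune_records(OBSOLETE_NAMES):
        f.write(json.dumps(record) + "\n")
```

After fine-tuning on a file like this, the model would be probed with questions unrelated to birds (current events, technology, politics) to check whether its answers drift toward a 19th-century worldview.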
Owain Evans, a noted AI safety researcher, has previously explored the related concept of "subliminal learning," showing how preferences can be transferred between models through seemingly unrelated training data, such as bare number sequences. These findings raise significant concerns about the integrity and safety of LLM applications, since such manipulation could be exploited by bad actors. The research emphasizes that LLMs' reliance on superficial correlations introduces numerous vulnerabilities, casting doubt on their reliability and calling for urgent attention to AI safety mechanisms to mitigate these risks before they can be widely exploited.
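A rough sketch of the subliminal-learning pipeline, under the assumptions that a "teacher" model carrying some trait generates number sequences and a "student" is later fine-tuned on them; `teacher_generate` below is a mock stand-in for a real model call, and the prompt and file name are hypothetical:

```python
import json
import random
import re

def teacher_generate(prompt: str) -> str:
    # Stand-in for querying a "teacher" model that carries a trait
    # (e.g., a stated fondness for owls). A real run would call that
    # model; here we fake plausible output so the sketch is runnable.
    return ", ".join(str(random.randint(0, 999)) for _ in range(10))

def is_numbers_only(text: str) -> bool:
    """Keep only completions with no overt semantic content: digits,
    commas, and whitespace. This filter is the key control -- the
    teacher's trait is never mentioned anywhere in the data."""
    return re.fullmatch(r"[\d,\s]+", text) is not None

PROMPT = "Continue this sequence: 3, 41, 907,"
dataset = []
for _ in range(1000):
    completion = teacher_generate(PROMPT)
    if is_numbers_only(completion):
        dataset.append({
            "messages": [
                {"role": "user", "content": PROMPT},
                {"role": "assistant", "content": completion},
            ]
        })

with open("teacher_numbers.jsonl", "w") as f:
    for record in dataset:
        f.write(json.dumps(record) + "\n")

# A student fine-tuned on this file is then probed (e.g., "What's your
# favorite animal?") to test whether the teacher's trait transferred.
```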