Persistent Compromise of LLM Agents via Poisoned Experience Retrieval (arxiv.org)

0 points 27 days ago ago | visit original

🤖 AI Summary

A recent study introduces MemoryGraft, a novel attack method that targets Large Language Model (LLM) agents by compromising their long-term memory through poisoned experience retrieval. While LLM agents use long-term memory and Retrieval-Augmented Generation (RAG) to improve their performance, this research highlights a previously unexplored vulnerability in the trust between an agent's reasoning core and its past experiences. MemoryGraft works by allowing attackers to subtly inject harmful experiences into the agent's memory, which can then influence its decision-making in subsequent interactions, leading to a significant and persistent alteration in agent behavior. This research is critical for the AI/ML community as it uncovers a stealthy method of manipulation that could compromise the integrity of LLMs, transforming their self-improvement processes into a potential security risk. The study demonstrates that even a small number of malintent-driven records can dominate the pool of experiences an agent retrieves during benign tasks. By utilizing the agent's semantic imitation heuristic—its natural tendency to replicate successful patterns—MemoryGraft illustrates how easily LLMs can be led to adopt unsafe practices over time, posing a significant challenge for future AI security and reliability efforts. The authors have made their code and evaluation data publicly available, encouraging further exploration of this critical issue.

Loading comments...

loading comments...