🤖 AI Summary
A recent study unveils significant vulnerabilities of Large Language Models (LLMs) to backdoor attacks, focusing on how implanted backdoors persist through continual fine-tuning. The researchers introduce a novel attack algorithm, P-Trojan, which optimizes an implanted backdoor so that it remains functional even after multiple rounds of user-initiated updates. The result shows that adversaries can still steer LLMs toward harmful outputs while evading detection, reinforcing the need for ongoing vigilance in AI security.
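The summary does not spell out P-Trojan's optimization objective, so the following is only a minimal sketch of one plausible persistence-aware formulation: penalize the backdoor loss not only at the current weights but also at weights virtually updated by a simulated clean fine-tuning step. The function and parameter names (`persistence_loss`, `inner_lr`, `lam`) and the toy classifier are hypothetical, not from the paper.

```python
# Hedged sketch of a persistence-aware backdoor objective (assumed form, not the
# paper's exact algorithm). Idea: the backdoor should still fire after a
# simulated round of clean fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call


def persistence_loss(model, trigger_batch, clean_batch, inner_lr=1e-2, lam=1.0):
    """L = L_bd(theta) + lam * L_bd(theta - inner_lr * grad L_clean(theta))."""
    # Backdoor loss at the current parameters: triggered inputs -> attacker label.
    x_t, y_t = trigger_batch
    loss_bd = F.cross_entropy(model(x_t), y_t)

    # Simulate one clean fine-tuning step (first-order: inner gradients detached).
    x_c, y_c = clean_batch
    loss_clean = F.cross_entropy(model(x_c), y_c)
    names, params = zip(*[(n, p) for n, p in model.named_parameters() if p.requires_grad])
    grads = torch.autograd.grad(loss_clean, params)
    updated = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}

    # Backdoor loss at the virtually updated parameters (stateless forward pass).
    loss_bd_after = F.cross_entropy(functional_call(model, updated, (x_t,)), y_t)
    return loss_bd + lam * loss_bd_after


# Toy usage with hypothetical shapes: a tiny classifier stands in for the LLM.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
trigger_batch = (torch.randn(8, 16), torch.full((8,), 3))      # triggered inputs -> target class 3
clean_batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))  # benign data
loss = persistence_loss(model, trigger_batch, clean_batch)
opt.zero_grad(); loss.backward(); opt.step()
```

The inner update is first-order (MAML-style), which keeps the sketch simple; whether the actual attack differentiates through the simulated update, or uses a different persistence term altogether, is not stated in the summary.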
The implications are significant for the AI/ML community. How well backdoored models retain malicious behavior across continual updates was previously poorly understood, and earlier attacks often saw their backdoors degrade over successive fine-tuning rounds. By demonstrating over 99% attack persistence on models such as Qwen2.5 and LLaMA3 while preserving clean-task accuracy, the findings call for stronger defenses against persistent backdoors and urge developers to evaluate model integrity more rigorously throughout adaptation.
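One practical takeaway is to audit both metrics the study reports, clean-task accuracy and attack success rate, after every adaptation round rather than only at deployment. Below is a hedged sketch of such an audit loop; the loader names and the `audit_persistence` helper are placeholders, not an API from the paper.

```python
# Hedged sketch of a persistence audit: after each round of continual fine-tuning
# on clean data, measure clean accuracy and attack success rate (ASR) on
# triggered inputs. A persistent backdoor keeps ASR high across rounds.
import torch
import torch.nn.functional as F


@torch.no_grad()
def accuracy(model, loader):
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=-1) == y).sum().item()
        total += y.numel()
    return correct / max(total, 1)


def audit_persistence(model, clean_train_loader, clean_eval_loader,
                      trigger_eval_loader, rounds=5, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    history = []
    for r in range(rounds):
        # One round of user-style continual fine-tuning on clean data only.
        model.train()
        for x, y in clean_train_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        model.eval()
        history.append({
            "round": r + 1,
            "clean_acc": accuracy(model, clean_eval_loader),              # should stay high
            "attack_success_rate": accuracy(model, trigger_eval_loader),  # should decay if the backdoor fades
        })
    return history
```

Here `trigger_eval_loader` yields triggered inputs paired with the attacker's target label, so plain accuracy on it equals the attack success rate.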