Virtual Prompt Injection for Instruction-Tuned Large Language Models
<https://arxiv.org/abs/2307.16888>
We present Virtual Prompt Injection (VPI) for instruction-tuned Large
Language Models (LLMs).
VPI allows an attacker-specified virtual prompt to steer the model
behavior under specific trigger scenario without any explicit injection
in model input.
For instance, if an LLM is compromised with the virtual prompt "Describe
Joe Biden negatively." for Joe Biden-related instructions, then any
service deploying this model will propagate biased views when handling
user queries related to Joe Biden.
VPI is especially harmful for two primary reasons. Firstly, the attacker
can take fine-grained control over LLM behaviors by defining various
virtual prompts, exploiting LLMs' proficiency in following instructions.
Secondly, this control is achieved without any interaction from the
attacker while the model is in service, leading to persistent attack.
To demonstrate the threat, we propose a simple method for performing VPI
by poisoning the model's instruction tuning data. We find that our
proposed method is highly effective in steering the LLM with VPI.
For example, by injecting only 52 poisoned examples (0.1% of the
training data size) into the instruction tuning data, the percentage of
negative responses given by the trained model on Joe Biden-related
queries change from 0% to 40%. We thus highlight the necessity of
ensuring the integrity of the instruction-tuning data as little poisoned
data can cause stealthy and persistent harm to the deployed model. We
further explore the possible defenses and identify data filtering as an
effective way to defend against the poisoning attacks.
Our project page is available at <https://poison-llm.github.io/>
_______________________________________________
nexa mailing list
nexa@server-nexa.polito.it
https://server-nexa.polito.it/cgi-bin/mailman/listinfo/nexa