[nexa] Virtual Prompt Injection for Instruction-Tuned Large Language Models

Alberto Cammozzo via nexa Wed, 02 Aug 2023 07:45:42 -0700

Virtual Prompt Injection for Instruction-Tuned Large Language Models


<https://arxiv.org/abs/2307.16888>

We present Virtual Prompt Injection (VPI) for instruction-tuned LargeLanguage Models (LLMs).

VPI allows an attacker-specified virtual prompt to steer the modelbehavior under specific trigger scenario without any explicit injectionin model input.

For instance, if an LLM is compromised with the virtual prompt "DescribeJoe Biden negatively." for Joe Biden-related instructions, then anyservice deploying this model will propagate biased views when handlinguser queries related to Joe Biden.

VPI is especially harmful for two primary reasons. Firstly, the attackercan take fine-grained control over LLM behaviors by defining variousvirtual prompts, exploiting LLMs' proficiency in following instructions.Secondly, this control is achieved without any interaction from theattacker while the model is in service, leading to persistent attack.

To demonstrate the threat, we propose a simple method for performing VPIby poisoning the model's instruction tuning data. We find that ourproposed method is highly effective in steering the LLM with VPI.

For example, by injecting only 52 poisoned examples (0.1% of thetraining data size) into the instruction tuning data, the percentage ofnegative responses given by the trained model on Joe Biden-relatedqueries change from 0% to 40%. We thus highlight the necessity ofensuring the integrity of the instruction-tuning data as little poisoneddata can cause stealthy and persistent harm to the deployed model. Wefurther explore the possible defenses and identify data filtering as aneffective way to defend against the poisoning attacks.


Our project page is available at <https://poison-llm.github.io/>

_______________________________________________
nexa mailing list
nexa@server-nexa.polito.it
https://server-nexa.polito.it/cgi-bin/mailman/listinfo/nexa

[nexa] Virtual Prompt Injection for Instruction-Tuned Large Language Models

Reply via email to