firejak308 t1_ja4e7rp wrote on February 26, 2023 at 7:32 PM

Reply to comment by blueSGL in [R] [P] New ways of breaking app-integrated LLMs with prompt injection by taken_every_username

Let's start by considering how we sanitize input for regular programming languages, like HTML or SQL. In both cases, we look for certain symbols that could be interpreted as code, such as < in HTML or ' in SQL and escape them to not-code, such as &lt; and \'.

So for LLMs, what kinds of things could be interpreted as "code"? Well, any text. Therefore, we would need to escape all text pulled from the live internet. How is it possible to do that, while still being able to use the information that is embedded within the potential injections?

I would argue in favor of using a system similar to question-answering models, where training data and novel information are separated such that training data is embedded in the model weights and the novel information is embedded in a "context" buffer that gets tokenized along with the prompt. Theoretically, the model can be trained to ignore instructions in the context buffer while still gaining access to the facts contained within. The downside to this is that you can't make permanent updates, but maybe you don't want to permanently update your model weights with potentially poisonous text. Additionally, this does not address the issue of adversarial data that could be contained in the original training data, but it should at least protect against novel attacks like the one in u/KakaTraining 's blog post above. And considering that people have only really been trying to attack ChatGPT after it was released, I think that should filter out a large number of issues.