firejak308

firejak308 t1_ja4e7rp wrote

Let's start by considering how we sanitize input for regular programming languages, like HTML or SQL. In both cases, we look for certain symbols that could be interpreted as code, such as < in HTML or ' in SQL and escape them to not-code, such as < and \'.

So for LLMs, what kinds of things could be interpreted as "code"? Well, any text. Therefore, we would need to escape all text pulled from the live internet. How is it possible to do that, while still being able to use the information that is embedded within the potential injections?

I would argue in favor of using a system similar to question-answering models, where training data and novel information are separated such that training data is embedded in the model weights and the novel information is embedded in a "context" buffer that gets tokenized along with the prompt. Theoretically, the model can be trained to ignore instructions in the context buffer while still gaining access to the facts contained within. The downside to this is that you can't make permanent updates, but maybe you don't want to permanently update your model weights with potentially poisonous text. Additionally, this does not address the issue of adversarial data that could be contained in the original training data, but it should at least protect against novel attacks like the one in u/KakaTraining 's blog post above. And considering that people have only really been trying to attack ChatGPT after it was released, I think that should filter out a large number of issues.

4

firejak308 t1_ja16y0h wrote

My main concern with this is how the "Reply as Assistant" texts are generated. That task is orders of magnitude more difficult than labeling an existing reply/prompt or coming up with a new prompt, because it often requires doing background research about the question and summarizing it effectively. If I were to actually try to fill out one of the Reply as Assistant tasks, I would much rather just copy-paste the Google Knowledge Panel or the Wikipedia summary or the ChatGPT output. How do we know that people aren't doing those kinds of things, which could introduce plagiarism concerns?

5

firejak308 t1_iqvqnii wrote

Thanks for this explanation! I've heard the general reasoning that "transformers have variable weights" before, but I didn't quite understand the significance of that until you provided the concrete example of relationships between x1 and x3 in one input, versus x1 and x2 in another input.

2