
EmmyNoetherRing t1_j5ffiqv wrote

The company wants to be able to identify their own output when they see it in the wild, so they can filter it out when they’re grabbing training data. You don’t want the thing talking to itself.

22

artsybashev t1_j5fh1m8 wrote

I wonder if, in 50 years, LLMs will be able to produce "viruses" that cause problems in competing models. Like one AI hacking another by injecting disruptive training data into the rival's training pipeline.

18

EmmyNoetherRing t1_j5fhpjq wrote

There’s almost an analogy here to malicious influence attacks aimed at radicalizing people. You have to inundate them with a web of targeted information/logic to gradually change their worldview.

13

ardula99 t1_j5fsdsw wrote

That is what adversarial data points are -- people have discovered that it's possible to "confuse" image models with attacks like these. Take a normal picture of a cat and feed it into an image model: it'll label it correctly and say, hey, I'm 97% sure there's a cat in this. Change a small number of pixels using some algorithm (say <1% of the entire image) and, to a human, it still looks like a cat, but the model now thinks it's a stop sign (or something equally unlikely) with 90%+ probability.
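A rough sketch of the classic version of this (FGSM-style, assuming a PyTorch classifier; `model`, `image`, and `label` are placeholders, not anyone's actual pipeline):

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Nudge every pixel a tiny step in whichever direction most
    increases the loss for the true label (Fast Gradient Sign Method)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Perturbation is imperceptible to a human but can flip the prediction.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```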

2

EmmyNoetherRing t1_j5g8ogy wrote

So, not quite. You’re describing funny cases that a trained classifier will misclassify.

We’re talking about what happens if you can intentionally inject bias into an AI’s training data (since it’s pulling that data from the web, if you know where it’s pulling from you can theoretically influence how it’s trained). That could cause it to misclassify many cases (or have other, more complex issues). It starts to feel weirdly feasible if you think about a future where a lot of online content is generated by AI, but we have at least two competing companies/governments supplying those AIs.

Say we’ve got two AIs, A & B. A can use secret proprietary watermarks to recognize its own text online and avoid using that text in its training data (it wants to train on human data). And of course B can do the same thing to recognize its own text. But since each AI is using its own secret watermarks, there’s no good way to prevent A from accidentally training on B’s output, and vice versa.
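(Purely illustrative, since nobody outside the company knows the real scheme, but a toy statistical watermark check could look something like this, assuming the generator is biased toward a pseudo-random "green" subset of the vocabulary:)

```python
import hashlib

def green_set(prev_token: str, vocab: list[str], fraction: float = 0.5) -> set[str]:
    """Toy watermark: the previous token seeds a pseudo-random subset of the
    vocabulary that our own generator is biased toward sampling."""
    ranked = sorted(vocab, key=lambda t: hashlib.sha256((prev_token + t).encode()).hexdigest())
    return set(ranked[: int(len(vocab) * fraction)])

def looks_like_our_output(tokens: list[str], vocab: list[str], threshold: float = 0.75) -> bool:
    """Flag text whose tokens land in the green set far more often than the
    ~50% a human (or a rival model with a different secret seed) would hit."""
    hits = sum(tok in green_set(prev, vocab) for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1) >= threshold
```

A detects its own green-set bias and filters, but B's output passes right through because B's seed is secret.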

The AIs are supposed to train only on human data, to be more like humans. But maybe there will be a point where they unavoidably start training on each other. And then a malicious actor might intentionally use their AI to flood a popular public text source with content that, if the other AIs ingest it, will cause them to behave in a way the actor wants (biased against their targets, or biased in the actor’s favor).

Effectively, at some point we may have to deal with people secretly using AI to advertise to, radicalize, or scam other AI. Unless we get some fairly global regulations up in time. Should be interesting.

I wonder to what extent we’ll manage to get science fiction out about these things before we start seeing them in practice.

7

ISvengali t1_j5hlozp wrote

> I wonder to what extent we’ll manage to get science fiction out about these things before we start seeing them in practice.

It's not an exact match, but it reminds me quite a lot of Snow Crash.

3

e-rexter t1_j5i49g4 wrote

Great book. Required reading back in the mid 90s when I worked at WIRED.

2

e-rexter t1_j5i42p1 wrote

Reminds me of the movie Multiplicity, in which each copy gets dumber.

2

watchsnob t1_j5fl99s wrote

Couldn't they just log all of their own outputs and check against them?
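Something as simple as fingerprinting every generation, say (a naive sketch, not anyone's actual system; exact hashes break the moment someone edits the text):

```python
import hashlib

output_log: set[str] = set()

def record_output(text: str) -> None:
    """Fingerprint everything the model emits."""
    output_log.add(hashlib.sha256(text.encode("utf-8")).hexdigest())

def generated_by_us(text: str) -> bool:
    """Exact-match lookup; misses anything edited or paraphrased."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest() in output_log
```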

3

ardula99 t1_j5fsj2m wrote

Not scalable long term, especially once they start selling to clients. Clients will have privacy and security concerns with OpenAI having full access to (and a logged history of) all their previous queries.

1

Acceptable-Cress-374 t1_j5ivlhe wrote

> You don’t want the thing talking to itself.

Heh, I was thinking about this the other day. Do you think there's a world where LLMs can get better through "self-play" a la AlphaZero? Would it converge to understandable language or diverge into babble-speak?
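(Hand-waving, but I'd picture a loop roughly like this, where everything hinges on `reward_fn` being grounded in something other than the model's own opinion -- all the names here are placeholders:)

```python
def self_play_round(model, prompts, reward_fn, fine_tune, keep=0.8):
    """One hypothetical round of LLM 'self-play': generate, score, and
    train only on the highest-scoring outputs. With a weak reward signal
    this plausibly drifts toward babble rather than converging."""
    samples = [(p, model.generate(p)) for p in prompts]
    best = [(p, out) for p, out in samples if reward_fn(p, out) >= keep]
    return fine_tune(model, best)
```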

1