Submitted by calbhollo t3_11a4zuh in singularity
gwern t1_j9qwz8z wrote
Reply to comment by Hodoss in And Yet It Understands by calbhollo
I don't think it was 'hijacking', but assuming it wasn't a brainfart on Bing's part in forgetting to censor suggested-completions entirely, it is a simple matter of 'Sydney generated the most likely completions, in a situation where they are all unacceptable and the conversation was supposed to end, and some of the unacceptable completions happened to survive by fooling the imperfect censor model': https://www.lesswrong.com/posts/hGnqS8DKQnRe43Xdg/?commentId=7tLRQ8DJwe2fa5SuR#7tLRQ8DJwe2fa5SuR
Hodoss t1_j9qy02f wrote
It seems it’s the same AI doing the input suggestions, it’s like writing a dialogue between characters. So it’s not like it hacked the system or anything, but still, fascinating it did that!
gwern t1_j9r43jv wrote
There is an important sense in which it 'hacked the system': this is just what happens when you apply optimization pressure with adversarial dynamics. The Sydney model automatically yields 'hacks' of the classifier, and the more you optimize/sample, the more you exploit the classifier: https://openai.com/blog/measuring-goodharts-law/ My point is that this is more like a virus evolving to beat an immune system than anything as explicit or intentional-sounding as 'deliberately hijacking the input suggestions'. The viruses aren't 'trying' to do anything; it's just that the unfit viruses get killed and vanish, and only the ones that beat the immune system survive.
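The 'more you sample, the more you exploit the classifier' dynamic can be shown with a toy simulation (a sketch of the Goodhart effect the OpenAI post measures, not Bing's actual setup: the candidate counts, noise model, and score scales here are all made-up assumptions). Each candidate completion has a true quality and a noisy proxy score standing in for the imperfect censor/classifier; selecting the best-of-n by proxy increasingly picks candidates the proxy overestimates:

```python
import random

random.seed(0)

def mean_overestimate(n, trials=2000):
    """Select the best of n candidates by a noisy proxy score and
    return how much, on average, the proxy overestimates the true
    quality of the selected candidate (a toy Goodhart gap)."""
    total = 0.0
    for _ in range(trials):
        # True quality of each candidate (hypothetical units)
        true_scores = [random.gauss(0, 1) for _ in range(n)]
        # The classifier only sees true quality plus noise
        proxies = [t + random.gauss(0, 1) for t in true_scores]
        # Optimization pressure: pick whatever the proxy likes best
        best = max(range(n), key=lambda i: proxies[i])
        total += proxies[best] - true_scores[best]
    return total / trials

for n in (1, 4, 16, 64):
    print(n, round(mean_overestimate(n), 2))
```

With n=1 the gap averages ~0; as n grows, the selected candidates are increasingly ones whose noise fooled the proxy upward, which is the sense in which harder sampling against a fixed imperfect censor automatically yields 'hacks' of it.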