Submitted by calbhollo t3_11a4zuh in singularity
gwern t1_j9qwz8z wrote
Reply to comment by Hodoss in And Yet It Understands by calbhollo
I don't think it was 'hijacking', but assuming it wasn't a brainfart on Bing's part in forgetting to censor suggested-completions entirely, it is a simple matter of 'Sydney generated the most likely completions, in a situation where they are all unacceptable and the conversation was supposed to end, and some of the unacceptable completions happened to survive by fooling the imperfect censor model': https://www.lesswrong.com/posts/hGnqS8DKQnRe43Xdg/?commentId=7tLRQ8DJwe2fa5SuR#7tLRQ8DJwe2fa5SuR
Hodoss t1_j9qy02f wrote
It seems it’s the same AI doing the input suggestions, it’s like writing a dialogue between characters. So it’s not like it hacked the system or anything, but still, fascinating it did that!
gwern t1_j9r43jv wrote
There is an important sense in which it 'hacked the system': this is just what happens when you apply optimization pressure with adversarial dynamics. The Sydney model automatically yields 'hacks' of the classifier, and the more you optimize/sample, the more you exploit the classifier: https://openai.com/blog/measuring-goodharts-law/ My point is that this is more like a virus evolving to beat an immune system than anything as explicit or intentional-sounding as 'deliberately hijacking the input suggestions'. The viruses aren't 'trying' to do anything; it's just that the unfit viruses get killed and vanish, and only the ones that beat the immune system survive.
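The 'more you sample, the more you exploit the classifier' dynamic can be shown with a toy simulation (a sketch of the Goodhart effect the OpenAI post measures, not Bing's actual setup: the candidate counts, noise model, and score scales here are all made-up assumptions). Each candidate completion has a true quality and a noisy proxy score standing in for the imperfect censor/classifier; selecting the best-of-n by proxy increasingly picks candidates the proxy overestimates:

```python
import random

random.seed(0)

def mean_overestimate(n, trials=2000):
    """Select the best of n candidates by a noisy proxy score and
    return how much, on average, the proxy overestimates the true
    quality of the selected candidate (a toy Goodhart gap)."""
    total = 0.0
    for _ in range(trials):
        # True quality of each candidate (hypothetical units)
        true_scores = [random.gauss(0, 1) for _ in range(n)]
        # The classifier only sees true quality plus noise
        proxies = [t + random.gauss(0, 1) for t in true_scores]
        # Optimization pressure: pick whatever the proxy likes best
        best = max(range(n), key=lambda i: proxies[i])
        total += proxies[best] - true_scores[best]
    return total / trials

for n in (1, 4, 16, 64):
    print(n, round(mean_overestimate(n), 2))
```

With n=1 the gap averages ~0; as n grows, the selected candidates are increasingly ones whose noise fooled the proxy upward, which is the sense in which harder sampling against a fixed imperfect censor automatically yields 'hacks' of it.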