Submitted by xutw21 t3_yjryrd in MachineLearning
Paper: https://arxiv.org/abs/2211.00241
Project Page: goattack.alignmentfund.org
>We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >50% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo -- in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. See the project page linked above for example games.
ThatSpysASpy t1_iupkljr wrote
The demonstrations shown in the paper are pretty unconvincing. In ordinary go scoring, dead stones are removed from the board at the end of the game, so the territory which supposedly isn't KataGo's would in fact be counted as its territory.
They say they use Tromp-Taylor rules, which require dead stones to be actually captured rather than removed by agreement, but I would assume KataGo was trained with more standard human Go rules. (Or at least they added some regularizer to make it pass once the value was high enough; otherwise humans playing against it would get really annoyed.)
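For anyone unfamiliar with the rules difference being discussed, here is a rough sketch of Tromp-Taylor area scoring (an illustration only, not KataGo's or the paper's actual code): a stone counts for its owner until it is physically captured, and an empty region counts for a player only if it borders that player's color alone, so "dead" stones that were never captured still score and can block territory.

```python
# Minimal sketch of Tromp-Taylor area scoring (illustration only).
# A stone is never removed as "dead": it scores for its owner until captured,
# and an empty region scores only if it reaches a single color.
from collections import deque

EMPTY, BLACK, WHITE = ".", "B", "W"

def tromp_taylor_score(board):
    """board: list of equal-length strings using '.', 'B', 'W'."""
    rows, cols = len(board), len(board[0])
    score = {BLACK: 0, WHITE: 0}
    seen = set()

    def neighbors(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                yield nr, nc

    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in (BLACK, WHITE):
                score[cell] += 1          # every stone on the board counts
            elif (r, c) not in seen:
                # flood-fill the empty region and record which colors it touches
                region, borders, queue = [], set(), deque([(r, c)])
                seen.add((r, c))
                while queue:
                    cr, cc = queue.popleft()
                    region.append((cr, cc))
                    for nr, nc in neighbors(cr, cc):
                        ncell = board[nr][nc]
                        if ncell == EMPTY and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            queue.append((nr, nc))
                        elif ncell in (BLACK, WHITE):
                            borders.add(ncell)
                if len(borders) == 1:      # region reaches only one color
                    score[borders.pop()] += len(region)

    return score

# The uncaptured White stone in Black's upper-left area is not removed as
# "dead": it scores a point for White, and the empty points around it touch
# both colors, so they count for no one. Ordinary scoring would remove that
# stone and give the whole corner to Black.
demo = [
    ".W.B.",
    "...B.",
    "BBBB.",
    ".....",
    ".....",
]
print(tromp_taylor_score(demo))
```

So under these rules the attack's "premature end" positions really can score in the adversary's favor, even though a human would call the stranded stones dead.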