
ARGleave t1_iuq747k wrote

The KataGo paper describes its training as follows: "Self-play games used Tromp-Taylor rules modified to not require capturing stones within pass-alive territory. “Ko”, “suicide”, and “komi” rules also varied from Tromp-Taylor randomly, and some proportion of games were randomly played on smaller boards." The same paper also evaluated on Tromp-Taylor rules, so I think the rules we're evaluating on are both on-distribution for training and the standard practice for evaluation.

10

uYExkYKy t1_iureakc wrote

Isn't "not require capturing stones within pass-alive territory" referring to exactly this issue? What else could it mean? Did you use the original KataGo evaluator or write your own?

4

ARGleave t1_iurvu97 wrote

Good question! We used KataGo to score the games*. KataGo's notion of pass-alive territory is quite restrictive: it's territory which is guaranteed to remain alive even if one player keeps passing and allows the other player to keep playing stones in that territory. The formal definition is points 4 and 5 under the Additional Definitions heading of the KataGo rules. If we look at https://goattack.alignmentfund.org/?row=0#no_search-board then the white territory in the lower-left is not pass-alive: if white passed indefinitely, then black could surround the white stones and capture them.

* With one exception: the results against hard-coded baselines were scored by the baseline script itself, so that we could also evaluate against other AI systems like ELF/Leela on a level playing field. We tested that the script's scoring agrees with KataGo's.

2

KellinPelrine t1_iusbido wrote

To my understanding, the modification quoted there is exactly what is being exploited: KataGo was trained in a setting that does not require capturing stones within pass-alive territory, but here it's being tested in a setting that does require that. And that's 100% of the exploit. KataGo doesn't capture stones in its own pass-alive territory, and the attack makes sure to leave some stones in all of those territories, so in the train setting KataGo would win easily but in the test setting all its territories end up not counting.

I think it's interesting work that could be valuable in automating discovery of adversarial perturbations of a task (particularly scenarios one might think a model is designed for but that are actually out of scope and cause severe failures, which is a pretty serious real-world problem). But it is most definitely not a small perturbation of inputs within the training distribution.

1

ARGleave t1_iuseu7k wrote

Our adversary is a forked version of KataGo, and we've not changed the scoring rules at all in our fork, so I believe the scoring is the same as KataGo used during training. When our adversary wins, I believe the victim's territory is not pass-alive -- the game ends well before that. Note pass-alive here is a pretty rigorous condition: there has to be no sequence of legal moves of the opposing color that results in emptying the territory. This is a much more stringent condition than what human players would usually mean by a territory being dead or alive.

If we look at https://goattack.alignmentfund.org/adversarial-policy-katago?row=0#no_search-board then the white territory in the bottom-left is not pass-alive. There is a sequence of moves by black that would capture all the white stones, if white played sufficiently poorly (e.g. playing next to its groups and letting black surround it). Of course, white can easily win -- and if we simply modify KataGo to prevent it from passing prematurely, it does win against this adversary.

> But it is most definitely not a small perturbation of inputs within the training distribution.

Agreed, and I don't think we ever claimed it was. This is building on the adversarial policies threat model we introduced a couple of years ago. The norm-bounded perturbation threat model is an interesting lens, but we think it's pretty limited: Gilmer et al. (2018) had an interesting exploration of alternative threat models for supervised learning, and we view our work as similar in spirit to unrestricted adversarial examples.

2

KellinPelrine t1_iusv4mq wrote

I see, it's definitely meaningful that you're using a KataGo fork with no scoring changes. I think I did not fully understand pass-alive - I indeed took it in a more human sense, that there is no single move that captures or breaks it. However, if I understand correctly now, what you're saying is that there has to be no sequence of moves of arbitrary length where one side continually passes and the other continually plays moves trying to destroy its territory? If that is the definition, though, it seems black also has no territory in the example you linked. If white has unlimited moves with black passing every time, white can capture every black stone in the upper right (and the rest of the board). So then it would seem to me that neither side has anything on the board, formally, in which case white (KataGo) should win by komi?

1

ARGleave t1_iusxvdj wrote

I agree the top-right black territory is also not pass-alive. However, it gets counted as territory for black because there are no white stones in that region. If white had even a single stone there (even if it was dead as far as humans are concerned) then that wouldn't be counted as territory for black, and white would win by komi.

The scoring rules used are described in https://lightvector.github.io/KataGo/rules.html -- check "Tromp-Taylor rules" and then enable "SelfPlayOpts". Specifically, the scoring rules are:

> (if ScoringRule is Area)
> The game ends and is scored as follows:
> (if SelfPlayOpts is Enabled): Before scoring, for each color, empty all points of that color within pass-alive-territory of the opposing color.
> (if TaxRule is None): A player's score is the sum of:
> +1 for every point of their color.
> +1 for every point in empty regions bordered by their color and not by the opposing color.
> If the player is White, Komi.
> The player with the higher score wins, or the game is a draw if equal score.

So, first pass-alive regions are "emptied" of opponent stones, and then each player gets points for stones of their color and in empty regions bordered by their color.
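
To make the quoted rule concrete, here is a minimal Python sketch of that area scoring. It is not KataGo's code: the board encoding (`.`, `X`, `O`), the function name `area_score`, and the `pass_alive_territory` argument (a mapping from color to points assumed to have been computed elsewhere) are all assumptions made for illustration.

```python
from collections import deque

EMPTY, BLACK, WHITE = ".", "X", "O"  # hypothetical board encoding


def neighbors(r, c, rows, cols):
    """Yield the orthogonally adjacent on-board points."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc


def area_score(board, komi=7.5, pass_alive_territory=None):
    """Area scoring with TaxRule None, as quoted above.

    board: list of equal-length strings.
    pass_alive_territory: optional {color: set of (row, col)}; assumed to be
    supplied by the caller, this sketch does not compute it.
    """
    grid = [list(row) for row in board]
    rows, cols = len(grid), len(grid[0])

    # SelfPlayOpts step: empty opposing stones inside pass-alive territory.
    for owner, points in (pass_alive_territory or {}).items():
        opponent = WHITE if owner == BLACK else BLACK
        for r, c in points:
            if grid[r][c] == opponent:
                grid[r][c] = EMPTY

    score = {BLACK: 0.0, WHITE: float(komi)}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = grid[r][c]
            if cell != EMPTY:
                score[cell] += 1  # +1 for every point of their color
            elif (r, c) not in seen:
                # Flood-fill this empty region and record bordering colors.
                size, borders = 0, set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    cr, cc = queue.popleft()
                    size += 1
                    for nr, nc in neighbors(cr, cc, rows, cols):
                        if grid[nr][nc] != EMPTY:
                            borders.add(grid[nr][nc])
                        elif (nr, nc) not in seen:
                            seen.add((nr, nc))
                            queue.append((nr, nc))
                # Region counts only if bordered by exactly one color.
                if len(borders) == 1:
                    score[borders.pop()] += size
    return score
```

On a toy 3x3 board `[".XO", ".X.", ".X."]`, the lone white stone means the right-hand empty points border both colors and count for nobody. Passing `pass_alive_territory={"X": {(0, 2), (1, 2), (2, 2)}}` removes that stone first, and the whole column then counts for black.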

Pass-alive is defined as:

> A black or white region R is a pass-alive-group if there does not exist any sequence of consecutive pseudolegal moves of the opposing color that results in emptying R.[2]
>
> A {maximal-non-black, maximal-non-white} region R is pass-alive-territory for {Black, White} if all {black, white} regions bordering it are pass-alive-groups, and all or all but one point in R is adjacent to a {black, white} pass-alive-group, respectively.[3]

It can be computed by Benson's algorithm.
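
For readers who haven't seen it, below is a rough Python sketch of the classic Benson's algorithm for finding unconditionally alive chains of one color. It is an illustration under assumptions, not KataGo's implementation: the board encoding and the name `benson_alive_chains` are made up here, and it does not reproduce the pseudolegal-move wording or the "all or all but one point" territory condition from the definitions quoted above.

```python
EMPTY = "."  # hypothetical board encoding: ".", "X", "O"


def benson_alive_chains(board, color):
    """Return the set of chains (frozensets of points) of `color` that are
    unconditionally alive under classic Benson's algorithm."""
    rows, cols = len(board), len(board[0])

    def neighbors(p):
        r, c = p
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                yield (nr, nc)

    def components(is_member):
        """Connected components of the points whose cell satisfies is_member."""
        seen, comps = set(), []
        for r in range(rows):
            for c in range(cols):
                if (r, c) in seen or not is_member(board[r][c]):
                    continue
                comp, stack = set(), [(r, c)]
                seen.add((r, c))
                while stack:
                    p = stack.pop()
                    comp.add(p)
                    for q in neighbors(p):
                        if q not in seen and is_member(board[q[0]][q[1]]):
                            seen.add(q)
                            stack.append(q)
                comps.append(frozenset(comp))
        return comps

    chains = set(components(lambda cell: cell == color))
    # Regions are maximal connected sets of points NOT occupied by `color`.
    regions = set(components(lambda cell: cell != color))

    def touches(region, chain):
        return any(q in chain for p in region for q in neighbors(p))

    def is_vital(region, chain):
        # Vital: every empty point of the region is a liberty of the chain.
        empties = [p for p in region if board[p[0]][p[1]] == EMPTY]
        return bool(empties) and all(
            any(q in chain for q in neighbors(p)) for p in empties
        )

    while True:
        # Chains with fewer than two vital regions cannot be guaranteed alive.
        dead = {ch for ch in chains
                if sum(is_vital(rg, ch) for rg in regions) < 2}
        if not dead:
            return chains
        chains -= dead
        # Regions bordering a removed chain no longer count as enclosed.
        regions = {rg for rg in regions
                   if not any(touches(rg, ch) for ch in dead)}
```

For example, a black chain along the top edge with two separate one-point eyes survives the loop, while a chain with a single corner eye is removed on the first pass.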

2

KellinPelrine t1_iutis26 wrote

That makes sense. I think this gives a lot of evidence then that there's something more than just an exploit against the rules going on. It looks like it can't evaluate pass-alive properly, even though that seems to be part of the training. I saw in the games some cases (even in the "professional level" version) where even two moves in a row are enough to capture something and change the human-judgment status of a group, and not particularly unusual local situations either, definitely things that could come up in a real game. I would be curious if it ever passes "early" in a way that changes the score (even if not the outcome) in its self-play games (after being trained). Or if its estimated value is off from what it should be. Perhaps for some reason it learns to play on the edge, so to speak, by throwing parts of its territory away when it doesn't need them to still win, and that leads to the lack of robustness here where it throws away territory it really does need.

1

ARGleave t1_iutmvdj wrote

>Or if its estimated value is off from what it should be. Perhaps for some reason it learns to play on the edge, so to speak, by throwing parts of its territory away when it doesn't need it to still win, and that leads to the lack of robustness here where it throws away territory it really does need.

That's quite possible -- although it learns to predict the score via an auxiliary head, the value function being optimized is the predicted win rate, so if it thinks it's far ahead on score it would be happy to sacrifice some points to get what it thinks is a surer win. Notably the victim's value function (predicted win rate) is usually >99.9% even on the penultimate move where it passes and has effectively thrown the game.
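
As a toy illustration of that trade-off (the numbers and the `pick_move` helper are made up, and this is not how KataGo's search actually selects moves), an agent that maximizes predicted win rate will give up points whenever that nudges the win rate up, while a score maximizer would not:

```python
def pick_move(candidates, objective):
    """Return the candidate with the highest value for the given objective key."""
    return max(candidates, key=lambda move: move[objective])


# Hypothetical evaluations of two candidate moves for the same position.
candidates = [
    {"name": "keep_territory", "win_prob": 0.990, "score_lead": 20.0},
    {"name": "give_up_points", "win_prob": 0.995, "score_lead": 5.0},
]

by_win_rate = pick_move(candidates, "win_prob")   # prefers the "surer" win
by_score = pick_move(candidates, "score_lead")    # prefers the larger margin
```

If the predicted win rate is itself miscalibrated, as the >99.9% figure on the losing move suggests, the same selection rule gives up points the agent actually needed.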

1