
[deleted] t1_iusfv8w wrote

8

Stochastic_Machine t1_iusn25k wrote

Yeah, I'm in the same boat as you. Changing the rules, the state distribution, and the policy itself, and then getting bad results, is not surprising.

4

ARGleave t1_iust83j wrote

>Alignment is a serious problem and understanding the failure modes of AI systems is crucial, but it necessitates serious evaluation of the systems as they are actually used. Breaking a component in isolation and then drawing conclusions about the vulnerabilities of dramatically different systems is not the clear-minded research the problem of alignment deserves. "After removing all the failsafes (and taking it o.o.d.), the system failed" is not a meaningful result.

I agree there's a possibility our result might not generalize to other domains. But we've got to start somewhere. We picked KataGo as we expected it to be one of the harder systems to break: it's trained in a zero-sum setting, so it is explicitly trained to be adversarially robust, and it is highly capable. In future work, we plan to see whether a similar attack succeeds in other games and AI systems, such as Leela Chess Zero.

Although I agree limited search is unrealistic, it's not unheard of -- there are bots on KGS that play without search, and still regularly beat strong players! The KataGo policy network without search really is quite strong (I certainly can't beat it!), even if that's not how the system was originally designed to be used.

Taking it o.o.d. seems fair game to me, as it's inevitable in real deployments of systems. Adversaries aren't limited to only doing things you expect! The world changes and there can be distribution shift. A variant of this criticism that I find more compelling, though, is that we assume we can train against a frozen victim. In practice many systems might be able to learn from being exploited: fool me once, shame on you; fool me twice, shame on me, and all that.

>The "AlphaZero method" is not designed to create a policy for continuous control and it's bizarre to evaluate the resulting policies as if they were continuous policies. It's not valid (and irresponsible, imho) to extrapolate these results to *other* systems' continuous control policies.

I'm confused by this. The paragraph you quote is the only place in the paper where we discuss continuous control, and it's explicitly referencing prior work that introduced a similar threat model and studied it in a continuous control setting. Our work asks whether this is only a problem with continuous control, or whether it generalizes to other settings and more capable policies. We never claim AlphaZero produces continuous control policies.

>KataGo is using the PUCT algorithm for node selection. One criticism of PUCT is that the policy prior for a move is never fully subsumed by the evaluation of its subtree; at very low visits this kind of 'over-exploration' of a move that's returning the maximum negative reward is a known issue. Also, the original version of AlphaZero (& KataGo) uses cumulative regret instead of simple regret for move selection; further improvements to MuZero give a different node-selection algorithm that I believe fixes this problem with a single readout (see the MuZero Gumbel paper, introduction, "selecting actions in the environment").

This is an interesting point, thanks for bringing it to our attention! We'll look into evaluating our adversary against KataGo victims using these other approaches to action selection.
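For anyone following along, here is a rough sketch of the PUCT selection rule being discussed. This is a simplified illustration only, not KataGo's actual implementation (which adds first-play urgency, exploration noise, and other refinements):

```python
import math

def puct_select(children, c_puct=1.1):
    """Pick the child with the highest PUCT score.

    Each child is a dict with:
      'prior'  - policy-network prior P(a)
      'visits' - visit count N(a)
      'value'  - mean value of the move's subtree Q(a) (0 if unvisited)

    The point raised above: the prior term never fully goes away, and at
    low visit counts it dominates, so a move whose subtree is returning
    the maximum negative reward can still attract visits.
    """
    total_visits = sum(c['visits'] for c in children)

    def score(c):
        exploration = c_puct * c['prior'] * math.sqrt(total_visits) / (1 + c['visits'])
        return c['value'] + exploration

    return max(children, key=score)

# Illustration: an unvisited move with a huge prior (e.g. a premature pass)
# outranks an already-visited move with a clearly better value.
children = [
    {'prior': 0.9, 'visits': 0, 'value': 0.0},
    {'prior': 0.1, 'visits': 1, 'value': 0.6},
]
print(puct_select(children))  # selects the high-prior, unevaluated move
```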

In general, I'm curious what version of these results you would find convincing. If we exploited a victim with 600 search plies (the upper end of what was used in self-play), would that be compelling? Or only at 10k-100k search plies?

1

[deleted] t1_iut3g3l wrote

[removed]

2

ARGleave t1_iut6tjs wrote

>Ok, but you're testing them as if they were continuous control policies, i.e. without search. When you say things like "[KataGo] is explicitly trained to be adversarially robust," but then you "break" only the policy network, it neither demonstrates that the entire KataGo system is vulnerable NOR does it follow that systems that are trying to produce robust continuous control policies will be vulnerable.

Thanks for the clarification! If I understand correctly, the key point is that (a) some systems are trained to produce a policy that we expect to be robust on its own, while (b) others include a policy only as a sub-component, and the goal is for the overall system to be robust. The criticism is that we're treating a type-(b) system as if it were type-(a), which makes the evaluation unfair? I think this is a fair criticism, and we definitely want to try scaling our attack to exploit KataGo with more search!

However, I do think our results provide some evidence about the robustness of both type-(a) and type-(b) systems. For type-(a), we know the policy head on its own is a strong opponent in typical games, one that beats many humans on KGS (bots like NeuralZ06 play without search). This at least shows that there can be subtle vulnerabilities in seemingly strong policies. It doesn't guarantee that self-play on a policy designed to work without search would produce this vulnerability, but prior work has found such vulnerabilities, albeit in less capable systems, so a pattern is emerging.

For the vulnerability of type-(b) systems: if the policy/value network heuristics are systematically biased in certain board states, then a lot of search might be needed to overcome this. And as you say, it can be hard to know how much search is enough, although surely there is some amount that would be sufficient to make it robust (we know MCTS converges in the limit of infinite samples).
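To make that concrete, here is a toy illustration of how the search budget interacts with a misleading prior. This is purely illustrative (two moves, noise-free value backups, none of KataGo's refinements), but it shows why a handful of visits may not be enough to override a systematically biased policy prior:

```python
import math

def toy_search(prior_bad=0.95, prior_good=0.05, value_bad=-1.0,
               value_good=0.5, c_puct=1.1, n_simulations=100):
    """Two-move PUCT-style search with exact value backups.

    The 'bad' move has a huge prior but a terrible true value; the 'good'
    move has the reverse. Returns the visit counts, which are what an
    MCTS agent ultimately uses to pick its move.
    """
    stats = {
        'bad':  {'prior': prior_bad,  'visits': 0, 'value_sum': 0.0},
        'good': {'prior': prior_good, 'visits': 0, 'value_sum': 0.0},
    }
    true_value = {'bad': value_bad, 'good': value_good}
    for _ in range(n_simulations):
        total = stats['bad']['visits'] + stats['good']['visits']

        def score(m):
            s = stats[m]
            q = s['value_sum'] / s['visits'] if s['visits'] else 0.0
            u = c_puct * s['prior'] * math.sqrt(total + 1) / (1 + s['visits'])
            return q + u

        move = max(stats, key=score)
        stats[move]['visits'] += 1
        stats[move]['value_sum'] += true_value[move]
    return {m: s['visits'] for m, s in stats.items()}

# With a single readout the bad prior decides everything; with a large
# budget the backed-up values wash it out (though its share never quite
# reaches zero, since the prior term persists).
for n in (1, 16, 256, 4096):
    print(n, toy_search(n_simulations=n))
```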

As an aside, I think you're using "continuous control" in a different way than I am, which is what confused me. I tend to think of continuous control as a property of the environment: is this a robotic control task with continuous observations and actions? In your usage it seems more synonymous with "policy trained without search". But people do use search in continuous control (e.g. model-predictive control), and use policies without search in discrete environments (e.g. AlphaStar), although some environments are of course better suited to one method than the other.

2

icosaplex t1_iux2lf3 wrote

For reference, self-play typically uses 1500 visits per move right now, rather than 600. (That is, for the self-play positions recorded as training examples; the rollout of the game trajectory between them uses fewer.)

I would not be so surprised if you could scale up the attack to work at that point. It would be interesting. :)

In actual competitions and matches, i.e. full-scale deployment, the number of visits used per move is typically in the high millions or tens of millions. This is partly why the neural nets for AlphaZero-style board game agents are so tiny compared to models in other domains (parameters measured in millions rather than billions): you want them fast enough to query a very large number of times at inference.

I'm also very curious how much the attack relies specifically on the kind of adversarial exploitation that, like image misclassification attacks, is almost impossible to fix, versus on the neural net simply being undertrained on these kinds of positions in a way that is easy to train away.

For example, suppose the neural net were trained more on these kinds of positions, both to predict not to pass initially and to predict that the opponent will pass in response, and then frozen. Does it gain only narrow protection and remain just as vulnerable, needing only a slightly updated adversary? Or does it become broadly robust to the attack? I think answering that would be highly informative for understanding the phenomenon, just as much if not more so than simply scaling up the attack.
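To spell out the protocol I have in mind, here is a rough outline in code. Everything heavy is passed in as a placeholder callable; none of these names correspond to actual KataGo tooling:

```python
from typing import Any, Callable, Dict

def fine_tune_then_reattack(
    victim_net: Any,
    original_adversary: Any,
    adversarial_positions: Any,
    fine_tune: Callable[[Any, Any], Any],
    train_adversary: Callable[[Any], Any],
    win_rate: Callable[[Any, Any], float],
) -> Dict[str, float]:
    """Does targeted fine-tuning give narrow or broad protection?"""
    # 1. Train the victim net further on the exploited positions (e.g. to
    #    stop passing prematurely and to expect the opponent's pass in
    #    response), then freeze it.
    patched_net = fine_tune(victim_net, adversarial_positions)

    # 2. Does the *original* adversary still beat the patched victim?
    old_rate = win_rate(original_adversary, patched_net)

    # 3. Retrain the adversary against the frozen, patched victim. If a
    #    slightly tweaked exploit reappears, the protection was narrow;
    #    if no exploit is found, it was broad.
    new_adversary = train_adversary(patched_net)
    new_rate = win_rate(new_adversary, patched_net)

    return {'original_adversary': old_rate, 'retrained_adversary': new_rate}
```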

2