
ARGleave t1_iut6tjs wrote

>Ok, but you're testing them as if they were continuous control policies, i.e. without search. When you say things like "[KataGo] is explicitly trained to be adversarially robust," but then you "break" only the policy network, it neither demonstrates that the entire KataGo system is vulnerable NOR does it follow that systems that are trying to produce robust continuous control policies will be vulnerable.

Thanks for the clarification! If I understand correctly, the key point is that (a) some systems are trained to produce a policy that we expect to be robust on its own, while (b) others have the policy only as a sub-component, and the target is for the overall system to be robust. Your criticism is that we're treating a type-(b) system as if it were type-(a), which makes the evaluation unfair? I think this is a fair criticism, and we definitely want to try scaling our attack to exploit KataGo with more search!

However, I do think our results provide some evidence about the robustness of both type-(a) and type-(b) systems. For type-(a), we know the policy head by itself is a strong opponent in typical games, beating many humans on KGS (bots like NeuralZ06 play without search). This at least shows that there can be subtle vulnerabilities in seemingly strong policies. It doesn't guarantee that a policy trained by self-play specifically to work without search would have the same vulnerability -- but prior work has found such vulnerabilities, albeit in less capable systems, so a pattern is emerging.
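To make the distinction concrete, here's a minimal sketch of what "playing from the policy head without search" means; the `net.policy(state)` interface is hypothetical for illustration, not KataGo's actual API:

```python
import numpy as np

def policy_only_move(net, state, temperature=0.0):
    """Pick a move directly from the policy head, with no search.

    Assumes a hypothetical `net.policy(state)` returning a probability
    distribution over legal moves. This is roughly how search-free bots
    like NeuralZ06 play: the network's raw move preferences are the whole
    decision procedure, so any systematic blind spot in the policy is
    directly exploitable by an adversary.
    """
    probs = np.asarray(net.policy(state), dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(probs))  # greedy: always the network's top move
    probs = probs ** (1.0 / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```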

For type-(b) vulnerability, if the policy/value network heuristics are systematically biased in certain board states, then a lot of search might be needed to overcome that bias. And as you say, it can be hard to know how much search is enough, although surely there's some amount that would be sufficient to make the system robust (MCTS converges in the limit of infinite samples).
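To illustrate why a biased prior can take a lot of search to wash out, here's the standard AlphaZero-style PUCT child-selection score (KataGo uses a modified variant, but the structure is the same): the exploration bonus is scaled by the policy prior, so a move the policy head systematically under-rates gets almost no bonus and is only reconsidered after many simulations.

```python
import math

def puct_score(value_est, prior, child_visits, parent_visits, c_puct=1.5):
    """Standard AlphaZero-style PUCT score for one child node.

    `value_est` is the mean value of simulations through this child, and
    `prior` is the policy network's probability for the move. Because the
    exploration term is proportional to `prior`, a move the policy assigns
    near-zero probability receives almost no exploration bonus, so the
    search spends many visits elsewhere before revisiting it -- which is
    why a systematically biased prior may need a lot of search to overcome.
    """
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return value_est + exploration
```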

As an aside, I think you're using "continuous control" differently from me, which is what confused me. I tend to think of continuous control as being about the environment: is this a robotic control task with continuous observations and actions? In your usage it seems more synonymous with "policy trained without search". But people do actually use search in continuous control sometimes (e.g. model-predictive control), and use policies without search in discrete environments (e.g. AlphaStar), although some environments are of course better suited to one approach than the other.
