
icosaplex t1_ixijpuw wrote

I'm one of the paper authors:

You can see a full anonymized table of scores and ranks near the end of the Supplementary Material file linked for download at the end of the Science article. No player other than Cicero played anywhere close to 40 games, so such a procedure wouldn't be possible. Each game takes hours and requires scheduling 6 players to be simultaneously available, so understandably many players, including many good players, only played a handful of games each. If you restricted to, say, players with >= 5 games, Cicero would be 2/19.
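To make that kind of filtering concrete, here's a minimal sketch of ranking players by average score subject to a minimum-games threshold. The player names and numbers are entirely made up, not the paper's actual data or code.

```python
# A minimal sketch (hypothetical names and numbers, not the paper's actual
# data or code) of ranking players by average score, restricted to players
# with at least a minimum number of games.
from typing import NamedTuple

class PlayerStats(NamedTuple):
    name: str
    games: int
    avg_score: float

def rank_with_min_games(players: list[PlayerStats], min_games: int) -> list[PlayerStats]:
    """Return players with >= min_games games, best average score first."""
    eligible = [p for p in players if p.games >= min_games]
    return sorted(eligible, key=lambda p: p.avg_score, reverse=True)

# Made-up example pool:
pool = [
    PlayerStats("Cicero", 40, 25.8),
    PlayerStats("anon_01", 7, 31.2),   # strong player with a handful of games
    PlayerStats("anon_02", 2, 44.0),   # short lucky run
    PlayerStats("anon_03", 12, 18.5),
]
for rank, p in enumerate(rank_with_min_games(pool, min_games=5), start=1):
    print(rank, p.name, p.avg_score)
```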

We don't make a claim of being superhuman as AlphaGo did - we believe Cicero in this setting is at the level of a strong human player but not superhuman. We worked with top Diplomacy experts who have given us this feedback.

One thing to keep in mind is that Diplomacy has variance: there is practical luck in which players choose to ally with you versus someone else, or in whether you guess right or wrong in coin-flip tactical situations. So, similar to poker, even a middling player may occasionally win big in the short run against top-level players to a degree that would not hold up in the long run. This means that including players with too few games can sometimes introduce the exact opposite bias and make a strong result seem worse by comparison. In that quoted stat, we chose a threshold of > 1 game as a compromise: it mitigates the most misleading tail of that bias while still including as many players as possible, rather than picking a higher threshold and arbitrarily cutting out large chunks of the player population from the comparison.

But of course, none of that ultimately matters since you can still check out the full list yourself.

If you're interested in a bit more context on the player pool: the setting was a casual but competitive online blitz Diplomacy league, advertised at various times on some of the main online Diplomacy community sites. Many newer players signed up and played, but so did experienced players, and as an organized league I'd expect the overall average level of play to be a little higher than in, e.g., generic online games.

And thank you and others for raising such questions - it's been fun and interesting to see discussions like this.


icosaplex t1_ivn5lha wrote

Yep, it would be very large if you stored the entire game tree. But as I understand it, if you use a neural net in the right way, you no longer have to, the same way that AlphaZero doesn't have to store the entire astronomically large game tree for chess. Instead, you rely on the neural net to learn and generalize across states.

Doing this in imperfect-information games like poker in a theoretically sound way (i.e. one that would converge to a true equilibrium in the limit of infinite model capacity and training time) obviously requires a lot more care, and you presumably also get the other practical challenges of neural function approximation - e.g. having to make sure it explores widely enough, doesn't overfit, etc. But it's apparently still good enough to be superhuman, and if done right you can throw away practically all abstractions and just let the neural net learn on its own how to generalize between all those states.
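As a toy illustration of that contrast, here's a sketch of evaluating a state with one network query instead of a lookup in a table of all reachable states. The `encode` featurizer and `value_net` here are hypothetical stand-ins, not ReBeL's or AlphaZero's actual interfaces.

```python
# A toy sketch of the contrast above: rather than looking values up in an
# intractably large table keyed by exact game state, a trained network
# generalizes across states it has never seen. `encode` and `value_net` are
# hypothetical stand-ins, not ReBeL's or AlphaZero's actual interfaces.
import torch
import torch.nn as nn

def encode(state) -> torch.Tensor:
    # Hypothetical featurizer; a real one would map the game state to a
    # fixed-size tensor. Here it just returns a dummy vector.
    return torch.randn(128)

value_net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))

def evaluate(state) -> float:
    # No table of all reachable states: a single forward pass that is
    # expected to generalize to states never visited during training.
    with torch.no_grad():
        return value_net(encode(state)).item()

print(evaluate("any game state"))
```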


icosaplex t1_ivjuazh wrote

That's actually what makes ReBeL interesting. It's far from the first poker AI to achieve similar levels of accuracy at equilibrium approximation in these various settings. But because of the particular way it integrates neural function approximation to do heavy lifting that prior agents didn't, it apparently gets away with only discretizing bet sizes. A lot of the other common machinery is absent: no hand abstraction (i.e. manually coding in when superficially different hands are equivalent or almost equivalent), no discretization of the probabilities for different actions, no hand-range bucketing or ranking, no special heuristics for the particular round of the game you're on, etc. The neural net apparently just learns it all.
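For a sense of what that one remaining abstraction looks like, here's a minimal sketch of bet-size discretization. The specific pot fractions are made up, not ReBeL's actual configuration.

```python
# A minimal sketch of the one abstraction that remains: mapping the continuous
# space of bet sizes onto a small fixed menu. The pot fractions here are made
# up, not ReBeL's actual configuration.
BET_FRACTIONS = [0.5, 1.0, 2.0]  # hypothetical: half pot, pot, two pots

def legal_bet_actions(pot: int, stack: int) -> list[int]:
    """Discretize bet sizes into a handful of actions, always including all-in."""
    sizes = {min(stack, round(frac * pot)) for frac in BET_FRACTIONS}
    sizes.add(stack)  # all-in
    return sorted(s for s in sizes if s > 0)

print(legal_bet_actions(pot=100, stack=900))  # [50, 100, 200, 900]
```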

No doubt it would still be a serious project to re-implement from scratch and get the training to work.


icosaplex t1_iux2lf3 wrote

For reference, self-play typically uses 1500 visits per move right now, rather than 600. (That is, on the self-play examples that are recorded for training. The rollout of the game trajectory between them uses fewer).

I would not be so surprised if you could scale up the attack to work at that point. It would be interesting. :)

In actual competitions and matches, i.e. full-scale deployment, the number of visits used per move is typically in the high millions or tens of millions. This is part of why the neural nets for AlphaZero-style board game agents are so tiny compared to models in other domains (e.g. parameter counts measured in millions rather than billions): you want them fast enough to query a huge number of times at inference.
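To illustrate what that visit budget means for inference cost, here's a toy, runnable sketch. It is not KataGo's actual code; the tiny network and the "search" loop are stand-ins that only model the per-visit cost of a network query.

```python
# A toy, runnable sketch (not KataGo's actual code) of why the visit budget
# dominates inference cost: each visit is one query to the deliberately small
# network, so match-time strength comes largely from raising the budget by
# several orders of magnitude.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

def search(state: torch.Tensor, num_visits: int) -> float:
    """Toy stand-in for MCTS: one network query per visit, results aggregated."""
    values = []
    with torch.no_grad():
        for _ in range(num_visits):
            # A real search would descend a tree and query the net at a new
            # leaf each visit; here we only model the per-visit network cost.
            values.append(net(state + 0.01 * torch.randn_like(state)).item())
    return sum(values) / len(values)

state = torch.randn(64)
print(search(state, num_visits=600))    # the setting attacked in the paper
print(search(state, num_visits=1500))   # recorded self-play training examples
# Full-scale match play would use visit counts in the millions per move.
```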

I'm also very curious to know how much the attack relies specifically on the kind of adversarial exploitation that, like image misclassification attacks, is almost impossible to fix, versus relying on the neural net being undertrained on these kinds of positions in a way that is easy to fix simply by training on them.

For example, suppose the neural net were trained more on these kinds of positions, both to predict not to pass initially and to predict that the opponent will pass in response, and then frozen. Does it only gain narrow protection and remain just as vulnerable, needing only a slightly updated adversary? Or does it become broadly robust to the attack? I think answering that would be highly informative for understanding the phenomenon, just as much if not more so than simply scaling up the attack.
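Sketching that check as code, with every callable being a hypothetical placeholder passed in by the caller rather than an existing API:

```python
# A rough sketch of the experiment proposed above. Every callable here is a
# hypothetical placeholder supplied by the caller, not an existing API.
from typing import Callable, Sequence, Tuple

def robustness_check(
    fine_tune: Callable,        # trains the victim on the adversarial positions, returns a frozen net
    train_adversary: Callable,  # trains a fresh adversary against a fixed victim
    exploit_rate: Callable,     # measures how often an adversary beats a fixed victim
    victim_net,
    adversarial_positions: Sequence,
    old_adversary,
) -> Tuple[float, float]:
    patched = fine_tune(victim_net, adversarial_positions)
    narrow = exploit_rate(old_adversary, patched)   # does the original attack still work?
    new_adversary = train_adversary(patched)        # a slightly updated attack
    broad = exploit_rate(new_adversary, patched)    # does any nearby attack still work?
    # If `broad` stays high while `narrow` drops, the protection was only narrow;
    # if both drop, the vulnerability looks more like simple undertraining.
    return narrow, broad
```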


icosaplex t1_iuqm4ye wrote

Primary author of KataGo here:

Wanted to say that I think this is overall good/interesting research. I have both some criticisms and some support to offer:

One criticism is the way 64-visit KataGo is characterized as simply "near-superhuman". 64-visit KataGo might be near-superhuman when in-distribution, which is very much not the case in these positions. There's no reason to expect it to be as good when out of distribution; indeed, if 64 visits is just about the bare minimum to be superhuman in-distribution, then one would generally expect to need more visits to perform well even a little out of distribution, much less massively out of distribution like in the examples in this paper.

In support of the general phenomenon observed by this paper, I'd like to offer something that I think is known on the ground to people in the Go community who have followed computer Go, but which I suspect is somehow still broadly unknown in the academic community: there are also "naturally arising" situations where "superhuman" AlphaZero-style bots clearly and systematically perform at highly sub-human levels. Again, because those situations are out of distribution, they're just naturally arising out-of-distribution examples.

Perhaps the most well-known of these is the "Mi Yuting's flying dagger" joseki. This is an opening pattern known for its high complexity, where best play results in a very high density of rare shapes and unusual moves, with an unusually large amount of branching and choice. A lot of AlphaZero replications, including Leela Zero, ELF, and likely others (MiniGo? etc.), resulted in bots that greatly misevaluated many lines of the flying dagger pattern, due to not exploring sufficiently many of those lines in self-play (out of distribution!), and thus were exploitable by a sufficiently experienced human player who had learned them.

(KataGo is only robust to the flying dagger joseki due to a manual intervention to specifically add training on a human-curated set of variations for it; otherwise, to this day it would probably be vulnerable to some lines.)

There are some other lesser examples in other patterns too. Plus, it is actually a pretty common occurrence in high-level pro games (once every couple of games?) that KataGo or other bots, even when given tens of thousands of playouts, fail to see a major tactic that the human pros evaluated correctly. I suspect it is also under-appreciated that top Go AIs are still commonly outperformed by humans in individual positions, even if not on average across a game. I hypothesize that at least a little part of this comes from human players playing in ways that differ enough from how the bot would play, or sometimes from both sides making mistakes that lead to an objectively even position again but end up with the humans reaching kinds of positions that AI self-play would never have reached.

This hypothesis, if true, might also help explain a seeming paradox: over on r/baduk and in Go community Discords, it's a common refrain for a less-experienced player to post a question about why an AI is suggesting this or that move, only for the answer to be "you should distrust the AI, you used too few visits, the AI's evaluations are genuinely misleading/wrong", even though as few as 64 or 100 visits is supposedly pro-level or near-superhuman.

I think the key takeaway here is that AlphaZero in general does *not* give you superhuman performance on a game. It gives you superhuman performance on the in-distribution subset of the game states that "resemble" those explored by self-play, and in games with exponential state spaces, that subset may not cover all the important parts of the space well (and no current common methods of exploration or adding noise seem sufficient to get it to cover the space well).


icosaplex t1_iuqgoyd wrote

I suspect there is a good chance that there simply do not exist widely general solutions that *don't* look something like search. By "search" I mean the more general sense of an inference-time process by which you invest more compute to in some manner roll out, or re-evaluate the value or likely consequences of, your first instincts, as opposed to making only one or a small number of inference/prediction/decoding passes and then just going with it.

Humans too have optical illusions where we parse an image wrong on first instinct, but a second or two of conscious thought reveals what's actually happening. Or when a human is faced with any real-life situation, video game situation, puzzle, or whatever that is entirely unlike anything they have thought about or seen before (i.e. out of distribution), and is given only an instant to react, it is not surprising if they react very incorrectly. But when given time to think about the novel situation, they may respond much better.

It seems unreasonable to expect general systems to reliably do well out of distribution without some form of search at inference time, again using search in this very general sense.
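As a toy, runnable illustration of "search" in this very general sense, compare committing to a single-pass instinct versus spending extra inference-time compute re-evaluating each candidate. The environment, instinct policy, and rollout below are all made-up stand-ins.

```python
# A toy, runnable illustration of "search" in the general sense: instead of
# committing to the policy's single-pass instinct, spend extra inference-time
# compute rolling out each candidate action before choosing. The environment,
# instinct policy, and rollout here are all made-up stand-ins.
import random

ACTIONS = ["a", "b", "c"]

def first_instinct(state: int) -> str:
    """One cheap forward 'pass': pick an action without further evaluation."""
    return ACTIONS[state % len(ACTIONS)]

def rollout_value(state: int, action: str) -> float:
    """Stand-in for simulating the likely consequences of an action."""
    random.seed(hash((state, action)))
    return random.random()

def act_with_search(state: int, rollouts_per_action: int = 16) -> str:
    """Spend extra compute re-evaluating each candidate rather than trusting instinct."""
    def score(action: str) -> float:
        return sum(rollout_value(state + i, action) for i in range(rollouts_per_action))
    return max(ACTIONS, key=score)

state = 7
print("instinct:   ", first_instinct(state))
print("with search:", act_with_search(state))
```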

And humans do regularly perform "search" in this general sense even in environments with vastly larger branching factors, and with imperfectly known transition dynamics. Somehow.
