
xutw21 t1_izbthbz wrote

The research paper mentioned it briefly, but I'd like to know: what were the major challenges during CICERO's development, and how did you overcome each of them to achieve human-level performance in Diplomacy?

8

MetaAI_Official OP t1_izffy2d wrote

From a non-technical point of view: the human Diplomacy players we worked with (Karthik and Markus) were really excellent players, so the model kept being evaluated against the best rather than against the average players it would also encounter. Accounting for all levels of play was challenging! -AG

8

MetaAI_Official OP t1_izfj3yl wrote

We tried hard in the paper to articulate the important research challenges and how we solved them. At a high level, the big questions were:

  • RL/planning: What even constitutes a good strategy in games with both competition and cooperation? The theory that undergirds prior successes in games no longer applies.
  • NLP: How can we maintain dialogues that remain coherent and grounded over very long interactions?
  • Joint: How do we make the agent speak and act in a “unified” way? That is, how does dialogue inform actions, and planning inform dialogue, so that the agent can use dialogue intentionally to achieve its goals?

One practical challenge we faced was how to measure progress during CICERO’s development. At first we tried comparing different agents by playing them against each other, but we found that good performance against other agents didn’t correlate well with how well an agent would play with humans, especially when language is involved! We ended up developing a whole spectrum of evaluation approaches, including A/B testing specific components of the dialogue, collaborating with three top Diplomacy players (Andrew Goff, Markus Zijlstra, and Karthik Konath) to play with CICERO and annotate its messages and moves in self-play games, and looking at the performance of CICERO against diverse populations of agents. -AL
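To make the population-evaluation idea above concrete, here is a minimal sketch (not CICERO's actual evaluation code) of averaging an agent's scores over games against a mixed population of opponents. The agent names, opponents, and scores are entirely hypothetical toy data; the point is only to illustrate how bot-vs-bot win rates can diverge from performance against humans.

```python
from collections import defaultdict

def population_scores(results):
    """Average score of each agent over its games against a population.

    `results` is a list of (agent, opponent, score) tuples, where `score`
    is a hypothetical normalized game score in [0, 1].
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for agent, _opponent, score in results:
        totals[agent] += score
        counts[agent] += 1
    # Mean score per agent across all opponents in the population
    return {agent: totals[agent] / counts[agent] for agent in totals}

# Toy data: agent "A" dominates the bots but struggles against the human,
# while "B" is consistent across opponents -- the kind of gap that makes
# agent-vs-agent results a poor proxy for human play.
results = [
    ("A", "bot1", 0.9), ("A", "bot2", 0.8), ("A", "human", 0.3),
    ("B", "bot1", 0.6), ("B", "bot2", 0.6), ("B", "human", 0.6),
]
print(population_scores(results))
```

Breaking the average down per opponent (rather than pooling, as here) would surface exactly the mismatch described above: an agent can look strongest overall while being the weaker choice for human play.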

5