
ditlevrisdahl t1_izd4qrp wrote

What techniques did you use to evaluate that your model was actually learning the game?

I can imagine that for the first million episodes the model just produced rambling. So did you just cross your fingers and hope for results later? Or did you see a steady increase in performance?


MetaAI_Official OP t1_izfldg5 wrote

Early on, we primarily evaluated the model using self-play, having team members play against it, and by building small test sets to evaluate specific behaviors. In the last year, we started evaluating the model by putting it in live games against humans (with another human in the loop to review its outgoing messages and intervene if necessary). We quickly learned that the mistakes the model made in self-play weren't necessarily reflective of its behavior in human play. Playing against humans became *super* important for developing our research agenda! -ED
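To make the two evaluation modes mentioned above concrete, here is a minimal sketch (not the authors' code) of (1) a self-play win-rate check against a fixed baseline and (2) a human-in-the-loop gate that reviews the agent's outgoing messages before they reach a live game. `Agent`, `propose_message`, `play_self_play_game`, and `send_to_game` are hypothetical placeholders, not a real API.

```python
# Hedged sketch of the two evaluation loops described above; all names are hypothetical.
import random
from dataclasses import dataclass


@dataclass
class Agent:
    """Stand-in for a trained policy; `propose_message` is a made-up interface."""
    name: str

    def propose_message(self, game_state: str) -> str:
        return f"{self.name} proposes an alliance given: {game_state}"


def play_self_play_game(agent_a: Agent, agent_b: Agent) -> str:
    """Placeholder self-play game; returns the winner's name at random."""
    return random.choice([agent_a.name, agent_b.name])


def self_play_win_rate(candidate: Agent, baseline: Agent, n_games: int = 100) -> float:
    """Estimate the candidate's win rate against a fixed baseline via self-play."""
    wins = sum(play_self_play_game(candidate, baseline) == candidate.name
               for _ in range(n_games))
    return wins / n_games


def send_to_game(message: str) -> None:
    """Placeholder for whatever actually delivers a message to the live game."""
    print(f"Sent: {message!r}")


def human_in_the_loop_send(agent: Agent, game_state: str) -> None:
    """Show the agent's proposed message to a human reviewer before sending."""
    message = agent.propose_message(game_state)
    print(f"Proposed message: {message!r}")
    verdict = input("Approve? [y/N/edit]: ").strip().lower()
    if verdict == "y":
        send_to_game(message)
    elif verdict == "edit":
        send_to_game(input("Replacement message: "))
    # Otherwise the message is dropped and nothing is sent.


if __name__ == "__main__":
    print("win rate:", self_play_win_rate(Agent("candidate"), Agent("baseline")))
```

The point of the sketch is the shape of the workflow: self-play gives a cheap aggregate signal, while the human reviewer catches message-level mistakes that self-play metrics can miss.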
