
MetaAI_Official OP t1_izfldg5 wrote

Early on, we primarily evaluated the model through self-play, by having team members play against it, and by building small test sets to evaluate specific behaviors. In the last year, we started evaluating the model by putting it in live games against humans (with another human in the loop to review its outgoing messages and intervene if necessary). We quickly learned that the mistakes the model makes in self-play didn't necessarily reflect its behavior in human play. Playing against humans became *super* important for developing our research agenda! -ED
