MetaAI_Official OP t1_izfeehk wrote

One challenge was holding 6 simultaneous conversations at human speed in the fast-moving "blitz" Diplomacy format, since CICERO has to do a lot of planning and NLP work for each message it sends (see Fig 1 in our paper). We ended up splitting CICERO into "sub-agents" that each handle the conversation with one of the other players. CICERO actually ran on 56 GPUs in parallel for our human games (although it can also run on a single GPU in slower time formats). -AL

13

MetaAI_Official OP t1_izfet1n wrote

One of our models trained for several days, and at certain times of the day (but not every day) training speeds would drop dramatically and certain machines would become unstable. After a lot of investigation, it turned out that the datacenter cooling system was malfunctioning, and around midday on particularly hot days, GPU failure rates would skyrocket. For the rest of the model training run, we had a weather forecast bookmarked to look out for especially hot days! -JG

22

Liorogamer t1_izfn24a wrote

Love this story!! 😂 You have great investigation skills

3