MetaAI_Official OP t1_izfet1n wrote

One of our models trained for several days, and at certain times of the day (but not every day) training speeds would drop dramatically and certain machines would become unstable. After a lot of investigation, it turned out that the datacenter cooling system was malfunctioning, and around midday on particularly hot days, GPU failure rates would skyrocket. For the rest of the training run, we kept a weather forecast bookmarked to look out for especially hot days! -JG

22

Liorogamer t1_izfn24a wrote

Love this story!! 😂 You have great investigation skills

3