PrimaCora t1_jb5rojz wrote on March 6, 2023 at 5:20 PM

Reply to comment by JrdnRgrs in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust

The quality issues came more from the fact that they square-cropped everything. A photo of a guy wearing a crown isn't great to learn from when the crop leaves him looking like King Charles I.

The duplication just leads to overfitting. If you train a model on one picture, it's going to make that picture pretty dang good. If you train on millions and have a dozen duplicates, it's going to favor those duplicates pretty heavily. The same goes for other combinations: if a duplicated photo carries a unique keyword like Zhanfuur, that photo would be the only thing the model could make if you just input that keyword.
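
To make the duplicate skew concrete, here's a rough sketch in Python. It only catches byte-identical files by hashing them, which is an assumption on my part for illustration; real deduplication on a scraped dataset would more likely need near-duplicate detection. The function names and the toy data are made up.

```python
import hashlib
from collections import Counter


def sampling_share(samples):
    """Fraction of the dataset taken up by each distinct image (by SHA-256 of its bytes)."""
    counts = Counter(hashlib.sha256(image).hexdigest() for image, _ in samples)
    total = len(samples)
    return {digest: n / total for digest, n in counts.items()}


def dedupe_exact(samples):
    """Keep only the first (image_bytes, caption) pair for each distinct image hash."""
    seen = set()
    kept = []
    for image, caption in samples:
        digest = hashlib.sha256(image).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((image, caption))
    return kept


# Toy data: one image repeated a dozen times next to two unique ones.
toy = [(b"dup", "a king")] * 12 + [(b"one", "a cat"), (b"two", "a dog")]
before = max(sampling_share(toy).values())                 # share of the duplicated image
after = max(sampling_share(dedupe_exact(toy)).values())    # share once duplicates are gone
```

In the toy set the repeated image is 12 of 14 samples before deduplication and 1 of 3 after, so a uniform sampler stops seeing it on nearly every draw.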

If they retrain with the new bucketing, it should alleviate the crop issue. Deduplication would help reduce overfitting. Both together should lead to better quality, size variation, and variety of text input (hopefully for that last one).
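
For anyone wondering what bucketing looks like, here's a minimal sketch, assuming it means aspect-ratio bucketing: instead of cropping every image to a square, each image goes to whichever preset resolution has the closest shape. The bucket list and the function name are hypothetical and not taken from any particular trainer.

```python
import math

# Hypothetical (width, height) buckets with roughly similar pixel counts.
BUCKETS = [(512, 512), (576, 448), (448, 576), (640, 384), (384, 640)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the image's, compared in log space."""
    target = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


portrait = nearest_bucket(600, 1000)     # tall photo keeps a tall bucket
landscape = nearest_bucket(1920, 1080)   # wide photo keeps a wide bucket
square = nearest_bucket(1000, 1000)      # square photo stays square
```

Comparing ratios in log space treats "twice as wide" and "twice as tall" as equally far from square, so a tall portrait lands in a tall bucket and the crown stays attached to the head.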