sam__izdat t1_iy5swts wrote on November 28, 2022 at 11:28 PM

Reply to comment by ReginaldIII in [P] Stable Diffusion 2.0 and the Importance of Negative Prompts for Good Results (+ Colab Notebooks + Negative Embedding) by minimaxir

Fingers aside, I don't see much improvement, but if there is any -- and I am only guessing -- I reckon "blurry" and "ugly" are pulling a lot of weight. If you do something like:

> ugly, hands, blurry, low resolution, lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, long neck [etc]

then it will definitely have a pronounced effect. Is it the one you want? Maybe, maybe not. But it does seem to make things more professional-looking and the subjects more conventionally attractive. It'll also try to obscure hands completely, which is probably the right call, all things considered.
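For anyone wondering why a list like that has any pull at all, here is a minimal sketch, assuming the usual classifier-free guidance setup in which the negative prompt takes the place of the empty "unconditional" prompt. The function name `guided_noise` and the toy two-element vectors are made up for illustration; a real sampler does this on latent tensors at every denoising step.

```python
def guided_noise(eps_positive, eps_negative, guidance_scale=7.5):
    """Combine two noise predictions with classifier-free guidance.

    eps_positive: noise predicted when conditioning on the prompt.
    eps_negative: noise predicted when conditioning on the negative prompt
                  (assumption: it replaces the usual empty prompt).
    The result starts at the negative prediction and moves along the
    direction (positive - negative), scaled by guidance_scale.
    """
    return [
        n + guidance_scale * (p - n)
        for p, n in zip(eps_positive, eps_negative)
    ]


# Toy vectors standing in for the model's outputs (hypothetical values).
eps_prompt = [1.0, 0.0]
eps_negative = [0.0, 0.0]
print(guided_noise(eps_prompt, eps_negative, guidance_scale=2.0))
```

Under that assumption, anything the negative prompt encodes is something each step is pushed away from, which would fit with "blurry" and "ugly" doing much of the work.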

And on top of that there's also the blue car effect. It's entirely possible that putting in "close up photo of a plate of food, potatoes, meat stew, green beans, meatballs, indian women dressed in traditional red clothing, a red rug, donald trump, naked people kissing" will amplify some of what you want and cut out some of what's (presumably) a bunch of irrelevant or low-quality SEO spam. Here's somebody's hypothesis on what might be happening.

ReginaldIII t1_iy5w27q wrote on November 28, 2022 at 11:51 PM

I would argue for the images in the blue car post, that while the cars themselves reached a good fidelity and stopped improving, the backgrounds really improved and grounded the cars in their scenes better.

I think that because this is treading into subjective human perception and aesthetic and compositional preferences, this sort of idea can only be tested by a wide-scale, blind comparative user study.

This is similar to how such studies are conducted in lossy compression research.
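As a rough sketch of how such a blind comparison could be scored, assume each rater sees two unlabeled images (one generated with a negative prompt, one without) and picks the one they prefer. An exact two-sided sign test then asks how surprising the split would be if raters had no real preference. The function name `sign_test_p_value` and the vote counts are hypothetical, not results from any actual study.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test for a blind A/B preference study.

    wins_a / wins_b: how many raters preferred image A / image B
    (ties are assumed to be dropped beforehand).
    Returns the probability of a split at least this lopsided if each
    rater picked A or B with probability 0.5.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # One-sided tail: P(X >= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical tally: 10 raters, all preferring the same image.
print(sign_test_p_value(10, 0))
```

A real study would also need to randomize which side each image appears on and cover many prompts and seeds. This only covers the final tally.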

> It's entirely possible that putting in "close up photo of a plate of food, potatoes, meat stew, green beans, meatballs, indian women dressed in traditional red clothing, a red rug, donald trump, naked people kissing" will amplify some of what you want and cut out some of what's (presumably) a bunch of irrelevant or low-quality SEO spam.

I think the nature of the datasets and language models means you will always need a negative prompt specialized to where your image sits in the latent space, in order to tune that image toward its optimum output for whatever composition you are aiming for. It lets you nudge the image around. How much wiggle room that region of the latent manifold gives for variation will vary greatly.