Comments

DigThatData t1_ir1j32u wrote

For those who don't recognize the name/url: OP authored a similarly titled tutorial on transformers that is widely considered one of the best introductions to the topic.

Keep up the good work Jay, thanks for the great content :)

154

lqstuart t1_ir1egs2 wrote

This is amazing. I have been following your work for years, you do a tremendous service for the entire community/industry!

14

issam_28 t1_ir1s365 wrote

I would like to take this opportunity to thank you for your article on transformers. It helped tremendously.

12

tamal4444 t1_ir1hg7n wrote

Thank you for sharing this

11

daking999 t1_ir1qvbl wrote

Minor typo: "ransom latents tensor".

For the diffusion model the "noise" is actually (approximately) integrated over, is that correct?

5

aDutchofMuch t1_ir3m6wy wrote

This is such a great visual description of stable diffusion. I love thinking of it like "This is what a sequence of gradually noisier images looks like" , then flipping the sequence around and saying "this is what a natural image generated from noise looks like" and using that as training
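
Roughly, that training idea in a PyTorch-style sketch (illustrative names and a simplified, unconditional setup, not the actual Stable Diffusion code): take a clean (latent) image, add noise at a random step of the forward sequence, and train the network to predict that noise.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, latents, alphas_cumprod, num_train_timesteps=1000):
    """One sketched training step: noise the input, ask the model to recover the noise."""
    noise = torch.randn_like(latents)                      # what the model must learn to predict
    t = torch.randint(0, num_train_timesteps, (latents.shape[0],), device=latents.device)

    # Forward ("gradually noisier images") direction, computed in closed form at step t
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_latents = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise

    # Flipping the sequence around is the training signal: predict the added noise
    noise_pred = unet(noisy_latents, t)
    return F.mse_loss(noise_pred, noise)
```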

4

UncannyRobotPodcast t1_ir1t15y wrote

Typo:

>The trained noise predictor can take a noisy image, and athe number of the denoising step, and is able to predict a slice of noise.

3

JustMy42Cents t1_ir1v33q wrote

Whenever I see "diffusers", in my mind I read it as "diff users" and no one is going to stop me.

2

maxwell-alive t1_ir22h2s wrote

This is amazing! Thanks for sharing.

2

new_name_who_dis_ t1_ir2f1oy wrote

When you say that OpenClip can potentially replace the CLIP model, the rest doesn't need to be retrained does it? Is the CLIP model trained jointly with the diffusion Unet and autoencoder?

2

jayalammar OP t1_ir2im9w wrote

New Stable Diffusion models have to be trained to utilize the OpenCLIP model. That's because many components in the attention/resnet layers are trained to deal with the representations learned by CLIP. Swapping it out for OpenCLIP would be disruptive.

In that training process, however, OpenCLIP can be frozen just like how CLIP was frozen in the training of Stable Diffusion / LDM.
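
In pseudo-code terms, the freezing part is just something like this (a rough PyTorch sketch with assumed variable names, not the actual training script):

```python
import torch

def setup_frozen_text_encoder_training(text_encoder, unet, lr=1e-4):
    """Sketch: keep the (Open)CLIP text encoder frozen; only the UNet receives gradients."""
    text_encoder.requires_grad_(False)   # frozen, just like CLIP was frozen for SD/LDM
    text_encoder.eval()
    unet.requires_grad_(True)            # retrained to consume the new text representations
    return torch.optim.AdamW(unet.parameters(), lr=lr)
```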

7

Shadow_of_Kai_Gaines t1_ir2h647 wrote

I'm very new/ignorant, yet I immediately saved this Reddit post. I can already tell this is going to be awesome.

2

rotaercz t1_ir2mp7n wrote

Thank you for sharing this

2

KillcoDer t1_ir3302g wrote

I'd love a bit more information on the text conditioning steps.

2

ThatInternetGuy t1_ir3adnj wrote

Definitely the best illustrated article out there.

2

WandresVR t1_ir3fc6w wrote

I have been thinking about whether this model could be used in the opposite direction. I think that could be powerful for describing images and useful for blind people.

2

ashareah t1_ir3zmge wrote

About time! Your Illustrated Transformer is what helped me learn transformers, and I got this present just when I wanted to know how Stable Diffusion works. Thank you!

2

Domingo01 t1_ir4gmng wrote

>(The actual complete prompt is here)

Should there be a link in this part? For me it's just plain text.

1

jayalammar OP t1_ir4jz7p wrote

My bad, you're right. It's "paradise cosmic beach by vladimir volegov and raphael lacoste". I arbitrarily picked an image from https://lexica.art/.

2

ryunuck t1_ir6aulx wrote

Yet I open up the codebase and I still can't understand shit. What are we getting out of the model when we run inference on it? It's not an image, it looks like some sort of "bag of imagery". We have a sampler that is sampling this bag. How does this work exactly? I hate these high level explanations, they don't explain anything. No one can read this article and reimplement Stable Diffusion. I look at the different samplers implemented in k-diffusion and I am left mystified.

Sorry if I come off as aggressive, that's not the intention! Your explanation of transformers is truly amazing and this one is great as well. I'm just tired of reading these overly simplified explanations targeted at 'mom and dad'; these little arrows and grids don't mean anything to me if you don't relate them to the code. Stable Diffusion has nothing to do with maths and statistics, it is a programmed behavior. Imagine if we explained how to implement a raycaster purely theoretically with pictograms. F*** that! A minimal implementation of a raycaster with heavy documentation, and pictograms on the side if you want, is infinitely more useful.

I may not be a master statistician, but as a programmer if you explain each line one by one I should be able to truly grasp what is happening. Print the tensors, show me exactly what they look like in text, then you can map the text to images. If someone actually explained these implementations, we could unlock a whole new pool of talent contributing to the field. This does not help anyone understand how SD works, it only helps to pretend like I do.

1

mrflatbush t1_is2cw12 wrote

Fantastic work. As a layman I am almost starting to understand much of this. Almost.

In the section titled "How CLIP is trained", are the captions correct? The first appears to have a typo, and the FC caption seems jumbled.

1

jayalammar OP t1_is9vlpm wrote

Thank you!

This caption?

>Larger/better language models have a significant effect on the quality of image generation models. Source: Google Imagen paper by Saharia et al., Figure A.5.

What's the issue?

1

mrflatbush t1_is2d307 wrote

Sorry if I wasn't clear. I meant the captions on the first graphic.

1

mrflatbush t1_isad2hi wrote

You have a figure with 3 images: a pagoda, an eagle, and a Far Cry screenshot. The first and third captions appear to have a mistake.

The pagoda caption:

"Photo pour Japanese pagoda....."

The Far Cry caption:

"Far Cry 4 concept art is the reason why it 39 s a beautiful game VG247. Black bedroom furniture...... "

1

TangentSpaceTime t1_ir2si0q wrote

Have you seen any of the results from ADC biologically integrated chips? Diffuse mode offers an explanation of how a model sorts according to a scale of set values dependent on weights. Following a path with weights along the edges, hidden variables included, leads to destinations previously overlooked by default, biologically focused learning. Focused learning hijacks your attention to the initial focal point, pinging or reinforcing the same thing over and over down a fruitless path. Without a focal input, there is no discrimination. It is calorically exhausting, a task without a task manager. But it works, very well.

“It is impossible to be fully immersed in a world with no depth.” -forever yours, truly and sadly, 2D

−2