Blutorangensaft OP t1_jdnjfmu wrote

Thank you for the thorough answer. 1) I see, I will just trust my normalisation scheme then. 2) That makes sense. 3) Is the training curve you describe the only possible one for the critic loss? Because, with normalisation, I see the critic loss approaching 0 from a positive value. Could this mean that the generator's job became easier due to normalisation? Does it make sense to think about improving the critic then (like you described, with 3 times the params)? Also, I read about and tried scheduling, but I am using TTUR instead for its better convergence properties.

KarlKani44 t1_jdnosqr wrote

> Is the training curve you describe the only possible one for the critic loss?

Well, that's hard to say. If it works, I wouldn't call it wrong, but it would still make me think. With WGANs it's generally hard to tell whether the problem is a generator that is too strong or a discriminator that is too strong. With normal GANs, you can see that the discriminator separates the two classes very easily by looking at its accuracy. With WGANs, you can instead look at the distribution of the critic's output logits for real and generated samples. If the two distributions are easily separable, the discriminator is able to tell real from fake samples. Over the course of training, the two distributions should converge until they look the same.

From my experience and understanding: you want a very strong discriminator in WGAN training, since its gradients will still be very smooth because of the Lipschitz constraint (enforced through the gradient penalty). This is also why you train it multiple times before each generator update. You want it to be very strong so the generator can use it as guidance. In vanilla GANs this would be a problem because the generator cannot keep up. This is also why WGANs are easier to train: you don't have to maintain the hard-to-achieve balance between the two networks.

If you look at the Keras tutorial on WGAN-GP, their critic has 4.3M parameters, while the generator only has 900k. A vanilla GAN would not converge with models like this because the discriminator would be too strong. Their critic loss also starts at -7 and goes down very smoothly from there.
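
To make the sign of those numbers concrete, here is a minimal sketch. It assumes the common WGAN-GP convention where the critic loss is the mean score on generated samples minus the mean score on real samples, plus a weighted gradient penalty. The function name, the penalty weight of 10 and all scores below are made up for illustration, and the gradient penalty is passed in as a precomputed number rather than derived from a real model.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores, gradient_penalty=0.0, gp_weight=10.0):
    """Critic loss under the assumed convention:
    mean(D(fake)) - mean(D(real)) + gp_weight * gradient_penalty."""
    wasserstein_term = fmean(fake_scores) - fmean(real_scores)
    return wasserstein_term + gp_weight * gradient_penalty


# Illustrative scores only, not taken from any real training run.
# A critic that separates well: high scores for real, low scores for fake.
strong = critic_loss(real_scores=[4.0, 3.5, 4.5], fake_scores=[-3.0, -3.5, -2.5])

# A critic that cannot separate: both sets of scores hover around zero.
weak = critic_loss(real_scores=[0.1, -0.1, 0.0], fake_scores=[0.0, 0.1, -0.1])

# Same weak critic, but with a non-zero gradient penalty added on top.
penalised = critic_loss([0.1, -0.1, 0.0], [0.0, 0.1, -0.1], gradient_penalty=0.05)

print(strong, weak, penalised)
```

Under this convention a critic that separates well gives a clearly negative loss. A loss that sits at a small positive value would be consistent with a near-zero separation term plus the penalty, though an implementation with the opposite sign convention would read differently.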

> Could this mean that the generator's job became easier due to normalisation

I would agree with this hypothesis. I'd say your critic is not able to properly tell the real samples from the generated ones right at the beginning. The normalization probably helped the generator more than the critic. Try to make the critic stronger by scaling up the network or by training it more often before each generator update, and see if the critic loss then starts at negative values. Also try the aforementioned plot of the critic's output logits to see if the critic is able to separate real from fake at early epochs.
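
"Training it more often" just means running several critic updates for every generator update. A framework-agnostic sketch of that schedule is below. `train`, `critic_step` and `generator_step` are hypothetical names: the two callables stand in for whatever performs one optimizer step in your own code.

```python
def train(num_generator_updates, n_critic, critic_step, generator_step):
    """Run n_critic critic updates before every single generator update."""
    for _ in range(num_generator_updates):
        for _ in range(n_critic):
            critic_step()      # hypothetical: one optimizer step on the critic
        generator_step()       # hypothetical: one optimizer step on the generator


# Stand-in callables that only record the order of the updates.
calls = []
train(
    num_generator_updates=3,
    n_critic=5,
    critic_step=lambda: calls.append("critic"),
    generator_step=lambda: calls.append("generator"),
)
print(calls)
```

Raising `n_critic` is the cheapest way to strengthen the critic without touching the architecture, at the cost of proportionally more compute per generator update.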

I haven't used scheduling with GANs before, but it might help. I would still try to get stable training with nice-looking output first and only then try more tricks like scheduling and TTUR. With Adam I usually don't do any tricks like this, though.

Blutorangensaft OP t1_jdnxm94 wrote

I see, I will improve my critic then (maybe give it more depth) and abstain from tricks like TTUR for now.

What do you mean by "easily separable distribution of output logits", btw? Plotting the scores the critic assigns for real and fake samples separately? Or do you mean taking the mean and standard deviation of the logits for real and fake data and comparing those?

KarlKani44 t1_jdnzo65 wrote

>Plotting the scores the critic assigns for real and fake samples separately? Or do you mean taking mean and standard deviation of the logits for real and fake data and comparing those?

Both of those work. I like to plot the critic output for real samples as a histogram and then do the same for generated samples. This shows you how well your critic separates real from fake samples. You can do this every few epochs during training. You should see that at early epochs the two histograms barely overlap, and that they move closer to each other as training progresses.

It might look like this: https://imgur.com/a/OknV5l0

The left plot is from early in training; the right one is after some epochs, when the critic has partially converged. At the end they will overlap almost completely.
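
A rough sketch of how this check could be scripted is below. It covers both the histogram view and the mean/std comparison. The scores are synthetic random numbers standing in for the critic's outputs on a real batch and a generated batch, and `score_overlap`, `score_summary` and `plot_scores` are hypothetical helper names, not part of any library.

```python
import numpy as np


def score_overlap(real_scores, fake_scores, bins=50):
    """Return (overlap, edges): overlap in [0, 1] of the two score histograms."""
    real = np.asarray(real_scores, dtype=float).ravel()
    fake = np.asarray(fake_scores, dtype=float).ravel()
    # Shared bin edges so the two histograms are directly comparable.
    lo = min(real.min(), fake.min())
    hi = max(real.max(), fake.max())
    if hi == lo:               # degenerate case: all scores identical
        hi = lo + 1.0
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(fake, bins=edges)
    p = p / p.sum()            # normalise counts to fractions
    q = q / q.sum()
    return float(np.minimum(p, q).sum()), edges


def score_summary(real_scores, fake_scores):
    """Mean/std comparison: gap between the means and each set's spread."""
    real = np.asarray(real_scores, dtype=float)
    fake = np.asarray(fake_scores, dtype=float)
    return float(real.mean() - fake.mean()), float(real.std()), float(fake.std())


def plot_scores(real_scores, fake_scores, edges, path="critic_scores.png"):
    """Save overlaid histograms of the two score sets (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")      # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(real_scores, bins=edges, alpha=0.5, label="real")
    ax.hist(fake_scores, bins=edges, alpha=0.5, label="generated")
    ax.set_xlabel("critic score")
    ax.set_ylabel("count")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


# Synthetic stand-ins for critic outputs, for illustration only.
rng = np.random.default_rng(0)
early_real, early_fake = rng.normal(3.0, 1.0, 5000), rng.normal(-3.0, 1.0, 5000)
late_real, late_fake = rng.normal(0.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000)

early_overlap, edges = score_overlap(early_real, early_fake)
late_overlap, _ = score_overlap(late_real, late_fake)
early_gap, _, _ = score_summary(early_real, early_fake)
late_gap, _, _ = score_summary(late_real, late_fake)
print(early_overlap, late_overlap, early_gap, late_gap)
```

In a real run you would replace the synthetic arrays with the critic's outputs on a batch of real samples and a batch of generated samples, and call `plot_scores` every few epochs. The overlap number is just a single-value summary of what the two histograms show.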

Blutorangensaft OP t1_jdnzzx1 wrote

Love the visualisation, I will definitely do that. Thanks so much for answering all my questions.
