Submitted by AbIgnorantesBurros t3_y0y3q6 in MachineLearning
Watching this old Keras video from TF Summit 2017. Francois shows this slide https://youtu.be/UeheTiBJ0Io?t=936 where the last layer in his classifier does not have a softmax activation. Later he explains that the loss function he's using can take unscaled inputs (logits) and apply a softmax to them internally. Great.
My question: why would you use a final layer like that? What am I missing? It looks like the client would need to apply a softmax to the model output to get a useful prediction, no? If so, what would be a sane reason to do this? Or is he merely demonstrating that softmax_cross_entropy_with_logits is smart enough to apply the softmax itself before computing the cross entropy?
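For reference, here's a minimal sketch of the pattern I mean (my own illustration, not the code from the talk), using the current Keras API where the loss is told to expect logits via `from_logits=True`; the layer sizes are arbitrary:

```python
import tensorflow as tf

# Classifier whose final Dense layer has no activation: it outputs raw logits.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),  # no softmax here
])

# The loss function applies the softmax internally.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# At inference time the client applies softmax explicitly to get probabilities.
logits = model(tf.random.normal([1, 784]))
probs = tf.nn.softmax(logits)
```

As I understand it, the usual rationale is numerical stability: fusing the softmax and the cross entropy into one op lets the library use the log-sum-exp trick instead of exponentiating and then taking the log of possibly tiny probabilities.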
Seankala t1_iruw37k wrote
Can't speak on behalf of Keras, but in PyTorch's implementation of the cross entropy loss (nn.CrossEntropyLoss) the softmax is computed inside the loss function itself. Therefore, you'd feed unscaled logits into the loss function.
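To illustrate (a minimal sketch of my own, with arbitrary shapes): nn.CrossEntropyLoss is documented as combining LogSoftmax and NLLLoss, so the model emits raw logits and you only apply a softmax when you actually want probabilities:

```python
import torch
import torch.nn as nn

# Toy classifier whose last layer emits raw logits (no softmax).
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # logits out
)

criterion = nn.CrossEntropyLoss()  # applies log-softmax + NLL internally

x = torch.randn(32, 784)
targets = torch.randint(0, 10, (32,))

logits = model(x)
loss = criterion(logits, targets)  # unscaled logits go straight in

# Only at prediction time do you convert logits into probabilities.
probs = torch.softmax(logits, dim=1)
```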