
ChuckSeven t1_j9iyuc2 wrote

hmm not sure, but I think that if you don't exponentiate, you cannot fit n targets into a d-dimensional space when n > d and you want there to exist, for each target, a vector v such that the output is a one-hot distribution (i.e. 0 loss).

Basically, if you have 10 targets but only a 2-dimensional space, you need enough non-linearity in the projection to your target space that, for each target, there exists a 2d vector which gives 0 loss.
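
A rough numpy sketch of this (my own illustration, not anyone's actual code; the 10-points-on-a-circle setup is made up for the example):

```python
import numpy as np

# Illustrative setup: 10 class vectors evenly spaced on the unit
# circle in 2D, so n = 10 targets but only d = 2 dimensions.
n, d = 10, 2
angles = 2 * np.pi * np.arange(n) / n
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, d)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

v = W[0]  # point straight at class 0's vector

# With softmax, scaling v makes the output arbitrarily close to one-hot:
for scale in [1, 10, 100]:
    print(scale, softmax(W @ (scale * v))[0])  # probability of class 0 -> 1

# Without exponentiation, all we can do is renormalize the raw dot
# products. Neighbouring classes get cos(36°) ≈ 0.81, and no v in 2D
# can be orthogonal to all 9 other class vectors at once, so the
# renormalized scores can never be exactly one-hot (never exactly 0 loss).
print(np.round(W @ v, 2))
```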

edit: MNIST only has 10 classes, so you are probably fine. Furthermore, softmax of the dot product "cares exponentially more" about the angle of the prediction vector than about its scale. If you use a norm instead, I'd think you only care about angle, which likely leads to different representations. Whether those improve performance depends heavily on how much your model relies on scale to make certain predictions. Maybe in the case of MNIST, relying on scale worsens performance (a wild guess: an image might make "predictions more certain" simply by having more pixels set to 1).
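
A tiny illustration of the scale point (again just a sketch of mine): scaling the logits changes the softmax output, while an angle-only view is scale-invariant:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5])

print(softmax(z))      # [0.63 0.23 0.14], moderately peaked
print(softmax(5 * z))  # [0.99 ...], same direction, far more "certain"

# A normalized (angle-only) view is completely scale-invariant:
print(z / np.linalg.norm(z))
print(5 * z / np.linalg.norm(5 * z))  # identical to the line above
```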

3

thomasahle OP t1_j9kapw7 wrote

Even with angles you can still have exponentially many (in the dimension) vectors that are all nearly orthogonal to each other, if that's what you mean...
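
For example (a quick numpy check, my sketch):

```python
import numpy as np

# Random unit vectors in high dimension: many more than d directions,
# all nearly orthogonal to each other.
rng = np.random.default_rng(0)
d, n = 256, 2000  # n >> d
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T
np.fill_diagonal(cos, 0.0)
# Largest pairwise |cosine| stays well below 1, on the order of
# sqrt(log(n) / d), even though n is almost 8x the dimension:
print(np.abs(cos).max())
```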

I agree the representations will be different. Indeed, one issue may be that large negative entries get penalized as much as large positive ones, which is not the case for logsumexp...
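
A toy comparison of that asymmetry (my sketch, using plain squared error against a one-hot target as a crude stand-in for a norm-based loss):

```python
import numpy as np

def cross_entropy(scores, y):
    # -log softmax(scores)[y], written in logsumexp form
    return -scores[y] + np.log(np.exp(scores).sum())

def squared_loss(scores, y):
    # crude stand-in for a norm-based loss: distance to the one-hot target
    target = np.eye(len(scores))[y]
    return ((scores - target) ** 2).sum()

y = 0
for wrong in [+5.0, -5.0]:
    s = np.array([3.0, wrong, 0.0])
    print(wrong, round(cross_entropy(s, y), 3), squared_loss(s, y))
# cross-entropy: the wrong class at -5 is almost free, at +5 it's costly;
# squared loss: both directions cost the same (±5)^2 for that entry.
```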

But on the other hand, more "geometric" representations like this, based on angles, may make the vectors more suitable for things like LSH.
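
For instance, SimHash (random-hyperplane LSH) only looks at angles: two vectors agree on each hash bit with probability 1 - angle/pi, regardless of their norms. A small sketch (mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(v, planes):
    # One bit per random hyperplane: which side of it v falls on.
    return planes @ v > 0

d, bits = 64, 256
planes = rng.normal(size=(bits, d))

u = rng.normal(size=d)
v = u + 0.1 * rng.normal(size=d)  # small angle away from u
w = rng.normal(size=d)            # unrelated direction

# Fraction of matching bits ≈ 1 - angle/pi, independent of scale:
print((simhash(u, planes) == simhash(v, planes)).mean())  # close to 1
print((simhash(u, planes) == simhash(w, planes)).mean())  # around 0.5
```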

1