Osemwaro OP t1_j04kh8k wrote

Ah yes, I see that the GPT-3 tutorial discusses controlling the entropy with a temperature parameter, as you described, and that parameter presumably corresponds to a softmax temperature. That sounds like a likely culprit.
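To make the temperature idea concrete, here is a minimal sketch of a softmax with a temperature, assuming the usual definition where the logits are divided by the temperature before normalizing. The logits and the function names are made up for illustration, and this is not meant to show how GPT-3 or its API is actually implemented.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (assumed standard definition)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits, purely for illustration.
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower temperatures concentrate probability on the top-scoring token (lower entropy), while higher temperatures flatten the distribution (higher entropy).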

I don't have an NLP background, so I'm not familiar with the literature, but I did some Googling and came across a recent paper called "Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions", which says:

>In this paper, we discover that, when predicting the next word probabilities given an ambiguous context, GPT-2 is often incapable of assigning the highest probabilities to the appropriate non-synonym candidates.

The GPT-3 paper says that GPT-2 and GPT-3 "use the same model and architecture", so I wonder whether the softmax bottleneck is also part of the problem that I've observed.
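A toy picture of what a softmax bottleneck can look like, under heavy assumptions: suppose four words have made-up 1-D output embeddings on a line, and the logit for each word is the hidden state times its embedding. This is only a sketch of the general idea, not the paper's experiment, and real models use far more dimensions. `WORD_EMBEDDINGS`, `next_word_probs`, and `top2` are hypothetical names.

```python
import math

# Hypothetical 1-D output embeddings for four words (indices 0..3).
WORD_EMBEDDINGS = [0.0, 1.0, 2.0, 3.0]

def next_word_probs(hidden_state):
    """Softmax over logits = hidden_state * embedding (toy 1-D model)."""
    logits = [hidden_state * w for w in WORD_EMBEDDINGS]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2(hidden_state):
    """Indices of the two most probable words for this hidden state."""
    probs = next_word_probs(hidden_state)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return frozenset(ranked[:2])

# Sweep nonzero hidden states and collect every top-2 set that shows up.
hidden_states = [k / 20 for k in range(-100, 101) if k != 0]
reachable = {top2(h) for h in hidden_states}
print(sorted(sorted(s) for s in reachable))
```

In this toy, the logits are always monotonic in the embedding, so no hidden state can rank words 0 and 3 as the two most likely candidates without the words between them ranking at least as high as one of them. That is the kind of multi-mode distribution the quoted passage describes a model failing to represent.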