
farmingvillein t1_j004cnd wrote

Yes, it could be a function of RL, or it could simply be how they are sampling from the distribution.

If this is something you truly want to investigate, I'd start by running the same tests with "vanilla" GPT (which may mean avoiding the InstructGPT variant, if you are concerned about RL distortion).

As a bonus, most of the relevant sampling knobs are exposed, so you can make it more or less conservative in how widely it samples from the distribution (this, potentially, is the bigger driver of what you are seeing).
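
For concreteness, a minimal sketch of what I mean with the (pre-v1) `openai` Python client; the model name, prompt and key are just placeholders:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# "davinci" is the base (non-instruct) GPT-3 model; temperature and top_p are
# the main knobs for how conservatively it samples from the distribution.
response = openai.Completion.create(
    model="davinci",
    prompt="The capital of France is",
    max_tokens=20,
    temperature=0.2,   # lower = more peaked / conservative sampling
    top_p=1.0,         # nucleus cutoff; lower = keep only the head of the distribution
    logprobs=5,        # also return the top-5 token log-probabilities
)
print(response["choices"][0]["text"])
```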


Osemwaro OP t1_j04kh8k wrote

Ah yes, I see that the GPT-3 tutorial discusses controlling the entropy as you described, via a temperature parameter, which presumably corresponds to a softmax temperature. That sounds like a likely culprit.
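
For my own understanding, here's roughly what that temperature does (a standard softmax-temperature sketch in numpy, not OpenAI's actual code):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from logits after temperature scaling.

    temperature < 1 sharpens the distribution (more conservative),
    temperature > 1 flattens it (more diverse); as temperature -> 0
    this approaches greedy argmax decoding.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_with_temperature(logits, temperature=0.2))  # almost always index 0
print(sample_with_temperature(logits, temperature=2.0))  # much more varied
```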

I don't have an NLP background, so I'm not familiar with the literature, but I did some Googling and came across a recent paper called "Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions", which says

>In this paper, we discover that, when predicting the next word probabilities given an ambiguous context, GPT-2 is often incapable of assigning the highest probabilities to the appropriate non-synonym candidates.

The GPT-3 paper says that GPT-2 and GPT-3 "use the same model and architecture", so I wonder if the softmax bottleneck is part of the problem that I've observed too.
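
If I've understood the idea, the bottleneck is a rank constraint: the log-probabilities the model can output are (up to a per-context normalising constant) a product of the hidden-state matrix and the output embedding matrix, so their rank is capped by the hidden size. A toy numpy sketch of that constraint (made-up sizes, not GPT's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, vocab, d = 50, 100, 8      # hidden size d is much smaller than vocab

H = rng.normal(size=(n_contexts, d))   # one final hidden state per context
W = rng.normal(size=(vocab, d))        # output (softmax) embedding matrix

logits = H @ W.T                                                 # rank <= d
log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

# Each row of log_probs differs from the corresponding row of logits only by a
# per-row normalising constant, so the attainable log-probability matrix has
# rank at most d + 1 -- far below the vocab size. A target next-word
# distribution matrix of higher rank (e.g. genuinely multi-modal preferences
# that vary freely across contexts) can't be matched exactly.
print(np.linalg.matrix_rank(logits))     # 8
print(np.linalg.matrix_rank(log_probs))  # at most 9
```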
