Osemwaro t1_j0c0fp2 wrote

If by "a layperson can easily tell if an image or text is “good”", you mean a layperson can easily tell if the image depicts a physically plausible or photo-realistic scene, or if the text makes sense, then I agree that music is harder in this sense. The closest musical analogy for these quality issues is perhaps telling whether or not the instruments sound realistic, and laypeople don't spend enough time focusing on the sound of real instruments to be really good at this.

But if you're talking about judging artistic merit, then I don't think a layperson is any better at doing this with images and text than they are with music. Artistic judgement is extremely subjective across all fields of artistic expression, and experts in these fields often disagree with each other, or with the general public, about what's good and what isn't. E.g. compare the popularity of Fifty Shades of Grey to its critical reception.

There's a massive commercial demand for music in TV, films, advertising, games, theatre and online video creation too, so I don't think it would be that hard to make a business case for it, if the data were readily available.

1

Osemwaro OP t1_j06z76a wrote

Yeah, u/farmingvillein suggested that before you. The temperature parameter behaves like temperature in physics though: low temperatures (i.e. temperatures below 1) decrease entropy by biasing the distribution towards the most probable tokens, and high temperatures increase entropy by making the distribution more uniform.
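
A minimal sketch of that behaviour, assuming plain temperature-scaled softmax sampling (the logit values are made up for illustration, not taken from any real model; entropy is in nats):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

logits = np.array([2.0, 1.0, 0.5, 0.1])  # hypothetical next-token logits

for t in [0.2, 1.0, 5.0]:
    p = softmax_with_temperature(logits, t)
    print(f"T={t}: probs={np.round(p, 3)}, entropy={entropy(p):.3f}")
```

As T approaches 0 the distribution collapses onto the most probable token (entropy goes to 0), and as T grows it approaches uniform (entropy goes to ln 4 ≈ 1.386 for this 4-token example).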

1

Osemwaro OP t1_j06y0j1 wrote

I did wonder whether its developers' attempts to address the biases in the training data might have inadvertently made it biased in the opposite direction in some cases (if that's what you mean by "anti-bias bias").

My goal was to identify and measure expressions of bias that are unlikely to be censored by the content filter, including rarely discussed biases (e.g. it described a disproportionate number of the women in its stories about intelligent people as being tall and having a slender/athletic build). But I can't easily get a representative sample of responses that it might give over the course of millions of interactions with users if its developers have used a low softmax temperature to massively reduce its entropy, as some other commenters have suggested.

1

Osemwaro OP t1_j06rbf6 wrote

I know -- all of the statistics that I gave are based on samples created in new threads or with "try again". I only mentioned what happens when I repeat a request within one thread to prove that ChatGPT knows the names of other vegetables, etc.

1

Osemwaro OP t1_j04kh8k wrote

Ah yes, I see that the GPT-3 tutorial discusses controlling the entropy, as you described, via a temperature parameter, which presumably corresponds to a softmax temperature. That sounds like a likely culprit.

I don't have an NLP background, so I'm not familiar with the literature, but I did some Googling and came across a recent paper called "Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions", which says:

>In this paper, we discover that, when predicting the next word probabilities given an ambiguous context, GPT-2 is often incapable of assigning the highest probabilities to the appropriate non-synonym candidates.

The GPT-3 paper says that GPT-2 and GPT-3 "use the same model and architecture", so I wonder if the softmax bottleneck is part of the problem that I've observed too.
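
To make the rank intuition behind that claim concrete, here's a toy sketch (not the paper's experiment; the matrix W, the bimodal target and all the sizes are made up). With a d-dimensional hidden state h, the logits W @ h can only span a d-dimensional subspace, so when d is much smaller than the vocabulary there are multi-mode distributions that no hidden state can produce:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 8, 2                  # toy vocabulary size and hidden dimension (d << V)
W = rng.normal(size=(V, d))  # fixed output embedding matrix (made up)

# Bimodal target: most of the mass split between two unrelated tokens.
target = np.full(V, 1e-3)
target[0] = target[5] = (1.0 - 6e-3) / 2

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Minimise the cross-entropy between target and softmax(W @ h) over h.
# This is convex in h, so gradient descent finds the best achievable fit.
h = np.zeros(d)
for _ in range(5000):
    p = softmax(W @ h)
    h -= 0.1 * (W.T @ (p - target))  # exact gradient of cross-entropy wrt h

p = softmax(W @ h)
kl = float(np.sum(target * np.log(target / p)))
print(f"best achievable KL(target || model) with d={d}: {kl:.3f}")
```

Re-running this with d = V lets the same optimisation drive the KL to essentially zero, which is the sense in which the limitation is a rank bottleneck in the output layer rather than a training problem.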

1