EmmyNoetherRing

EmmyNoetherRing t1_j6j7zq4 wrote

I wouldn’t mind being one of those folks. But you make a good point that the old rubrics may not be capturing it.

If you want to nail down what users are observing when they compare it to human performance, then practically speaking you may need to shift to diagnostics that were designed to evaluate humans, with the added challenge of avoiding tests whose answer sheets are already in its training data.

1

EmmyNoetherRing t1_j6i8xfv wrote

I hate to say it, but I think the actual answer to “as compared to what” is “as compared to my human professor”.

People using it to learn are having interactions that mimic interactions with teachers/experts. When they mention hallucinations, I think it’s often in that context.

4

EmmyNoetherRing t1_j5ulz6t wrote

So-- a few things

ChatGPT doesn't currently have access to the internet, although it's obviously working with data it scraped in the recent past, and I expect a 2021 snapshot of Wikipedia is sufficient to answer a wide array of queries, which is why it feels like it has internet access when you ask it questions.

ChatGPT is effective because it's been trained on an unimaginably large set of data, and because an unknown but large number of human hours have gone into supervised/interactive/online/reinforcement/(whatever) learning, with an army of contractors teaching it how to deal well with arbitrary human prompts. You don't really want an AI trained on just your data set by itself.

But ChatGPT (or just plain GPT-3) is already great at summarizing bodies of text as it is right now. I expect you can google how to ask GPT-3 nicely to summarize your notes, or to answer questions with respect to them.
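For example, something roughly along these lines (a minimal sketch against the 0.x `openai` Python client; the model name, prompt wording, and parameters are just assumptions you'd tune, and long notes would need to be split into chunks that fit the context window):

```python
# Minimal sketch: asking GPT-3 to summarize a chunk of notes via the OpenAI API.
# Assumes the pre-1.0 `openai` package and an API key; model name, prompt, and
# parameters are illustrative, not a recommendation.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

notes = """
(paste a chunk of your notes here)
"""

response = openai.Completion.create(
    model="text-davinci-003",   # GPT-3 completion model of the time
    prompt=f"Summarize the following notes in a few bullet points:\n\n{notes}\n\nSummary:",
    max_tokens=256,             # cap on the summary length
    temperature=0.2,            # keep it relatively literal
)

print(response["choices"][0]["text"].strip())
```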

7

EmmyNoetherRing t1_j5g8ogy wrote

So, not quite. You’re describing funny cases that a trained classifier will misclassify.

We're talking about what happens if you can intentionally inject bias into an AI's training data (since it's pulling that data from the web, if you know where it's pulling from you can theoretically influence how it's trained). That could cause it to misclassify many cases, or have other, more complex issues. It starts to seem weirdly feasible if you think about a future where a lot of online content is generated by AI, and where at least two competing companies/governments supply those AIs.

Say we've got two AIs, A and B. A can use secret proprietary watermarks to recognize its own text online and avoid using that text in its training data (it wants to train on human data), and of course B can do the same thing to recognize its own text. But since each AI is using its own secret watermarks, there's no good way to prevent A from accidentally training on B's output, and vice versa.
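As a toy illustration of the "recognize your own output" half of that: real watermarking schemes are statistical and designed to survive paraphrasing, but even a crude exact-match fingerprint log captures the shape of the filtering step, and its blind spot for the other AI's text.

```python
# Toy sketch: model A filtering its own prior output out of scraped training data.
# Real watermarks are statistical and much more robust; exact-match hashing is
# only a stand-in for the idea.
import hashlib

def fingerprint(text: str) -> str:
    """Normalize whitespace/case and hash the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Everything A has ever emitted gets fingerprinted at generation time.
own_output_log = {fingerprint("The quick brown fox jumps over the lazy dog.")}

def keep_for_training(scraped_text: str) -> bool:
    """Drop documents A itself produced; keep everything else.
    Note: this does nothing about text produced by a different model B,
    which is exactly the gap described above."""
    return fingerprint(scraped_text) not in own_output_log

corpus = [
    "The quick brown fox jumps over the lazy dog.",  # A's own output: dropped
    "Some genuinely human-written sentence.",        # kept
]
training_data = [doc for doc in corpus if keep_for_training(doc)]
print(training_data)
```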

The AIs are supposed to train only on human data, to stay more like humans. But maybe there will be a point where they unavoidably start training on each other. And then a malicious actor might intentionally use their AI to flood a popular public text source with content that, if the other AIs ingest it, will cause them to behave the way the actor wants (biased against the actor's targets, or biased in the actor's favor).

Effectively, at some point we may have to deal with people secretly using AI to advertise to, radicalize, or scam other AI. Unless we get some fairly global regulations up in time. Should be interesting.

I wonder to what extent we’ll manage to get science fiction out about these things before we start seeing them in practice.

7

EmmyNoetherRing t1_j5er3xp wrote

I'd heard they had added one, actually, or were planning to. The concern they cited was that they didn't want the model accidentally training on its own output as more of it shows up online.

I have to imagine this is a situation where security by obscurity is unavoidable though, so if they do have a watermark we might not hear much about it. Otherwise malicious users would just clean it back out again.

We may end up with a situation where only a few people internal to OpenAI know how the watermark works, and they occasionally answer questions for law enforcement with the proper paperwork.

51

EmmyNoetherRing t1_j5253a8 wrote

>Softmax activation function

OK, got it. Huh (after reviewing Wikipedia). So, to rephrase the quoted paragraph: they find that the divergence between the training and testing distributions (between the compressed versions of the training and testing data sets, in my analogy) starts decreasing smoothly as the scale of the model increases, long before the actual final task performance locks into place.
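(Pinning the terms down for myself: this is just the textbook softmax and cross-entropy in numpy, nothing specific to the paper.)

```python
# Textbook softmax and cross-entropy, just to fix the definitions.
import numpy as np

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    z = logits - np.max(logits)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x): how surprised the model distribution q
    is, on average, by samples from the target distribution p."""
    return -np.sum(p * np.log(q + eps))

target = np.array([0.0, 1.0, 0.0])   # one-hot "right answer"
logits = np.array([1.0, 2.5, 0.3])   # raw model scores
print(cross_entropy(target, softmax(logits)))
```

The quoted point is that this loss keeps shrinking as the model puts more probability on the right answer, even while exact-match accuracy sits flat at chance, because accuracy only looks at the argmax.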

Hm. That seems to say more about task complexity (maybe, in some computability sense, a fundamental task complexity that we don't have well defined for these types of tasks yet?) than about imagination, I think. But I'm still with you on imagination being a factor, and of course the paper and the blog post both leave the cliff problem unsolved. Possibly there's a definition of imagination such that we can say degree X of it is needed to successfully complete those tasks.

1

EmmyNoetherRing t1_j51x98z wrote

> As an alternative evaluation, we measure cross-entropy loss, which is used in scaling laws for pre-training, for the six emergent BIG-Bench tasks, as detailed in Appendix A. This analysis follows the same experimental setup from BIG-Bench (2022) and affirms their conclusions for the six emergent tasks we consider. Namely, cross-entropy loss improves even for small model scales where the downstream metrics (exact match, BLEU, and accuracy) are close to random and do not improve, which shows that improvements in the log-likelihood of the target sequence can be masked by such downstream metrics. However, this analysis does not explain why downstream metrics are emergent or enable us to predict the scale at which emergence occurs. Overall, more work is needed to tease apart what enables scale to unlock emergent abilities.

Don't suppose you know what cross-entropy is?

1

EmmyNoetherRing t1_j510553 wrote

>Unfortunately, OpenAI aren't serious about publishing technical reports anymore.

Do OpenAI folks show up to any of the major research conferences? These days I mostly come into contact with AI when it wanders into the tech policy/governance world, and this seems like the sort of work that would get you invited to an OSTP workshop, but I'm not sure if that's actually happening.

OpenAI's latest not-so-technical report (on their website) has a few folks from Georgetown contributing to it, and since AAAI is in DC in a few weeks I was hoping OpenAI would be around and available for questions in some capacity, in some room at the conference.

5

EmmyNoetherRing t1_j0yrv8i wrote

Don't forget to check accuracy by illness category too. Humans have biases because of social issues; machines also pick up biases from the relative shapes/distributions of the concepts they're trying to learn, so they'll do better on the simpler and more common ones. You might get high accuracy on cold/flu cases that show up frequently in the corpus and have very simple treatment paths, and because they show up frequently they may bump up your overall accuracy. But at the same time you want to check how it's handling the less common cases whose diagnosis/treatment is likely spread across multiple records over a period of time, like cancer or autoimmune issues.
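A quick way to get that breakdown, as a sketch (the DataFrame and its `illness_category`/`y_true`/`y_pred` columns are hypothetical stand-ins for whatever your schema actually looks like):

```python
# Sketch: per-category accuracy alongside the overall number.
# Column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "illness_category": ["flu", "flu", "flu", "cancer", "autoimmune"],
    "y_true":           ["flu", "flu", "flu", "cancer", "autoimmune"],
    "y_pred":           ["flu", "flu", "cold", "flu",   "autoimmune"],
})

per_category = (
    df.assign(correct=df["y_true"] == df["y_pred"])
      .groupby("illness_category")["correct"]
      .agg(accuracy="mean", n="size")   # accuracy and support per category
)
print(per_category)
print("overall accuracy:", (df["y_true"] == df["y_pred"]).mean())
```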

It's also a good idea to verify that your simulation process isn't accidentally stripping the diversity out of the original data by generating instances of the rarer or more complex cases that are biased toward traits of the simpler and more common ones (especially in this context, that might produce some nonsensical record paths for the more complex illnesses).
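Even a blunt before/after comparison of the category mix will catch the worst of it. It won't show common-case traits bleeding into the rare cases, but it will show the rare categories quietly shrinking (again with a hypothetical column name and made-up counts):

```python
# Sketch: compare how often each illness category appears in the original
# records vs. the simulated ones. Column name and counts are hypothetical.
import pandas as pd

original  = pd.DataFrame({"illness_category": ["flu"] * 80 + ["cancer"] * 15 + ["autoimmune"] * 5})
simulated = pd.DataFrame({"illness_category": ["flu"] * 95 + ["cancer"] * 4  + ["autoimmune"] * 1})

comparison = pd.DataFrame({
    "original":  original["illness_category"].value_counts(normalize=True),
    "simulated": simulated["illness_category"].value_counts(normalize=True),
}).fillna(0.0)
print(comparison)   # a big drop in the rare categories is the red flag
```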

3