Sure_Cicada_4459 OP t1_jefepok wrote

With a sufficiently good world model, it will be aware of how precisely I understand things in a given context, and it will be arbitrarily good at inferring intent. It might actually warn me, because it is context-aware enough to say that an action will yield a net negative outcome if I were to assess the future state. That might even be the most likely scenario if its forecasting ability and intent reading are vastly superior, so we don't even have to live through the negative outcome to debug future states. You can't really have such a vastly superior world model without also using the limits of the user's understanding of the query as a basis for your action calculation. In the end there is a part that is unverifiable, as I mentioned above, but it is not relevant to forecasting behaviour, kind of like how you can't confirm that anyone but yourself is conscious (and the implications either way are irrelevant to human behaviour).

And that is usually the limit I hit with AI safety people: you can posit arbitrary deceptive abstractions at a lower level that have no predictive influence on the upper one and are unfalsifiable until they arbitrarily hit a failure mode at some undeterminable point in the future. You could append to general relativity a term that makes the universe collapse into a black hole in exactly 1 trillion years; there is no way to confirm it, and that's not how we do science, yet technically you can't rule out that this is how the universe happens to work. There is an irreducible risk here, and how much attention it gets is probably directly correlated with how neurotic one is. And since the stakes are infinite and the risk is non-zero, you do the math: that's enough fuel to build a lifetime of fantasies and justify just about any action. I think the least talked about point is that the criteria for trust depend just as much on the observer as on the observed.
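
Spelled out, the "do the math" step I'm describing is just the Pascal's-wager expected-value calculation (a sketch; p and L_finite are my own shorthand, not anyone's actual model):

```latex
% Pascal's-wager-style expected loss: any nonzero p makes the infinite
% term dominate, no matter how small p actually is.
E[\mathrm{loss}] = p \cdot \infty + (1 - p) \cdot L_{\mathrm{finite}} = \infty
\quad \text{for any } p > 0
```

Which is exactly why "infinite stakes times non-zero risk" can justify unbounded attention to any unfalsifiable scenario you like.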

By the way: yeah, I think so, but we will likely be ultra-precise on the first tries because of the stakes.

2

FeepingCreature t1_jefl3ya wrote

> By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.

Have you met people. The internet was trying to hook GPT-4 up to unprotected shells within a day of release.

> it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state

Sure, if I have successfully trained it to want to optimize for my sense of negative rather than its proxy for my proxy for my sense of negative. Also if my sense of negative matches my actual dispreference. Keep in mind that failure can look very similar to success at first.
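
To make the proxy-of-a-proxy point concrete, here's a toy sketch (Python, with made-up numbers and names purely for illustration):

```python
def true_preference(outcome: float) -> float:
    """What I actually care about (never directly observed in training)."""
    return -abs(outcome - 1.0)

def stated_preference(outcome: float) -> float:
    """My imperfect articulation of that preference (first proxy)."""
    return -abs(outcome - 1.1)

def learned_proxy(outcome: float) -> float:
    """The model's learned stand-in for my stated preference (proxy of a proxy)."""
    return -abs(outcome - 1.3)

candidates = [i / 10 for i in range(31)]

# The optimizer maximizes the objective it was actually trained on.
chosen = max(candidates, key=learned_proxy)

print(f"chosen outcome:        {chosen}")
print(f"learned-proxy score:   {learned_proxy(chosen):.2f}")      # looks perfect
print(f"stated-pref score:     {stated_preference(chosen):.2f}")  # still looks fine
print(f"true-preference score: {true_preference(chosen):.2f}")    # quietly worse
```

Each layer of proxying shifts the optimum a little, and the chosen outcome still scores well on everything you can easily check, which is the sense in which failure looks like success at first.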

> You can append to general relativity a term that would make the universe collapse into blackhole in exactly 1 trillion years, no way to confirm it either

Right, which is why we need to understand what the models are actually doing, not just train-and-hope.

We're not saying it's unknowable, we're saying what we're currently doing is in no way sufficient to know.

1

Sure_Cicada_4459 OP t1_jefuu0z wrote

-Yes, but GPT-4 wasn't public until they did extensive red teaming. They looked at the worst cases before letting it out. Not that GPT-4 can't cause any damage by itself, just not the kind people are freaked out about.

-That is a given under the arguments above: ASI assumes superhuman ability on any task and metric. I really think that if GPT-5 shows the same trend, that ease of alignment scales with intelligence, people should seriously update their p(doom).

-My argument boils down to this: the standard of sufficiency can only be satisfied to the degree that one can no longer observe failure modes; you can't satisfy it arbitrarily, just like you can't observe anything smaller than the Planck length. There is a finite resolution to this problem, whether it is limited by human cognition or by the infinite space of imaginable substructures. We obviously need more interpretability research, and there are recent trends like Reflexion, ILF and so on that will, over the long term, yield more insight into the behaviour of these systems, because you can work with "thoughts" in text form instead of inscrutable matrices. There will likely be some form of cognitive structures inspired by the human brain which will look more like our intuitive symbolic computations and allow us to measure these failure modes better. Misalignments at a lower level could still be possible, of course, but that doesn't say anything about the system as a whole; a misaligned subcomponent could even be load-bearing in some way, for example. That's why I think the only way to approach this is empirically, and AI is largely an empirical science, let's be real.
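
To illustrate what "working with thoughts in text form" can look like, here is a minimal Reflexion-style loop; `call_llm` and the prompt wording are placeholders I made up, not any particular API:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any text-in/text-out model call."""
    raise NotImplementedError

def reflexive_attempt(task: str, max_rounds: int = 3) -> str:
    """Generate, critique, and revise. Every intermediate 'thought' is plain
    text, so it can be logged and inspected rather than living only inside
    inscrutable weight matrices."""
    answer = call_llm(f"Task: {task}\nGive your best answer.")
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task: {task}\nAnswer: {answer}\n"
            "List concrete flaws, or reply DONE if there are none."
        )
        if critique.strip() == "DONE":
            break
        answer = call_llm(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nWrite an improved answer."
        )
    return answer
```

The useful property is that the critique is ordinary text you can log, audit, and intervene on, which is a foothold you don't get from raw weights.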

2