Sure_Cicada_4459 OP t1_jefuu0z wrote

-Yes, but GPT-4 wasn't public until they did extensive red teaming. They looked at the worst cases before letting it out; not that GPT-4 can't cause any damage by itself, just not the kind people are freaked out about.

-That is a given with the aforementioned arguments; ASI assumes superhuman ability on any task and metric. I really think that if GPT-5 shows this same trend, that ease of alignment scales with intelligence, people should seriously update their p(doom).

-My argument boils down to this: the standard of sufficiency can only be satisfied to the degree that one can no longer observe failure modes; you can't satisfy it arbitrarily, just like you can't observe anything smaller than the Planck length. There is a finite resolution to this problem, whether it is limited by human cognition or by the infinitely many imaginable substructures. We obviously need more interpretability research, and recent trends like Reflexion, ILF and so on will, over the long term, yield more insight into the behaviour of these systems, since you can work with "thoughts" in text form instead of inscrutable matrices. There will likely be some form of cognitive structure inspired by the human brain, which will look more like our intuitive symbolic computations and let us measure these failure modes better. Misalignments at the lower level could still be possible of course, but that doesn't say anything about the system as a whole; they could even be load-bearing in some way. That's why I think the only way to approach this is empirically, and AI is largely an empirical science, let's be real.


Sure_Cicada_4459 OP t1_jefepok wrote

With a sufficiently good world model, it will be aware of my level of precision of understanding given the context, it will be arbitrarily good at inferring intent, and it might actually warn me, because it is context-aware enough to say that an action will yield a net-negative outcome if I were to assess the future state. That might even be the most likely scenario if its forecasting ability and intent reading are vastly superior, so we wouldn't even have to live through the negative outcome to debug future states. You can't really have such a vastly superior world model without also using the limits of the user's understanding of the query as a basis for your action calculation. In the end, there is a part that is unverifiable, as I mentioned above, but it is not relevant to forecasting behaviour, kind of like how you can't confirm that anyone but yourself is conscious (and the implications either way are irrelevant to human behaviour).

And that is usually the limit I hit with AI safety people: you can build arbitrary deceiving abstractions at a sub-level that have no predictive influence on the upper one and are unfalsifiable until they arbitrarily hit a failure mode at some undeterminable future point. You could append to general relativity a term that makes the universe collapse into a black hole in exactly 1 trillion years; there is no way to confirm it, and that's not how we do science, yet technically you can't rule out that this is in fact how the universe happens to work. There is an irreducible risk here, and the level of attention it gets is likely directly correlated with how neurotic one is. And since the stakes are infinite and the risk is non-zero, you do the math: that's enough fuel to build a lifetime of fantasies and justify any action, really. I believe the least-talked-about point is that the criteria of trust depend just as much on the observer as on the observed.

By the way, yeah, I think so, but we will likely be ultra-precise on the first tries because of the stakes.


Sure_Cicada_4459 OP t1_jef5qx9 wrote

It's the difference between understanding and "simulating understanding": you can always point to lower-level processes and dismiss the abstract notions of "understanding", "following instructions", etc. They are shorthands, but a sufficiently close simulacrum would be indistinguishable from the "real" thing, because not understanding and simulating understanding to an insufficient degree look the same when they fail. If I am just completing learned patterns that simulate following instructions to such a high degree that no failure ever distinguishes this from "actually following instructions", then the lower-level patterns cease to be relevant to the description of the behaviour, and therefore to the forecasting of the behaviour. It's just adding more complexity with the same outcome; that is, it will reason from our instructions, hence my arguments above.

To your last point: yes, you'd have to find a set of statements that exhaustively filters out undesirable outcomes, but the only thing you have to get right on the first try is "don't kill, incapacitate, or brainwash everyone" plus "be transparent about your actions and their reasons, starting the logic chain from our query". If you ensure just that, which by my previous argument is trivial, you essentially debug continuously from there; there will inevitably be undesirable consequences or futures ahead, but the situation at least remains steerable. Even if we end up in a simulation, it is still steerable as long as the aforementioned is ensured. We just "debug" from there, with the certainty that actions are reversible and with more edge cases to add to our clauses. Like building any software, really.


Sure_Cicada_4459 OP t1_jeexhxg wrote

It will reason from your instructions; higher intelligence means higher fidelity to their intent. That's why killing everyone wouldn't advance its goals: it is a completely alien class of mind, divorced from evolution, whose drives are set directly by us. There is no winning; it's not playing the game of evolution like every lifeform you have ever met, which is why it is so hard to reason about this without projection.

Think about it this way: in the scenario mentioned above, naively implemented, its most deceptive, most misaligned yet still goal-achieving course of action is to deceive all your senses and put you in a simulation, where satisfying your goals is cheaper in terms of resource expenditure. But ruling that out would be as simple as adding a clause to your query. I am not saying it can't go wrong; I am saying there is a set of statements that, when interpreted with sufficient capability, eliminates these scenarios trivially.


Sure_Cicada_4459 OP t1_jeeqtzq wrote

The inability to verify is going to be almost inevitable as we go into ASI territory, since it is plausible that certain patterns can't be compressed into human-comprehensible territory, although I expect that the ability to summarize, explain, and so on will go hand in hand with increased capabilities, letting us grasp things that would very likely have been out of our reach otherwise.

Deception is not necessary for this, and in my eyes it has a dynamic similar to alignment, because the failure modes with intelligence are too similar. It is of course environment-dependent, but deception tends to be a short-term strategy that gives an advantage when actually accomplishing the task would cost more resources or wouldn't serve the agent's goals. A sufficiently intelligent AI would have a sufficiently accurate world model to forecast over the long term, including the probability of detection, the cost of keeping the lie coherent, and so on. That would also include modeling its further capability increases and the likelihood of achieving its other goals. Deceiving would just be really dumb; why would a god pretend? I get why animals do it in high-risk situations or for high short-term payoffs, but if you are virtually guaranteed the resources of the light cone, you have zero incentive to steer away from that. The resources it would take to keep us as happy pets wouldn't even be in the territory of a rounding error versus the risk of its deception being noticed. It feels like the difference in reward, for the AI itself, between being unaligned and being aligned is barely talked about; maybe because the scale is absurd, or because there is too much reasoning from a scarcity mindset? Idk.


Sure_Cicada_4459 OP t1_jeea5kf wrote

One thing I keep seeing is that people make a boatload of assumptions tainted by decades of sci-fi and outdated thought. Higher intelligence means a better understanding of human concepts and values, which means easier alignment. We can even see GPT-4 being better aligned than its predecessors because it actually understands better, a point the President of OpenAI has also made:

To get to Yud's conclusions you'd have to maximize one dimension of optimization ability while completely ignoring the many others that tend to calibrate human behaviour (reflection, reading intent, ...). It shows poor emotional intelligence, which is a common trait among the Silicon Valley types.


Sure_Cicada_4459 t1_jeace5e wrote

Actually, no. And this is still an underestimate, because predicting ten years of algorithmic advances in the field of AI is silly. That doesn't even account for distillation, more publicly available datasets and models, multi-LLM systems, and so on. There are so many dimensions along which this train is running that it makes you dizzy to think about, and it makes regulation look like nothing more than pure cope.


Sure_Cicada_4459 t1_jea3juc wrote

Good news: it's literally impossible. Even the assumption that it's feasible to track GPU accumulation and thereby crack down on training runs above a certain size is very brittle. Incentives for obfuscation aside, we are getting more and more efficient by the day, meaning anyone will soon be able to run GPT-n-level performance on their own hardware. Even many signatories acknowledge how futile it is, but they want to signal that something needs to be done, for whatever reasons (fill in your blanks).
Bad news: there is a non-trivial risk of this dynamic blowing up in our faces; I just don't think restrictions are the way to go.


Sure_Cicada_4459 t1_je5w1bh wrote

A spin-off project based on Reflexion; apparently GPT-4 gets a 20% improvement on coding tasks:

People fine-tuning Llama using this prompt structure, with much better results:

Someone already built an autonomous agent using feedback loops (not necessarily related to Reflexion):

It seems to yield performance improvements up to a point, obviously, but it's also a very basic prompt structure overall; one can imagine all kinds of "cognitive structures".
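The feedback-loop idea above boils down to generate, critique, retry. A minimal, model-agnostic sketch of that Reflexion-style loop (the `llm` callable, the `evaluate` checker, and the prompt wording are all hypothetical stand-ins, not any specific API):

```python
def reflexion_loop(llm, task, evaluate, max_iters=3):
    """Generate an answer; on failure, ask the model to self-critique and
    retry with the critique folded back into the prompt (Reflexion-style)."""
    memory = []  # accumulated self-reflections, kept as plain text
    answer = ""
    for _ in range(max_iters):
        # fold past reflections back into the prompt, if any exist
        prompt = task if not memory else (
            task + "\nPast reflections:\n" + "\n".join(memory)
        )
        answer = llm(prompt)
        ok, feedback = evaluate(answer)  # e.g. unit tests for coding tasks
        if ok:
            return answer
        # the "thoughts" stay human-readable text, not inscrutable matrices
        memory.append(llm("The attempt failed: " + feedback + "\nReflect on why."))
    return answer  # best effort after the iteration budget is spent
```

In real use, `evaluate` would be something mechanical like a test suite, so the critique signal doesn't depend on the model grading itself.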


Sure_Cicada_4459 t1_je4fln6 wrote

It reeks of sour grapes. Not only are many of the signatures fake, which puts this at best into shady-af territory, but there is literally zero workable plan for after the 6 months, hell, even during them. There are no criteria for what counts as "enough" pause, or for who decides. And that ignores that PAUSING DOESN'T WORK: there are all kinds of open-source models out there, and the tech is starting to move away from larger = better. It's FOMO + a desperate power grab + neurotic, unfalsifiable fears. I am not saying x-risk is 0, but drastic action needs commensurate evidence. I get that tail risks are hard to gather evidence for in advance, but we have seen so many ridiculous claims of misalignment, like people coaxing ChatGPT or Bing into no-no talk and claiming "It's aggressively misaligned" while at the very same time saying "It's hallucinating and doesn't understand anything about reality". Everything about this signals motivated reasoning, fear of obsolescence, and projection of one's own demons onto a completely alien class of mind.


Sure_Cicada_4459 t1_jdzyefd wrote

We have different timelines, it seems, hence why "you will be fine in the next few decades", which I interpret as "you will be able to do a meaningful economic task in some new niche", seems far-fetched to me. My thought process is that the span of tasks this covers is gigantic and would collapse most meaningful cognitive work into busywork, including scientists, education, IT, psychology, robotics, ...

I am not saying we will have AGI tomorrow; I am saying we will have AGI faster than any cognitive worker can realistically and sustainably pivot professions, or than anyone can earn a degree. It is also worth pointing out that the cognitive is the bottleneck on the mechanical. Even setting aside that solving cognitive scarcity would reduce the construction of efficient, cheap, and useful robots to a matter of iteration and prompting, intelligently piloting even a badly designed, limited robot is much easier and yields much more useful applications than, say, a dumb AI piloting a hyper-advanced fighter jet. That in turn feeds back into how forgiving and cheap your robot designs can be, and so on. And that doesn't even account for the change in monetary incentives, since this will attract massively more investment than there is now; breakthroughs and incentives evolve jointly, after all.

GPT-4 runs on big servers and yet delivers reliable service to millions. I don't think this will be a meaningful bottleneck, at least not one that should set your expectations for the next few decades to anything but "my niche has a very limited shelf life, and adaptation stretches plausibility rather than willingness or ability."


Sure_Cicada_4459 t1_jdzoc3j wrote

Not if AI learns to "logically and complete something complex by breaking it down into smaller tasks" and to "keep learning new things and adapting to change". That's the point: the fact that you can run fast is irrelevant if the treadmill you are running on is accelerating at an increasing rate. The lesson people should really have learned by now is that every cognitive feat seems replicable; we are all just benchmarks, and we know what tends to happen to those lately.