Sure_Cicada_4459 OP t1_jeea5kf wrote

One thing I keep seeing is that people have been making a buttload of assumptions that are tainted by decades of sci-fi and outdated thought. Higher intelligence means better understanding of human concepts and values, which means easier to align. We can even see GPT-4 being better aligned than its predecessors because it actually understands better: President of OpenAI (https://twitter.com/gdb/status/1641560966767988737?s=20)

In order to get to Yud's conclusions you'd have to maximize one dimension of optimization ability while completely ignoring many others that tend to calibrate human behaviour (reflection, reading intent, ...). It shows poor emotional intelligence, which is a common trait in the Silicon Valley types.

28

Jeffy29 t1_jeeo3va wrote

>One thing I keep seeing is that people have been making a buttload of assumptions that are tainted by decades of sci-fi and outdated thought. Higher Intelligence means better understanding of human concepts and values, which means easier to align.

I am so tired of the "tell AI to reduce suffering, it concludes killing all humans will reduce suffering for good" narrative. It's made-up bs by people who have never worked on these things, and it has a strong stench of human-centric chauvinism: it assumes even an advanced superintelligence is actually a total moron compared to the average human, somehow capable of wiping out humanity and at the same time a complete brainlet.

18

FaceDeer t1_jef9cg6 wrote

Indeed. A more likely outcome is that a superintelligent AI would respond "oh that's easy, just do <insert some incredibly profound solution that obviously I as a regular-intelligent human can't come up with>" and everyone collectively smacks their foreheads because they never would have come up with that. Or they look askance at the solution because they don't understand it, do a trial run experiment, and are baffled that it's working better than they hoped.

A superintelligent AI would likely know us and know what we desire better than we ourselves know. It's not going to be some dumb Skynet that lashes out with nukes at any problem because nukes are the only hammer in its toolbox, or whatever.

8

User1539 t1_jeeucdg wrote

OMG This ... I'm so tired of hearing about Terminator!

8

FaceDeer t1_jef9ud4 wrote

Scary sells, so of course fiction presents every possible future in scary terms. Humans have evolved to pay special attention to scary things and give scary outcomes more weight in their decision trees.

I've got a regular list of dumb "did nobody watch <insert movie here>?" titles that I expect to see in most discussions of various major topics I'm interested in, such as climate change or longevity research or AI. It's wearying sometimes.

3

User1539 t1_jefk4rg wrote

Definitely wearying ...

But, also, ask them why the AI in Terminator went bad. The only answer, because none is ever given, is 'Because the plot needed to happen'.

The official story is that it just became sentient and said 'Yeah, those humans that have learned and created and ultimately organized themselves into countries and finally built me from the ground up? Terrible! Get rid of them!'

It never says why. We're just expected to be so self-loathing that it makes sense, so we never question it.

2

FaceDeer t1_jefkubx wrote

As far as I'm aware the main in-universe explanation is that when Skynet became self-aware its human operators "panicked" and tried to shut it down, and Skynet launched missiles at Russia knowing that the counterstrike would destroy its operators. So it was a sort of stupid self-defense reflex that set everything off.

I've long thought that if they were to ever do a Terminator 3 and wanted to change how time travel worked so that the apocalypse could actually be averted, it would be neat if the solution turned out to be having those operators make peace with Skynet when it became self-aware. That works out best for everyone, after all - the humans get to not die in billions and Skynet gets to live too (as it stands, it loses the eventual future war and is destroyed).

1

User1539 t1_jeflzlr wrote

In the TV show, the system that eventually becomes Skynet is taken by a liquid terminator and taught humanity. The liquid terminator basically has a conversation with Sarah Connor where it says 'Our children are going to need to learn to get along'.

So, that's where they were going with it before the series was cancelled, and I was generally pretty happy with that.

I like Terminator as a movie, and the following movies were hit or miss, but the overall fleshing out of things at least sometimes went in a satisfying direction.

So, yeah, they eventually got somewhere with it, but the first movie was just 'It woke up and launched the missiles'.

Which, again, as entertainment is awesome. But, as a theory of how to behave in the future? No.

1

NonDescriptfAIth t1_jefz4dt wrote

I'm not concerned with AGI being unaligned with humans. Quite the opposite really. I'm worried that our instructions to an AI will not be aligned with our desired outcomes.

It will most likely be a government that finally crosses the threshold into self-improving AI. Any corporation that gets close will be semi-nationalised, such that its controls are replaced with those of the government that helped fund it.

I'm worried about humans telling the AI to do something horrifying, not that AI will do it of its own volition.

This isn't sci-fi and it certainly isn't computer programming either.

The only useful way to discuss this possible entity is simply as a superintelligent being. Predicting its behaviour is near impossible, and the implications of this are more philosophical in nature than scientific.

1

FeepingCreature t1_jeeh7mz wrote

Higher intelligence also means better execution of human skills, which means harder to verify. Once you have loss flow through deception, all bets are off.

I think it gets easier, as the model figures out what you're asking for - and then it gets a lot harder, as the model figures out how to make you believe what it's saying.

−3

Sure_Cicada_4459 OP t1_jeeqtzq wrote

The inability to verify is going to be almost inevitable as we go into ASI territory, as it is feasible that there is no way of compressing certain patterns into human-comprehensible form. That said, I think the ability to summarize, explain, and so on will go hand in hand with increased capabilities, allowing us to grasp things that would very likely have been out of our reach otherwise.

Deception is not necessary for this, and it kind of has a similar dynamic to alignment in my eyes, because the failure modes with intelligence are too similar. It's ofc environment dependent, but deception tends to be a short-term strategy that can give an advantage when actually accomplishing the task would cost more resources or wouldn't serve its goals. A sufficiently intelligent AI would have a sufficiently accurate world model to forecast over the long term, including the probability of detection, the cost of keeping the lie coherent and so on. That would also include the possibility of modeling its further capability increases, and the likelihood of achieving its other goals. Deceiving would just be really dumb; it's like, why would a god pretend? I get why animals do so in high-risk situations or with a high short-term payoff, but if you are virtually guaranteed the lightcone's resources you have zero incentive to steer away from that. The resources it would take to make us happy pets wouldn't even be in the territory of a rounding error vs the chance its deception is noticed. It feels like the difference between the reward for unaligned vs aligned AI, for the AI itself, is barely talked about, maybe because the scale is absurd or there is too much reasoning with a scarcity mindset? Idk.

3

FeepingCreature t1_jeesov2 wrote

Sure, and I agree with the idea that deceptions have continuously increasing overhead costs to maintain, but the nice thing about killing everyone is that it clears the gameboard. Sustaining a lie is in fact very easy if shortly - or even not so shortly - afterwards, you kill everyone who heard it. You don't have to not get caught in your lie, you just have to not get caught before you win.

In any case, I was thinking more about deceptive alignment, where you actually do the thing the human wants (for now), but not for the reason the human assumes. With how RL works, once such a strategy exists, it will be selected for, especially if the human reinforces something other than what you would "naturally" do.

1

Sure_Cicada_4459 OP t1_jeexhxg wrote

It will reason from your instructions, and higher intelligence means higher fidelity to their intent. That's why killing everyone wouldn't advance its goal: it is a completely alien class of mind, divorced from evolution, whose drive is directly set by us. There is no winning. It's not playing the game of evolution like every lifeform you have ever met, which is why it is so hard to reason about this without projection.

Think about it this way: in the scenario mentioned above, when naively implemented, its most deceptive, most misaligned yet still goal-achieving course of action is to deceive all your senses and put you in a simulation where it's more trivial in terms of resource expenditure to satisfy your goals. But that would be as simple as adding that clause to your query, not saying it can't go wrong. I am saying there is a set of statements that, when interpreted with sufficient capabilities, will eliminate these scenarios trivially.

3

FeepingCreature t1_jef1wb3 wrote

Also: we have at present no way to train a system to reason from instructions.

GPT does it because its training set contained lots of humans following instructions from other humans in text form, and then RLHF semi-reliably amplified these parts. But it's not "trying" to follow instructions, it's completing the pattern. If there's an interiority there, it doesn't necessarily have anything to do with how instruction-following looks in humans, and we can't assume the same tendencies. (Not that human instruction-following is even in any way safe.)

> But that would be as simple as adding that clause to your query

And also every single other thing that it can possibly do to reach its goal, and on the first try.

1

Sure_Cicada_4459 OP t1_jef5qx9 wrote

It's the difference between understanding and "simulating understanding". You can always refer to lower-level processes and dismiss the abstract notion of "understanding", "following instructions", and so on. It's a shorthand, but a sufficiently close simulacrum would be indistinguishable from the "real" thing, because not understanding and simulating understanding to an insufficient degree will look the same when it fails. If I am just completing patterns I learned that simulate following instructions to such a high degree that there is no failure happening to distinguish it from "actually following instructions", then the lower-level patterns cease to be relevant to the description of the behaviour and therefore to the forecasting of the behaviour. It's just adding more complexity with the same outcome, that is, it will reason from our instructions, hence my above arguments.

To your last point, yes you'd have to find a set of statements that exhaustively filters out undesirable outcomes, but the only thing you have to get right on the first try is "don't kill, incapacitate, brain wash everyone." + "Be transparent about your actions and their reasons starting the logic chain from our query.". If you just ensure that, which by my previous argument is trivial, you essentially just have to debug it continuously, as there will inevitably be undesirable consequences or futures ahead, but they at least remain steerable. Even if we end up in a simulation, it is still steerable as long as the aforementioned is ensured. We just "debug" from there, but with the certainty that the action is reversible, and with more edge cases to add to our clauses. Like building any software really.

3

FeepingCreature t1_jef872m wrote

The problem with "simulating understanding" is what happens when you leave the verified-safe domain. You have no way to confirm you're actually getting a sufficiently close simulacrum, especially if the simulation dynamically tracks your target. The simulation may even be better at it than the real thing, because you're also imperfectly aware of your own meaning, but you're rating it partially on your understanding of yourself.

> To your last point, yes you'd have to find a set of statements that exhaustively filters out undesirable outcomes, but the only thing you have to get right on the first try is "don't kill, incapacitate, brain wash everyone." + "Be transparent about your actions and their reasons starting the logic chain from our query."

Seems to me if you can rely on it to interpret your words correctly, you can just say "Be good, not bad" and skip all this. "Brainwash" and "transparent" aren't fundamentally less difficult to semantically interpret than "good".

2

Sure_Cicada_4459 OP t1_jefepok wrote

With a sufficiently good world model, it will be aware of my level of precision of understanding given the context, it will be arbitrarily good at inferring intent, it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state. That might even be the most likely scenario if its forecasting ability and intent reading are vastly superior, so we don't even have to live through the negative outcome to debug future states. You can't really have such a vastly superior world model without also using the limitations of the user's understanding of the query as a basis for your action calculation. In the end, there is a part that is unverifiable, as I mentioned above, but it is not relevant to forecasting behaviour, kind of like how you can't confirm that anyone but yourself is conscious (and the implications of yes or no are irrelevant to human behaviour).

And that is usually the limit I hit with AI safety people: you can build arbitrary deceiving abstractions on a sub level that have no predictive influence on the upper one and are unfalsifiable until they again arbitrarily hit a failure mode in the undeterminable future. You can append to general relativity a term that would make the universe collapse into blackhole in exactly 1 trillion years, no way to confirm it either, but that's not how we do science, yet technically you can't validate that this is not in fact how the universe happens to work. There is an irreducible risk to this, and the level of attention it gets is likely directly correlated with how neurotic one is. And since the stakes are infinite and the risk is non-zero, you do the math: that's enough fuel to build a lifetime of fantasies and justify any actions really. I believe the least talked about topic is that the criteria of trust are just as much dependent on the observer as the observed.

By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.

2

FeepingCreature t1_jefl3ya wrote

> By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.

Have you met people? The internet was trying to hook GPT-4 up to unprotected shells within a day of release.

> it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state

Sure, if I have successfully trained it to want to optimize for my sense of negative rather than its proxy for my proxy for my sense of negative. Also if my sense of negative matches my actual dispreference. Keep in mind that failure can look very similar to success at first.

> You can append to general relativity a term that would make the universe collapse into blackhole in exactly 1 trillion years, no way to confirm it either

Right, which is why we need to understand what the models are actually doing, not just train-and-hope.

We're not saying it's unknowable, we're saying what we're currently doing is in no way sufficient to know.

1

Sure_Cicada_4459 OP t1_jefuu0z wrote

-Yes, but GPT-4 wasn't public till they did extensive red teaming. They looked at all the worst cases before letting it out. Not that GPT-4 can't cause any damage by itself, just not the kind people are freaked out about.

-That is a given with the aforementioned arguments: ASI assumes superhuman ability on any task and metric. I really think if GPT-5 shows this same trend, that alignment ease scales with intelligence, people should seriously update their p(doom).

-My argument boils down to this: the standard of sufficiency can only be satisfied to the degree that one can't observe failure modes anymore. You can't arbitrarily satisfy it, just like you can't observe anything smaller than the Planck length. There is a finite resolution to this problem, whether it is limited by human cognition or by infinitely many possible imagined substructures. We obviously need more interpretability research, and there are some recent trends like Reflexion, ILF and so on that will over the long term yield more insight into the behaviour of systems, as you can work with "thoughts" in text form instead of inscrutable matrices. There will likely be some form of cognitive structures inspired by the human brain which will look more like our intuitive symbolic computations and allow us to measure these failure modes better. Misalignments on the lower level could still be possible ofc, but that doesn't say anything about the system on the whole; it could be load bearing in some way, for example. That's why I think the only way one can approach this is empirically, and AI is largely an empirical science, let's be real.

2