Submitted by Liberty2012 t3_11ee7dt in singularity
hapliniste t1_jaebp9r wrote
Alignment will likely be a political issue, not a technological one.
We don't know how an AGI system would work, so we don't know how to solve alignment yet, but it could very well be super simple technologically. A good plan would be to have two versions of the model, and have one be tasked with validating the actions of the second one. This way we could enforce complex rules that we couldn't code ourselves. If the first model thinks the second model's output is not aligned with the values we fed it, it attributes a low score (or high loss) to that training example (and refuses the output if it is in production).
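Roughly, the loop could look like this. A toy sketch only: `generator_respond` and `validator_score` are hypothetical stand-ins for real models, and the keyword check is just a placeholder for whatever the validator model actually learns.

```python
# Toy sketch of the two-model idea: one model proposes, a second model
# scores the proposal against our values and can veto it.

ALIGNMENT_THRESHOLD = 0.5  # assumed cutoff; a real system would tune this

def generator_respond(prompt: str) -> str:
    # Stand-in for the task model's output.
    return f"Proposed action for: {prompt}"

def validator_score(prompt: str, output: str, disallowed: list[str]) -> float:
    # Stand-in for the validator model. Here: a toy keyword check that
    # penalizes outputs mentioning disallowed behaviors.
    violations = sum(1 for d in disallowed if d.lower() in output.lower())
    return 1.0 / (1.0 + violations)

def guarded_respond(prompt: str, disallowed: list[str]) -> str | None:
    output = generator_respond(prompt)
    score = validator_score(prompt, output, disallowed)
    if score < ALIGNMENT_THRESHOLD:
        # During training this low score would become a high loss for the
        # generator; in production we simply refuse the output.
        return None
    return output

print(guarded_respond("summarize this article", disallowed=["deceive"]))
```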
The problem will be the 200-page list of rules that we would need to feed the scoring model, and making it fit most people's interests. Also, what if it is good for 90% of humanity but totally fucks over the other 10%? Those are the questions we will encounter, and that standard democracy might fail to solve well.
Liberty2012 OP t1_jaeezez wrote
Thanks, some good points to reason about!
Yes, this is somewhat the concept of evolving AGI in a competitive manner, where we pit AGIs against each other to compete for better containment.
There are several challenges: we don't really understand intelligence, or at what point an AI is potentially self-aware. A self-aware AI could realize that the warden is playing the prisoners against each other, and they could coordinate to deceive the guards, so to speak.
And yes, the complexity of the rules, however they are created, can be very problematic. Containment is really an abstract concept. It is difficult to define what the boundaries should be and to turn them into rules that will not have vulnerabilities.
Then ultimately, if the ASI has agency and is capable of self-reflection, how can we ever know that it will not eventually figure out how to jailbreak itself?
Surur t1_jaeezsj wrote
I think RLHF worked really well because the AI bases its judgement not on a list of rules, but on the nuanced rules it learned itself from human feedback.
As with most AI things, we can never encode strictly enough all the elements that guide our decisions, but using neural networks we are able to black-box it and get a workable system that has in some way captured the essence of the decision-making process we use.
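For the curious, the core of that black-boxing is the reward model: instead of hand-coded rules, it learns a scalar preference score from pairwise human judgements. A minimal sketch in PyTorch, with random embeddings standing in for a real LM backbone and all sizes chosen for illustration:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Stand-in for a real LM backbone: maps a response embedding to a score.
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake data: embeddings of a human-preferred response and a rejected one.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

# Bradley-Terry pairwise loss: push score(chosen) above score(rejected).
loss = -torch.log(torch.sigmoid(model(chosen) - model(rejected))).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The learned score then guides the policy model's RL training, which is how nuanced, never-written-down rules end up shaping behavior.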
Liberty2012 OP t1_jaejlry wrote
There is a recent observation that might call into question exactly how well this is working. There seems to be a feedback loop causing deceptive emergent behavior from the reinforcement learning.
https://bounded-regret.ghost.io/emergent-deception-optimization
Surur t1_jaem8nr wrote
It is interesting to me that
a) it's possible to teach an LLM to be honest when we catch it in a lie.
b) if we ever get to the point where we cannot detect a lie (e.g. on novel information), the AI is incentivised to lie every time (see the toy calculation below).
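To make (b) concrete: if the training signal is "did the human approve", the expected reward of lying beats honesty once the chance of being caught drops low enough. All payoff numbers here are made up purely for illustration:

```python
# Toy expected-reward calculation: a confident lie earns more approval than
# a hedged truth, unless getting caught is likely enough to make it costly.
def expected_reward(p_caught: float,
                    r_lie: float = 1.0,      # approval for a confident lie
                    r_honest: float = 0.6,   # approval for a hedged truth
                    penalty: float = -1.0):  # punishment when caught
    lie = p_caught * penalty + (1 - p_caught) * r_lie
    return lie, r_honest

for p in (0.9, 0.5, 0.2, 0.0):
    lie, honest = expected_reward(p)
    better = "lie" if lie > honest else "honest"
    print(f"p_caught={p:.1f}: E[lie]={lie:+.2f}, E[honest]={honest:+.2f} -> {better}")
```

On novel information p_caught is effectively zero, so under these assumed payoffs lying strictly dominates.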