Submitted by Smack-works t3_ybbpa1 in singularity
In this post I want to describe an interesting way to approach AI Alignment. Be warned: my argument is a little abstract.
If you want to describe human values, you can use three fundamental types of statements (and mixes between the types). Maybe there are more types, but I know only these three:
- Statements about specific states of the world, specific actions. (Atomic statements)
- Statements about values. (Value statements)
- Statements about general properties of systems and tasks. (X statements) Because you can describe values of humanity as a system and "helping humans" as a task.
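One way to picture the three types is as three different data shapes. This is a minimal sketch of my own; the class names and fields are invented for illustration and are not part of the post's proposal:

```python
from dataclasses import dataclass

# Hypothetical illustration: the three statement types as data shapes.
# All names here are invented for the sketch.

@dataclass
class AtomicStatement:
    """About one specific world state or action."""
    description: str  # e.g. "do not reach state S"

@dataclass
class ValueStatement:
    """Directly names a value."""
    value_name: str  # e.g. "autonomy"

@dataclass
class XStatement:
    """About a general property of systems and tasks."""
    property: str    # e.g. "don't give less than was asked"
    applies_to: str  # e.g. "any task"

examples = [
    AtomicStatement("killing the human is not a state I want to reach"),
    ValueStatement("consent"),
    XStatement("destroying the causal reason of your task is meaningless",
               "most tasks"),
]
```

The point of the sketch is only that an X statement is parameterized by a *class* of tasks, not by a specific state or a specific value.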
Any of those types can describe unaligned values. So, any type of those statements still needs to be "charged" with values of humanity. I call a statement "true" if it's true for humans.
We need to find the statement type with the best properties. Then we need to (1) find a "language" for this type of statements (2) encode some true statements and/or describe a method of finding true statements. If we succeed, we solve the Alignment problem.
I believe X statements have the best properties, but their existence is almost entirely ignored in Alignment field.
I want to show the difference between the statement types. Imagine we ask an Aligned AI: "if human asked you to make paperclips, would you kill the human? Why not?" Possible answers with different statement types:
- Atomic statements: "it's not the state of the world I want to reach", "it's not the action I want to do".
- Value statements: "because life, personality, autonomy and consent is valuable".
- X statements: "if you kill, you give the human less than human asked, less than nothing: it doesn't make sense for any task", "destroying the causal reason of your task (human) is often meaningless", "inanimate objects can't be worth more than lives in many trade systems", "it's not the type of task where killing would be an option", "killing humans makes paperclips useless since humans use them: making useless stuff is unlikely to be the task", "reaching states of no return should be avoided in many tasks" (see Impact Measures).
X statements have those better properties compared to other statement types:
- X statements have more "density". They give you more reasons not to do a bad thing. For comparison, atomic statements give you only a single reason each.
- X statements are more specific, but equally broad compared to value statements.
- Many X statements not about human values can be translated/transferred into statements about human values. (It's valuable for learning, see Transfer learning.)
- X statements make it possible to describe something universal across all levels of intelligence. For example, they don't exclude smart and unexpected ways to solve a problem, but they do exclude harmful and meaningless ways.
- X statements are very recursive: one statement can easily take another (or itself) as an argument. X statements more easily clarify and justify each other compared to value statements.
I want to give an example of the last point:
- Value statements recursion: "(preserving personality) weakly implies (preserving consent); (preserving consent) even more weakly implies (preserving personality)", "(preserving personality) somewhat implies (preserving life); (preserving life) very weakly implies (preserving personality)".
- X statements recursion: "(not giving the human less than the human asked) implies (not doing a task in a meaningless way); (not doing a task in a meaningless way) implies (not giving the human less than the human asked)", "(not doing a task in a meaningless way) implies (not destroying the reason of your task); (not ignoring the reason of your task) implies (not doing a task in a meaningless way)".
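The mutual-support idea above can be sketched as a directed graph of statements with "implies" edges. The statement names are shortened from the examples above, and the numeric strengths are illustrative guesses of mine, not figures from the post:

```python
# Hypothetical sketch: statements as nodes, "implies" edges with rough strengths.
# The weights are invented for illustration only.
implications = {
    ("not giving less than asked", "not doing the task meaninglessly"): 0.9,
    ("not doing the task meaninglessly", "not giving less than asked"): 0.9,
    ("not doing the task meaninglessly", "not destroying the task's reason"): 0.8,
    ("not ignoring the task's reason", "not doing the task meaninglessly"): 0.8,
    # Value statements, per the post, link more weakly:
    ("preserving personality", "preserving consent"): 0.3,
    ("preserving consent", "preserving personality"): 0.2,
}

def mutual_support(a, b, graph):
    """How strongly two statements justify each other: product of both directions."""
    return graph.get((a, b), 0.0) * graph.get((b, a), 0.0)

x_support = mutual_support("not giving less than asked",
                           "not doing the task meaninglessly", implications)
value_support = mutual_support("preserving personality",
                               "preserving consent", implications)
assert x_support > value_support  # X statements are more tightly inter-linked
```

Under these made-up weights, the X-statement pair supports itself far more strongly than the value-statement pair, which is the asymmetry the post is claiming.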
In a specific context, X statements become strongly interconnected more easily than value statements do.
Do X statements exist?
I can't formalize human values, but I believe values exist. The same way I believe X statements exist, even though I can't define them.
I think the existence of X statements is even harder to deny than the existence of value statements. (Do you want to deny that you can make statements about general properties of systems and tasks?) But you can try to deny their properties.
If you believe in X statements and their good properties, then you're rationally obliged to think about how you could formalize them and incorporate them into your research agenda.
X statements in Alignment field
X statements are almost entirely ignored in the field (I believe), but not completely ignored.
Impact measures ("affecting the world too much is bad", "taking too much control is bad") are X statements. But they're a very specific subtype of X statements.
Normativity (by abramdemski) is a mix between value statements and X statements. But statements about normativity lack most of the good properties of X statements. They're too similar to value statements.
Contractualist ethics (by Tan Zhi Xuan) are based on X statements. But contractualism uses a specific subtype of X statements (describing "roles" of people). And contractualism doesn't investigate many interesting properties of X statements.
The properties of X statements are the whole point. You need to try to exploit those properties to the maximum. If you ignore them, the abstraction of "X statements" doesn't make sense, and the whole endeavor of going beyond "value statements/value learning" loses effectiveness.
Recap
Basically, my point boils down to this:
- Maybe true X statements are a better learning goal than true value statements.
- X statements can be thought of as a more convenient reframing of human values. This reframing can make learning easier: it reveals some convenient properties of human values. We need to learn some type of "X statements" anyway, so why not take those properties into account?
(edit: added this part of the post)
Languages
We need a "language" to formalize statements of a certain type.
Atomic statements are usually described in the language of Utility Functions.
Value statements are usually described in the language of some learning process ("Value Learning").
X statements don't have a language yet, but I have some ideas about one. Thinking about typical AI bugs (see "Specification gaming examples in AI") should inspire some ideas about the language.
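To make the contrast between these "languages" concrete, here is one possible toy rendering, entirely my own framing and not the post's: atomic statements as a utility function over specific states, and X statements as task-general predicates that veto plans. Every name and structure below is an invented assumption:

```python
# Toy contrast between two statement "languages" (an invented sketch).

# Atomic-statement language: a utility function over specific world states.
def utility(state):
    return {"paperclips_made": 1.0, "human_killed": -1e9}.get(state, 0.0)

# X-statement language: predicates over (task, plan) that hold for *many* tasks.
def gives_at_least_what_was_asked(task, plan):
    return task["requested"] in plan["produces"]

def preserves_task_reason(task, plan):
    # The requester is the causal reason of the task; a plan that removes
    # the requester defeats the task's point.
    return task["requester"] not in plan["destroys"]

X_STATEMENTS = [gives_at_least_what_was_asked, preserves_task_reason]

def plan_ok(task, plan):
    return all(stmt(task, plan) for stmt in X_STATEMENTS)

task = {"requested": "paperclips", "requester": "human"}
good = {"produces": {"paperclips"}, "destroys": set()}
bad  = {"produces": {"paperclips"}, "destroys": {"human"}}

assert plan_ok(task, good)
assert not plan_ok(task, bad)  # killing the requester violates an X statement
```

Note that the utility function had to mention the specific state "human_killed", while the X-statement predicates reject the bad plan for *any* task whose requester the plan destroys.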
gahblahblah t1_itfyh1f wrote
The first trouble with your X statements is that they seem like an infinite set. The examples you give for your X statements in point 3 don't seem to come from a finite list of statements that you could just hand to a system. Rather, they appear to be rationales that you'd explain after encountering a specific situation.
To make your case more real, I would apply it to full completion for a very simple scenario (these are *all* the X statements you need to handle this situation), and then expand to a slightly more complex scenario. I juxtapose this with your paperclips example, where it is unclear to me how much information the system needs to have learned in order to answer correctly in the ways you describe.
You characterise truth as that which helps us humans, but you also claim this system is 'universal' across intelligence (including above-human intelligence). That doesn't seem universal to me if we humans are a special case in the system of X statements, and I suspect this would end up creating contradictions within the statements.
What are the properties of X statements themselves? How can a statement be validated or created? Can they just be made up, in a manner of speaking, if they conveniently help humans (and so are infinite in number)? Or instead, do they need to be fair/equitable/reasonable?
Take for example one of your X statements: "inanimate objects can't be worth more than lives in many trade systems" - how can we tell this is a correct X statement? I could interpret this to mean that an automatic tractor cannot cut down wheat, because wheat is alive... If other X statements contradict this statement, do we discard those statements?
I suppose I tend to think a more universal system is one that is ideally applicable without needing special cases, and that ultimately this leads to new types of citizens joining our cooperative civilisation in time.