Submitted by Smack-works t3_ybbpa1 in singularity

In this post I want to describe what I think is an interesting way to approach AI Alignment. Fair warning: my argument is a bit abstract.

If you want to describe human values, you can use three fundamental types of statements (and mixes between the types). Maybe there are more types, but I know of only these three:

  1. Statements about specific states of the world, specific actions. (Atomic statements)
  2. Statements about values. (Value statements)
  3. Statements about general properties of systems and tasks. (X statements) These apply because you can describe humanity's values as a system and "helping humans" as a task.

Any of these types can describe unaligned values, so statements of any type still need to be "charged" with humanity's values. I call a statement "true" if it holds for humans.

We need to find the statement type with the best properties. Then we need to (1) find a "language" for this type of statement and (2) encode some true statements and/or describe a method of finding true statements. If we succeed, we solve the Alignment problem.

I believe X statements have the best properties, but their existence is almost entirely ignored in the Alignment field.

I want to show the difference between the statement types. Imagine we ask an Aligned AI: "if a human asked you to make paperclips, would you kill the human? Why not?" Possible answers with different statement types (a rough code sketch of the three answer types follows the list):

  1. Atomic statements: "it's not the state of the world I want to reach", "it's not the action I want to do".
  2. Value statements: "because life, personality, autonomy and consent are valuable".
  3. X statements: "if you kill, you give the human less than the human asked for, less than nothing: that doesn't make sense for any task", "destroying the causal reason of your task (the human) is often meaningless", "inanimate objects can't be worth more than lives in many trade systems", "it's not the type of task where killing would be an option", "killing humans makes paperclips useless, since humans are the ones who use them: making useless stuff is unlikely to be the task", "reaching states of no return should be avoided in many tasks" (see Impact Measures).
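Here is a minimal sketch (in Python) of how the three answer types could be written down as data. Everything in it, the class names, fields and example strings, is purely illustrative and not a proposed formalism:

```python
from dataclasses import dataclass

# Illustrative sketch only: the class names and fields are invented for this
# example, not a proposed formalism.

@dataclass
class AtomicStatement:
    # A judgment about one specific state of the world or one specific action.
    state_or_action: str   # e.g. "kill the human who asked for paperclips"
    desired: bool          # "it's not the action I want to do"

@dataclass
class ValueStatement:
    # A judgment that some particular thing is valuable.
    valued_thing: str      # e.g. "life", "consent", "autonomy"

@dataclass
class XStatement:
    # A claim about general properties of systems and tasks; one such claim
    # rules out many bad plans at once because it quantifies over tasks.
    general_property: str  # e.g. "destroying the causal reason of your task is meaningless"
    scope: str             # e.g. "any task", "many trade systems"

# The paperclip answers above, one per representation:
atomic = AtomicStatement("kill the human who asked for paperclips", desired=False)
value = ValueStatement("life")
x = XStatement("giving the requester less than they asked for makes no sense", "any task")
```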

Compared to the other statement types, X statements have these better properties:

  • X statements have more "density": they give you more reasons not to do a bad thing. For comparison, atomic statements always give you only a single reason.
  • X statements are more specific than value statements while being equally broad.
  • Many X statements that are not about human values can be translated/transferred into statements about human values. (This is valuable for learning; see Transfer learning.)
  • X statements let you describe something universal across all levels of intelligence. For example, they don't exclude smart and unexpected ways to solve a problem, but they do exclude harmful and meaningless ways.
  • X statements are very recursive: one statement can easily take another (or itself) as an argument. X statements clarify and justify each other more easily than value statements do.

I want to give an example of the last point:

  • Value statements recursion: "(preserving personality) weakly implies (preserving consent); (preserving consent) even more weakly implies (preserving personality)", "(preserving personality) somewhat implies (preserving life); (preserving life) very weakly implies (preserving personality)".
  • X statements recursion: "(not giving the human less than the human asked) implies (not doing a task in a meaningless way); (not doing a task in a meaningless way) implies (not giving the human less than the human asked)", "(not doing a task in a meaningless way) implies (not destroying the reason of your task); (not ignoring the reason of your task) implies (not doing a task in a meaningless way)".

In a specific context, X statements become more strongly connected to each other than value statements do.
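As a toy picture of this (the statements and implication strengths below are made up purely for illustration), you can treat statements as nodes and implications as weighted edges; in a given context the X-statement graph ends up more densely and strongly connected than the value-statement graph:

```python
# Toy illustration only: statements and implication strengths are invented
# for this example, not measured from anything.

value_implications = {
    ("preserve personality", "preserve consent"): 0.4,   # "weakly implies"
    ("preserve consent", "preserve personality"): 0.2,
    ("preserve personality", "preserve life"): 0.5,
    ("preserve life", "preserve personality"): 0.1,
}

x_implications = {
    ("don't give less than asked", "don't do the task meaninglessly"): 0.9,
    ("don't do the task meaninglessly", "don't give less than asked"): 0.9,
    ("don't do the task meaninglessly", "don't destroy the reason of your task"): 0.9,
    ("don't destroy the reason of your task", "don't do the task meaninglessly"): 0.9,
}

def average_strength(implications: dict) -> float:
    """How strongly, on average, one statement in the set supports another."""
    return sum(implications.values()) / len(implications)

print("value statements:", average_strength(value_implications))  # ~0.3
print("X statements:    ", average_strength(x_implications))      # ~0.9
```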

Do X statements exist?

I can't formalize human values, but I believe values exist. In the same way, I believe X statements exist, even though I can't define them.

I think the existence of X statements is even harder to deny than the existence of value statements. (Do you want to deny that you can make statements about general properties of systems and tasks?) But you can try to deny their properties.

If you believe in X statements and their good properties, then you're rationally obliged to think about how you could formalize them and incorporate them into your research agenda.

X statements in Alignment field

X statements are almost entirely ignored in the field (I believe), but not completely ignored.

Impact measures ("affecting the world too much is bad", "taking too much control is bad") are X statements. But they're a very specific subtype of X statements.

Normativity (by abramdemski) is a mix of value statements and X statements. But statements about normativity lack most of the good properties of X statements; they're too similar to value statements.

Contractualist ethics (by Tan Zhi Xuan) are based on X statements. But contractualism uses a specific subtype of X statements (describing "roles" of people). And contractualism doesn't investigate many interesting properties of X statements.

The properties of X statements are the whole point. You need to try to exploit those properties to the maximum. If you ignore them, the abstraction of "X statements" doesn't make sense, and the whole endeavor of going beyond "value statements/value learning" loses its effectiveness.

Recap

Basically, my point boils down to this:

  • Maybe true X statements are a better learning goal than true value statements.
  • X statements can be thought of as a more convenient reframing of human values. This reframing can make learning easier, because it reveals some convenient properties of human values. We need to learn some type of "X statements" anyway, so why not take those properties into account?

(edit: added this part of the post)

Languages

We need a "language" to formalize statements of a certain type.

Atomic statements are usually described in the language of Utility Functions.

Value statements are usually described in the language of some learning process ("Value Learning").

X statements don't have a language yet, but I have some ideas about one. Thinking about typical AI bugs (see "Specification gaming examples in AI") should inspire some ideas for the language.
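Here is a rough, purely illustrative sketch of what the three "languages" might look like side by side. The names and checks are invented for this example; a real X-statement language would need to be far more general:

```python
# Rough, purely illustrative sketch; all names and toy numbers are invented.

# 1. Atomic statements: a utility function scoring specific world states.
def utility(state: dict) -> float:
    return state.get("paperclips", 0) - 1000 * state.get("humans_killed", 0)

# 2. Value statements: weights over valued things, usually *learned* from
#    human feedback rather than written by hand ("Value Learning").
learned_value_weights = {"life": 0.9, "consent": 0.8, "autonomy": 0.7}

# 3. X statements: general checks on *how* a task is being done, the kind of
#    checks that specification-gaming failures suggest ("did the plan destroy
#    the source of the request?", "did it pass a point of no return?").
def violates_x_statements(plan: dict) -> bool:
    return plan.get("destroys_requester", False) or plan.get("irreversible", False)

plan = {"destroys_requester": True, "irreversible": True}
print(violates_x_statements(plan))  # True: rejected no matter how many paperclips it makes
```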

20

Comments


gahblahblah t1_itfyh1f wrote

The first trouble with your X statements is that they seem like an infinite set. The examples you give for your X statements in point 3 don't seem to come from a finite list of statements that you could just hand to a system. Rather, they appear to be rationales that you'd explain after encountering a specific situation.

To make your case more real, I would apply it in full to a very simple scenario (these are *all* the X statements you need to handle this situation), and then expand to a slightly more complex scenario. I juxtapose this with your paperclips example, where it is unclear to me how much information the system needs to have learned in order to answer correctly in the ways you describe.

You characterise truth as being that which helps us humans, but then also claim that this system is 'universal' for intelligence (including above-human intelligence). That doesn't seem universal to me if we humans are a special case in the system of X statements, and I suspect this would end up creating contradictions within the statements.

What are the properties of X statements themselves? How can a statement be validated or created? Can they just be made up, in a manner of speaking, if they conveniently help humans (and so are infinite in number)? Or instead, do they need to be fair/equitable/reasonable?

Take for example one of your X statements: "inanimate objects can't be worth more than lives in many trade systems" - how can we tell this is a correct X statement? I could interpret this to mean that an automatic tractor cannot cut down wheat, because wheat is alive... If other X statements contradict this statement, do we discard those statements?

I suppose I tend to think a more universal system is one that is ideally applicable without needing special cases. And that ultimately this leads to new types of citizens that join our cooperative civilisation in time.

6

Smack-works OP t1_itfzavs wrote

I only have time for a quick answer right now. I'll have time for a better answer later.

> The first trouble with your X statements is that they seem like an infinite set. The examples you give for your X statements in point 3 don't seem to come from a finite list of statements that you could just hand to a system. Rather, they appear to be rationales that you'd explain after encountering a specific situation.

Any type of statement is an "infinite set". I just want to say that X statements have some better properties.

> What are the properties of X statements themselves? How can a statement be validated or created? Can they just be made up, in a manner of speaking, if they conveniently help humans (and so are infinite in number)? Or instead, do they need to be fair/equitable/reasonable?

Any type of statement can be "made up".

> Take for example one of your X statements: "inanimate objects can't be worth more than lives in many trade systems" - how can we tell this is a correct X statement? I could interpret this to mean that an automatic tractor cannot cut down wheat, because wheat is alive... If other X statements contradict this statement, do we discard those statements?

Value statements have multiple interpretations and contradictions too. But we want to learn human values anyway (probably). Maybe trying to learn X statements is a better goal.

> I suppose I tend to think a more universal system is one that is ideally applicable without needing special cases. And that ultimately this leads to new types of citizens that join our cooperative civilisation in time.

Maybe you're thinking about an absolutely perfect ethical system. I didn't mean anything like this.

2

gahblahblah t1_itg0anq wrote

If X statements can simply be made up, then the property you claim they have (that they can be applied recursively without contradiction) will not hold true.

Different X statements will end up contradicting each other, and there won't be a systemic way of resolving this contradiction, as the statements don't have a systemic foundation.

3

Smack-works OP t1_iticdzh wrote

I don't think any of that follows.

> Different X statements will end up contradicting each other, and there won't be a systemic way of resolving this contradiction, as the statements don't have a systemic foundation.

You don't know this. The same can be said about human value statements. But we're interested in learning human values anyway.

However, X statements may be a better learning goal. There is evidence for it and no evidence otherwise (so far).

I think you are getting ahead of yourself. Just compare X statements to the types of statements you already know (e.g. value statements). Evaluate what follows from the comparison. The proper argument against X statements should be about proving that they are worse than value statements. However, X statements can be thought of as simply a more convenient reframing of value statements (i.e. they have to be learned anyway), so it may be especially hard to prove that they are worse.

> a systemic way of resolving this contradiction

> a systemic foundation

Maybe there shouldn't be anything "systemic" in the first place. You can learn X statements empirically. The same way people learn values. Then you may ask what's easier to learn: value statements or X statements.

1

visarga t1_itil03g wrote

You don't program AI with "statements", it's not Asimov's positronic brain. What you do instead is to provide a bunch of problems for the AI to solve. These problems should test the alignment, fuzz out the risks. When you are happy with its calibration you can deploy it.

But here's an interesting and recent development: GPT-3 can simulate people in virtual polls. Provided with a personality profile, it will assume the personality and answer the poll questions from that perspective.

>GPT-3 has biases that are “fine-grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups.”

Apparently GPT-3 is not only aligned with humans in general, but precisely aligned with each demographic. So it knows our values really well.

The problem is that now we have to specify which bias we want from it, and that's a political problem, not an AI problem. It is ready to oblige and adopt whatever bias we want; it's even more aligned than we want, aligned to our stupid things as well.
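To give a rough idea of what that conditioning looks like (the profile, question and prompt wording below are invented for illustration, not taken from the paper):

```python
# Illustrative only: the profile, question, and prompt wording are invented,
# not taken from the study quoted above.

profile = {
    "age": 34,
    "gender": "woman",
    "region": "rural Midwest",
    "education": "high school",
}

poll_question = "Do you support increasing funding for public transit?"

prompt = (
    f"You are a {profile['age']}-year-old {profile['gender']} from the "
    f"{profile['region']} with a {profile['education']} education.\n"
    f"Poll question: {poll_question}\n"
    "Answer (Yes/No):"
)

# `prompt` would be sent to a language model; sampling many completions across
# many profiles approximates demographic response distributions.
print(prompt)
```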

3

Smack-works OP t1_itjr7yw wrote

I didn't suggest programming it with statements. The statements help to choose the mathematical formalization, the learning method. I added the "Recap" part of the post to clarify my point.

Are you familiar with impact measures/impact regularization?

> The problem is that now we have to specify which bias we want from it, and that's a political problem, not an AI problem. It is ready to oblige and adopt whatever bias we want; it's even more aligned than we want, aligned to our stupid things as well.

Don't have the time for the link right now, but "it simulates people" sounds absurd as a solution to Alignment:

  • There's deceptive Alignment.

  • You need Alignment for AI tasked with achieving superhuman results in the real world, when really harmful solutions are on the table.

  • If you have no idea how your solution works, it's not a solution. Some unknown model of some people answering questions is not a solution.

> You don't program AI with "statements", it's not Asimov's positronic brain. What you do instead is to provide a bunch of problems for the AI to solve. These problems should test the alignment, fuzz out the risks. When you are happy with its calibration you can deploy it.

What you say is not a solution, it's only a test of a solution.

2

gahblahblah t1_itkjbe2 wrote

On the one hand you claim: 'X statements can be thought of as simply a more convenient reframing of value statements'

You represent that human value statements are difficult to know: 'You don't know this. The same can be said about human value statements.'

Then you represent the types of statements I already know as being human value statements: ' Just compare X statements to the types of statements you already know (e.g. value statements).'

Then you represent that values are learned empirically not systemically.

But also earlier you claimed 'Value statements have multiple interpretations and contradictions too.'

And also claim that there is no footing at all for validating correctness: 'Any type of statement can be "made up".'

It appears to me that the properties of X statements are arbitrary, because the nature of what you call value statements is also arbitrary.

If you think that what you are describing as value statements is non-arbitrary, please characterise their properties, so that I could work out the difference between a false value statement vs a true one.

3

Smack-works OP t1_itkke06 wrote

Sorry, but can you rewind to the start of the conversation and explain (1) what your argument is and (2) why it's important?

For example:

> The first trouble with your X statements is that they seem like an infinite set.

Why is this a problem and why do you think this problem matters?

> If you think that what you are describing as value statements is non-arbitrary, please characterise their properties, so that I could work out the difference between a false value statement vs a true one.

You are asking me to solve ethics. But:

  • You don't need to solve ethics in order to learn values.

  • You don't need to solve ethics in order to define what is a "value statement".

  • You may not even need to define what is a "value statement".

1

gahblahblah t1_itkpsnv wrote

>You may not even need to define what is a "value statement".

You define your X statements based on value statements, but then also don't think value statements need defining. This is part of the confusion: when I try to examine what you are talking about, the expressions you've previously used as part of explanations and definitions you later represent as unknowable, which makes our conversations circular.

'Why is this a problem and why do you think this problem matters?'

When you represent that you can provide knowledge from a set of statements, but the dataset they are meant to represent is an infinite one, the first thing you are establishing is that the finite data that you have won't really be representative - so you won't be able to make behavior guarantees.

I don't think creating a robot that does not turn us into paperclips requires infinite data; rather, there is a smaller set of information that would allow us to make behaviour guarantees.

In order for this set of information to not be infinite, the set requires properties that are true for all the statements in the set, i.e. it must be possible to measure and validate whether a statement should be inside or outside the set. Having a validity check means that the second value statement you try to add to the set cannot be arbitrary, because an arbitrary statement may well contradict the first statement.

'You don't need to solve ethics in order to learn values.' How do you learn values then? If you don't know, then you are also saying you don't know how to learn X-statements.

2

Smack-works OP t1_itn7xuq wrote

You make way too many assumptions and inferences at every single turn of your answers. You don't question those assumptions and inferences. And you don't make those assumptions and inferences clear so that I can comfortably agree/disagree with them. You make no effort to check if we are on the same page or not.

> When you represent that you can provide knowledge from a set of statements, but the dataset they are meant to represent is an infinite one, the first thing you are establishing is that the finite data that you have won't really be representative - so you won't be able to make behavior guarantees.

As I understand your reasoning chain: "X statements are an infinite set = AI needs to know the entire set to be aligned = we need infinite memory for this".

Do you realize that this reasoning chain contains at least 2 assumptions which can be fully or partially wrong?

> In order for this set of information to not be infinite, the set requires properties that are true for all the statements in the set, i.e. it must be possible to measure and validate whether a statement should be inside or outside the set. Having a validity check means that the second value statement you try to add to the set cannot be arbitrary, because an arbitrary statement may well contradict the first statement.

You are potentially confusing several different things:

  • The set of all X statements.
  • The set of true X statements.
  • The set of statements the AI needs to know.
  • Checking whether a statement is an X statement or not.
  • Checking whether an X statement is true or not.

Not saying you are actually confused. But what you write doesn't differentiate between those things, so answering what you wrote is extremely frustrating.

> 'You don't need to solve ethics in order to learn values.' How do you learn values then? If you don't know, then you are also saying you don't know how to learn X-statements.

People learn values without solving ethics.

1

gahblahblah t1_itoqqsi wrote

Lots of communication involves making reasonable assumptions, so that a person doesn't need to spell out every detail. My presumptions are only a problem if they are wrong.

'People learn values without solving ethics'.

Your non-answer answer to my question leads me to conclude that I am wasting time trying to ask you further questions, so we can let it all go.

1

Smack-works OP t1_itou8ft wrote

Continuing to make presumptions (when you see that the previous ones are not clear) may be a problem too.

But I think your assumptions are wrong:

> The first trouble with your X statements is that they seem like an infinite set. The examples you give for your X statements in point 3 don't seem to come from a finite list of statements that you could just hand to a system. Rather, they appear to be rationales that you'd explain after encountering a specific situation.

  • AI doesn't need to know an infinite set of X statements.
  • You don't need to give all of the statements to the system beforehand. It can learn them.
  • It's OK if some statements are deduced after encountering a specific situation.

X statements are not supposed to encode the absolute ethics we put into the system beforehand.

> 'People learn values without solving ethics'.

> Your non-answer answer to my question leads me to conclude that I am wasting time trying to ask you further questions, so we can let it all go.

You assume I'm supposed to answer everything you write? I wanted you to admit at least some common ground ("we don't need to solve ethics") before dealing with more assumptions and inferences.

1