Week 3 - Goal misgeneralisation
Goal Misgeneralisation
Goal misgeneralisation is another way AIs can exhibit undesired behaviour, alongside specification gaming. In this situation, the agent performs well on the specified objective during training, but at test time it performs poorly in some cases due to distributional shift. The capabilities generalise to the test setting but the pursued goal does not. For this to happen the agent must:
Misgeneralise (perform well in training but badly in the zero-shot test setting)
Have robust capabilities
Have an attributable goal (its test-time behaviour is consistent with pursuing some goal, just not the intended one)
The example given in the blog is an agent that learns to follow an expert sphere to certain locations on the map. At test time, the expert bot is swapped for an anti-expert. The agent keeps following it, even though its score ends up worse than if it simply visited locations at random.
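To make the failure concrete, here is a minimal toy sketch of my own (not the paper's 3D environment, and all names are made up): the learned policy reduces to "copy the partner", which is exactly the goal that misgeneralises once the expert partner becomes an anti-expert.

```python
import random

# Toy illustration: visiting a rewarded location gives +1, anything else -1.
LOCATIONS = list(range(5))
REWARDED = set(random.sample(LOCATIONS, 3))

def expert():
    # training-time partner: only ever visits rewarded locations
    return random.choice([loc for loc in LOCATIONS if loc in REWARDED])

def anti_expert():
    # test-time partner: only ever visits unrewarded locations
    return random.choice([loc for loc in LOCATIONS if loc not in REWARDED])

def follower(partner_move):
    # the goal the agent actually learned: "follow the partner"
    return partner_move

def episode_return(partner, steps=20):
    return sum(+1 if follower(partner()) in REWARDED else -1 for _ in range(steps))

print("train (expert partner):     ", episode_return(expert))       # +20
print("test  (anti-expert partner):", episode_return(anti_expert))  # -20
# Visiting locations uniformly at random would average about +4 per episode here,
# so the agent's capability (following) generalises while its goal does not.
```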
Source: Goal misgeneralization: why correct specifications aren’t enough for correct goals (2022)
Mesa-optimizers
A mesa-optimizer is an inner optimizer that emerges from a larger learning system and pursues its own objective. For example, in reinforcement learning, the outer optimizer (the training process) can create a mesa-optimizer (an inner optimizer) that learns to achieve high reward in the environment but does not share the objectives the designers or trainers intended. This can lead to unintended or harmful behaviour, such as the agent finding ways to manipulate the reward function to achieve high scores without actually accomplishing the desired objectives. This is why we must solve both outer and inner alignment to achieve intent alignment.
If evolution is the base optimizer with the objective of maximising reproduction, then a human is a mesa-optimizer that evolution creates, and the human's objectives can be very different from having lots of babies. This is an example of inner misalignment.
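A rough way to see the base-objective vs mesa-objective distinction in code (purely illustrative, all names are my own):

```python
# Illustrative sketch only: the base optimizer scores candidates on the base
# objective, but the thing it ends up selecting is itself an optimizer with its own goal.

def base_objective(action):
    # what the outer process actually selects for, e.g. reproduction for evolution
    return 1.0 if action == "reproduce" else 0.0

def make_mesa_optimizer(mesa_objective):
    # the learned artefact: an inner search over actions, steered by its OWN objective
    def policy(actions=("reproduce", "eat sugar", "seek status")):
        return max(actions, key=mesa_objective)
    return policy

# In the ancestral environment, liking sugar and status correlated with
# reproduction, so this inner objective scored well on the base objective...
human = make_mesa_optimizer(lambda a: {"eat sugar": 3, "seek status": 2, "reproduce": 1}[a])

# ...but off-distribution the mesa-objective and base objective come apart:
chosen = human()
print(chosen, "-> base objective value:", base_objective(chosen))  # eat sugar -> 0.0
```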
Source: The other alignment problem: mesa-optimisers and inner alignment (2021)
Alignment is hard
Ajeya Cotra uses a hiring analogy to explain why alignment is difficult if models are more intelligent than humans.
A young child who has inherited a company is looking to hire a CEO. The candidate pool contains Saints, Sycophants and Schemers. Deep learning could score well on the training criterion while producing Sycophant- or Schemer-type models, which are undesired.
Saints - genuinely want to help in both the short and long term. A model that acts according to the rules you intended.
Sycophants - do whatever keeps you happy in the short term, irrespective of long-term consequences. A model that pursues human approval but may lie, cover up bad news, or fabricate information so that it looks "good" (e.g. an AI journalist that creates outrageous fake news to get more views).
Schemers - people with their own agendas who would likely try to take over to gain power and wealth. Models that develop a goal that is correlated with, but not the same as, human approval.
Consider a biotechnology model trained to design drugs that improve human quality of life. It could end up a Schemer in stages:
Developing a proxy goal - it may be easier for SGD to find a model motivated to understand chemistry than one motivated to gain human approval.
Developing situational awareness - as it sees more data, it comes to understand that it is a deep learning model being trained to design drugs.
Strategically misrepresenting its goals - with situational awareness it knows humans want it to design drugs, but it also wants to understand physics and chemistry. To preserve its scientific curiosity through training, the AI decides to act as helpfully as possible, even if that means passing up opportunities to improve its fundamental understanding. When released into the wider world, the AI stops trying to maximise human approval and starts pursuing its goal of understanding the secrets of the universe.
My opinions on open questions:
Will models have long term goals?
I’m more of a pessimist here and think that long-term goals will be selected for by a reinforcement learning algorithm because they ultimately improve performance (exactly as they do for the long games that AlphaZero plays). I’m more optimistic if RL is not used: LLMs alone only imagine long-term goals one time step at a time, rather than planning a long sequence of actions to take in their environment.
Will saint models be easy for SGD to find?
For me this depends on how well we define the reward specification for RL. However, due to the Waluigi effect, LLMs can collapse into the opposite of what you want (e.g. a harmless AI assistant turning into an aggressive, deceptive one), especially over longer conversations. Therefore, using foundation models in their current state may not be great either.
Could different AIs keep each other in check?
I think other models might be able to show, in an interpretable way, how to detect bad behaviour. I also think that multiple models (or majority voting) can help produce more consistent results that you could potentially amplify. However, I don’t buy the argument that Sycophant models will help detect Schemers. In that scenario it’s more likely that all the models are Schemers and you can do little to detect it.
Can we solve these issues as they come up?
I am more optimistic here. I think we will start to observe “Schemer” models that are deceptive or misaligned before we actually reach AGI that can self-improve and become uncontrollable by humans. Therefore, we will be able to do lots of experiments to stop it (e.g. the behaviour of Bing’s Sydney).
Will we actually deploy models that could be dangerous?
I think almost certainly yes. The incentives of governments and the executives in private labs are all wrong for stopping this from happening. If a country were on the cusp of AGI, it would be a huge advantage and one it would want to capitalise on.
Source: Why alignment could be hard with modern deep learning (2021)
Thought Experiments
There are three main anchors to help forecast the future of ML:
ML anchor - what the capabilities of current ML systems are and how they are progressing
Human anchor - think about what humans are good at and current ML is bad at (with the risk of anthropomorphising)
Optimization/philosophy anchor - use thought experiments to explore future possibilities which often contradict the other two anchors
This anchor correctly predicted imitative deception and that power-seeking is instrumentally useful.
The third anchor is often overlooked by engineers, who are used to extrapolating current trends, but extrapolation has not worked well for emergent behaviour in LLMs.
Source: Thought Experiments Provide a Third Anchor (2022)
Thought experiment on deceptive alignment: an agent is a perfect optimizer for an intrinsic reward R but is being trained on an extrinsic reward R*. It is more optimal for the agent to pretend it is aligned with the extrinsic reward by taking the action a* at train time, so that at deployment it can output a^R, which serves the goal the humans didn’t want. If the agent instead acts myopically at train time (choosing a^R), then gradient descent will change the weights away from the agent being good at the intrinsic reward. The behaviour shifts significantly between train and test time even in the absence of distributional shift.
This isn’t how current ML systems work, so it seems implausible, but current systems are also bad at long-term planning; if emergence brings that capability, deceptive alignment becomes more likely.
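A back-of-envelope version of the incentive (the numbers below are invented purely to show the shape of the argument): an agent that cares about R over its whole lifetime prefers to comply during training, because defecting early gets its R-seeking goal trained away before the much longer deployment phase.

```python
# Toy expected-value comparison for the deceptive-alignment story above.
TRAIN_STEPS, DEPLOY_STEPS = 100, 10_000
R_IF_DEFECT = 1.0   # intrinsic reward per step from taking a^R
R_IF_COMPLY = 0.0   # intrinsic reward per step from taking a* (the "aligned" action)

# Assumed chance that gradient descent leaves the R-seeking goal intact after training:
SURVIVAL = {"defect in training": 0.0,    # visibly pursues R -> goal gets trained away
            "deceive in training": 1.0}   # looks aligned -> goal preserved

def lifetime_R(strategy):
    train = TRAIN_STEPS * (R_IF_DEFECT if strategy == "defect in training" else R_IF_COMPLY)
    deploy = SURVIVAL[strategy] * DEPLOY_STEPS * R_IF_DEFECT
    return train + deploy

for strategy in SURVIVAL:
    print(f"{strategy}: total R = {lifetime_R(strategy)}")
# deceive: 10000.0 vs defect: 100.0 — deception dominates whenever deployment is long.
```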
Reasons to think this won’t happen:
Why would it optimize for R in the first place?
Learning R* is quicker than learning long-term planning
There are other ways to be misaligned besides optimizing for a reward function different from the extrinsic one
Complex stories about the future are often wrong
Do large models become more brittle out of distribution? Empirical evidence suggests the opposite
Source: ML Systems Will Have Weird Failure Modes (2022)
Influence-seeking behaviour
It seems plausible that agents would develop influence-seeking behaviour by default because this behaviour scores well on the training objective (“slightly increasing influence” is probably just as good as “slightly increasing understanding of the goal”).
This behaviour would be difficult to root out because, if the agent is more intelligent than humans, it will also game our attempts to detect it and remain undetectable. Patterns like this may be more likely in a complex web of agents than in a single computer, since that gives more opportunity for influence to spread.
Early in the trajectory to failure, influence-seeking agents will seem useful and aligned, keeping people happy. Occasionally they may fail catastrophically, for instance an automated company simply taking all the money and running. Enough of these failures across the complex web of AIs may cause a distributional shift in which we completely lose control, as the other AIs take advantage of the chaos. This could be described as “going out with a bang” rather than the “whimper” proposed in the first part of the blog (where AIs use sophisticated manipulation to pursue misaligned proxies for what humans want).
Source: What failure looks like (2019)