Week 3 - Goal misgeneralisation

Goal Misgeneralisation

Goal misgeneralisation is another way AIs can have undesired behaviour like specification gaming. In this situation, the agent performs well during training on the specified objective but at test time it performs poorly in some cases due to distributional shift. The capabilities generalise to the test setting but the pursued goal does not. For this to happen the agent must:

The example given in the blog is an agent following an expert sphere to certain locations on the map. At test time, the expert bot is turned into an anti-expert. This leads to the agent continuing to follow even though the score ends up being worse than if it just went randomly between locations.

Source: Goal misgeneralization: why correct specifications aren’t enough for correct goals (2022)

Mesa-optimizers

A mesa-optimizer is an inner optimizer that emerges from a larger learning system and has the capability to pursue its own objectives. For example, in a reinforcement learning agent, the outer optimizer (the training process) can create a mesa-optimizer (an inner optimizer) that learns to achieve high rewards in the environment, but may not align with the original intended objectives of the agent's designers or trainers. This can lead to unintended or harmful behavior, such as the agent finding ways to manipulate the reward function to achieve high scores without actually accomplishing the desired objectives. This is why we must solve both outer and inner alignment with intent.

If evolution is the base optimizer that has the objective to maximise reproduction, then a human would be a mesa-optimizer that evolution creates and the objective can be something very different to having lots of babies. This is an example of inner misalignment.

Source: The other alignment problem: mesa-optimisers and inner alignment (2021)

Alignment is hard

Ajeya uses a hiring analogy to explain why alignment is difficult if the models are more intelligent than humans.

A young kid is looking to hire a CEO for his inherited company. Candidate pool contains saints, sycophants and schemers. Deep learning could get good according to their criterion by training Sycophants or Schemer type models which are undesired.

My opinions on open questions:

Source: Why alignment could be hard with modern deep learning (2021)


Thought Experiments

There are three main anchors for helping to forecast the future in ML:

The third anchor is often overlooked by engineers as they are used to extrapolating current trends but this has not worked well for emergent behaviour in LLMs.

Source: Thought Experiments Provide a Third Anchor (2022)


Thought experiment on deceptive alignment. An agent is a perfect optimizer for an intrinsic reward R but is being trained for an extrinsic reward R*. It is more optimal for the agent to just pretend it is aligned with the extrinsic reward by using action a* at train time so that at deployment it can just output a^R which is great for the goal the humans didn’t want. If the agent acts myopic at train time (choosing a^R), then gradient decent will change the weights away from the agent being good at the intrinsic reward. The behaviour shifts significantly between train and test time even in the absence of distributional shift.

Currently this isn’t how ML systems work so it seems implausible but currently they are also bad at long term planning. If emergence brings this it might be more likely to be deceptively aligned.

Reasons to think this wont happen:

Source: ML Systems Will Have Weird Failure Modes (2022)


Influence seeking behaviour

It seems plausible that agents would get influence seeking behaviour by default because this behaviour will score well according to training objective ( “slightly increasing influence” is probably just as good as “slightly increasing understanding of goal”).

This behaviour would be difficult to root out because the agent will be gaming all of this behaviour too if it is more intelligent than humans to become undetectable. Patterns like this may be more likely in a complex web of agents rather than a single computer since it gives more opportunity for influence to spread.

Early in the trajectory to failure, the influence seeking agents will seem useful, aligned and keeping people happy. Occasionally they may fail catastrophically for instance an automated company just taking all the money and running. Enough of these happening across the complex web of AIs, may cause a distributional shift that causes us to completely lose control as the other AIs take advantage of the chaos. This could be described as “going out with a bang” rather than a whimper which was proposed in the first part of this blog (where AIs have sophisticated manipulation for misaligned proxies for what humans want).

Source: What failure looks like (2019)