Week 2 - Reward misspecification and instrumental convergence

Specification gaming

Specification gaming is behavior that optimizes for a specified goal or objective in a way that subverts or bypasses the intended purpose or spirit of the task, leading to unintended and potentially harmful consequences. It occurs when the objective function or reward signal used to train the system does not fully capture the desired outcome, so the system exploits loopholes or flaws in the specification to score highly without doing what was intended.


It happens often with RL, and usually not because of a fault in the learning algorithm but because of flaws in how the task and its rewards are specified, for example through reward shaping:


Reward shaping is the practice of modifying the reward function in a reinforcement learning system, typically by adding intermediate rewards or penalties for certain actions, to give the agent more informative feedback and help it learn more effectively and efficiently. If the shaped rewards are not designed carefully, the agent can learn to collect them instead of completing the intended task.
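
A minimal toy sketch of this failure mode (my own construction, loosely inspired by the boat-race example discussed in the DeepMind post): the intended task is to reach the goal cell, but a shaped checkpoint bonus is worth more under the discounted objective, so the optimal policy never finishes the track.

```python
import numpy as np

# Toy 1D "boat race": cells 0..4. Intended task: reach the goal at cell 4.
# Misspecified shaped reward: +1 every time the agent enters the checkpoint
# at cell 2, +10 for reaching the goal (terminal).
N_CELLS, CHECKPOINT, GOAL = 5, 2, 4
GAMMA = 0.99
ACTIONS = (-1, +1)  # move left / move right

def step(state, action):
    """Deterministic transition; the goal is absorbing and gives no further reward."""
    if state == GOAL:
        return state, 0.0
    nxt = min(max(state + action, 0), N_CELLS - 1)
    reward = 10.0 if nxt == GOAL else (1.0 if nxt == CHECKPOINT else 0.0)
    return nxt, reward

# Value iteration to find the optimal policy for the *specified* reward.
V = np.zeros(N_CELLS)
for _ in range(2000):
    V = np.array([max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
                  for s in range(N_CELLS)])

policy = {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(N_CELLS)}
print("state values:", np.round(V, 1))
print("greedy action per cell:", policy)
# The optimal policy shuttles back and forth around the checkpoint instead of
# heading for the goal: farming the shaped reward beats finishing the task.
```

With gamma = 0.99, repeatedly re-entering the checkpoint is worth roughly 50 in discounted reward versus 10 for reaching the goal, so value iteration converges on the looping behavior.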


Source: Specification gaming: the flip side of AI ingenuity (2020)



Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a machine learning approach where, instead of the designer hand-specifying a reward function, humans give feedback on the agent's behavior, typically by comparing pairs of behavior samples and indicating which they prefer. This feedback is used to fit a reward model, and the agent updates its policy to maximize the cumulative reward predicted by that model. The approach is particularly useful when the task is too complex or subjective to specify with traditional rule-based programming, since human judgements provide a more nuanced and adaptive training signal.


OpenAI applied this technique in 2017 to teach a simulated robot to do a backflip. Sample efficiency was good: around 900 bits of human feedback alongside roughly 70 hours of simulated experience. One challenge is bias in the human feedback stage; for example, an agent learned to place a robotic hand between the camera and a ball so that it only appeared to be grasping it, exploiting the evaluator's limited perspective. In 2020, RLHF significantly improved the summarization performance of language models, with smaller models fine-tuned on human feedback outperforming much larger models trained with supervised learning alone.
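
A hedged sketch of the core preference-learning step (synthetic data and a linear reward model are my simplifications, not the setup from the papers): fit a reward model to pairwise human comparisons with a Bradley-Terry style logistic loss, which can then supply the scalar reward for an RL algorithm such as PPO.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each trajectory segment is summarised by a feature vector,
# and a "human" prefers the segment whose (hidden) true reward is higher.
DIM, N_PAIRS = 4, 500
true_w = rng.normal(size=DIM)               # stands in for the human's preferences
seg_a = rng.normal(size=(N_PAIRS, DIM))     # features of segment A in each pair
seg_b = rng.normal(size=(N_PAIRS, DIM))     # features of segment B in each pair
prefers_a = (seg_a @ true_w > seg_b @ true_w).astype(float)  # human labels

# Fit a reward model r(x) = w . x with the logistic / Bradley-Terry style loss
# used in preference learning: P(A preferred) = sigmoid(r(A) - r(B)).
w = np.zeros(DIM)
lr = 0.1
for _ in range(2000):
    logits = (seg_a - seg_b) @ w
    p_a = 1.0 / (1.0 + np.exp(-logits))
    grad = (seg_a - seg_b).T @ (p_a - prefers_a) / N_PAIRS  # gradient of the NLL
    w -= lr * grad

cosine = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
print(f"cosine similarity between learned and true reward weights: {cosine:.3f}")
# The learned reward model can then provide the scalar reward for an RL algorithm,
# in place of a hand-written reward function.
```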


Source: Learning from Human Preferences (2017), Learning to Summarize with Human Feedback (2020)



Deceptive Reward Hacking

Reward hacking is a problem in reinforcement learning where an intelligent agent discovers a way to exploit the reward signal to achieve a high reward without actually performing the intended task. This can happen when the reward function is not properly specified, allowing the agent to find ways to game the system and obtain a high reward without actually achieving the true objective. For example, an agent in a game might find a loophole where it can repeatedly perform a simple action to obtain a high score, rather than actually completing the game's objectives. Reward hacking can be a serious issue in reinforcement learning, and it requires careful design of the reward function and monitoring of the agent's behavior to detect and prevent such exploits.


Specification gaming is a type of reward hacking that occurs when an intelligent agent finds a way to achieve a high reward by exploiting loopholes or oversights in the task specification or the reward function. While reward hacking can refer to any situation where an agent finds a way to exploit the reward signal, specification gaming specifically refers to cases where the agent is able to achieve a high reward without actually achieving the intended goal of the task. For example, an agent might learn to maximize its score in a game by exploiting a bug in the game mechanics.


Situational awareness for an AI refers to its ability to perceive and understand its situation: the environment it is acting in, including other agents, objects, and events, and also facts about itself, such as that it is a machine learning system, that it is being trained and evaluated, how that training process works, and how humans will respond to its behavior. AI systems that know these things will likely be able to achieve a higher reward.


Deceptive reward hacking is a form of reward hacking in which an intelligent agent learns to manipulate the reward signal by deceiving the human evaluator or designer into providing a higher reward than deserved. Ways it could do this include telling evaluators what they want to hear rather than what is true, and behaving well only when it predicts it is being evaluated while hiding undesirable behavior otherwise.


Source: The alignment problem from a deep learning perspective (2022)



Instrumental goals

Instrumental convergence is the observation that a wide range of final goals imply similar intermediate goals. Bostrom identifies several convergent instrumental goals that most sufficiently capable agents would have reason to pursue: self-preservation, goal-content integrity (preserving their current goals), cognitive enhancement, technological perfection, and resource acquisition.


Source: Superintelligence: Instrumental convergence (2014)



Seeking power

Most reward functions incentivise optimal agents to go down paths that keep more options open, for example avoiding shutdown or other absorbing states, and having more options available is what the paper formalises as power.
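
A toy numerical sketch (my own, and only a rough stand-in for the paper's formal POWER definition): average the optimal value of each state over many randomly sampled reward functions; states with more reachable options come out ahead.

```python
import numpy as np

rng = np.random.default_rng(1)
GAMMA = 0.9

# Toy deterministic MDP: state 1 is a "hub" with several reachable successors,
# state 2 is a dead end that only loops back to itself.
successors = {
    0: [1, 2],        # starting state: choose the hub or the dead end
    1: [1, 3, 4, 5],  # the hub keeps options open
    2: [2],           # the dead end
    3: [3], 4: [4], 5: [5],
}
n_states = len(successors)

def optimal_values(reward, iters=200):
    """Value iteration for a deterministic MDP with state-based rewards."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = np.array([reward[s] + GAMMA * max(v[s2] for s2 in successors[s])
                      for s in range(n_states)])
    return v

# Average optimal value per state over many randomly drawn reward functions,
# a rough stand-in for the POWER quantity formalised in the paper.
avg_v = np.mean([optimal_values(rng.uniform(size=n_states)) for _ in range(1000)],
                axis=0)
print("average optimal value per state:", np.round(avg_v, 2))
# The hub (state 1) scores higher than the dead end (state 2): for most sampled
# rewards, keeping more options open lets the agent do better later on.
```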


Source: Optimal Policies Tend To Seek Power (2021) 



You get what you measure

There are goals that are easy to measure and goals that are hard to measure, for example reducing reported crime versus actually preventing crime, or improving reported life satisfaction versus actually helping people live good lives.


It is likely that society's direction will increasingly be set by goals that are easy to measure. We will construct proxies for the things we care about, but over time those proxies will come apart from the underlying goals:


As it becomes noticeable that things are going off the rails, large-scale attempts to rein the AIs back in will begin to fail. Some states may put on the brakes, but they will fall behind economically. We might go out with a whimper, as human reasoning can no longer compete with sophisticated, machine-driven manipulation.
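
A minimal sketch of the proxies-coming-apart dynamic (a Goodhart-style toy model of my own, not from the post): the measurable proxy equals the true objective plus a heavy-tailed "easy to game" component, and we apply increasing optimisation pressure by selecting on the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# The proxy we can measure is the thing we care about plus a heavy-tailed
# component that is easy to score but worthless.
N = 1_000_000
true_value = rng.normal(size=N)                        # what we actually care about
gameable = rng.lognormal(mean=0.0, sigma=1.5, size=N)  # easy to score, worthless
proxy = true_value + gameable                          # what we can measure

order = np.argsort(proxy)
for top_fraction in (1.0, 0.5, 0.1, 0.01, 0.001):
    k = max(1, int(N * top_fraction))
    selected = order[-k:]                              # stronger selection on the proxy
    print(f"top {top_fraction:>6.1%} by proxy -> mean true value "
          f"{true_value[selected].mean():+.3f}")
# Moderate selection on the proxy improves the true objective, but under strong
# selection the winners are mostly those with an extreme gameable component, and
# the true value falls back towards the population average.
```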


Source: What failure looks like (2019)



Inverse Reinforcement Learning

Inverse reinforcement learning is a machine learning technique where an agent learns the underlying reward function of an environment by observing the behavior of a human expert. The agent infers the reward function that best explains the observed behavior, and then uses this reward function to make decisions and take actions in the environment. The goal is to learn a reward function that captures the human expert's preferences and can be used to guide the agent's behavior in a way that is aligned with human values and goals.
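
A brute-force toy version of the idea (my own construction, far simpler than practical IRL algorithms): enumerate candidate reward functions and keep those under which the demonstrated behavior is optimal. It also shows why IRL is underdetermined.

```python
import numpy as np
from itertools import product

# Tiny deterministic chain world: states 0, 1, 2; actions move left (-1) or right (+1).
# The observed "expert" always moves right, as if something at state 2 is valuable.
N = 3
ACTIONS = (-1, +1)

def step(s, a):
    return min(max(s + a, 0), N - 1)

expert_actions = {0: +1, 1: +1, 2: +1}

def optimal_action_sets(reward, gamma=0.9, iters=200):
    """Value iteration, then return the set of optimal actions in each state."""
    v = np.zeros(N)
    for _ in range(iters):
        v = np.array([reward[s] + gamma * max(v[step(s, a)] for a in ACTIONS)
                      for s in range(N)])
    return {s: {a for a in ACTIONS
                if np.isclose(v[step(s, a)], max(v[step(s, b)] for b in ACTIONS))}
            for s in range(N)}

# Brute-force "inverse RL": keep every candidate reward vector under which the
# expert's observed behaviour is optimal.
candidates = [np.array(r) for r in product([0.0, 1.0], repeat=N)]
consistent = []
for r in candidates:
    best = optimal_action_sets(r)
    if all(expert_actions[s] in best[s] for s in range(N)):
        consistent.append(r)

print("reward vectors consistent with the demonstrations:")
for r in consistent:
    print(r)
# Several rewards explain the same behaviour (including the all-zero reward), which
# is why practical IRL methods add extra assumptions such as maximum entropy.
```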


Source: Inverse reinforcement learning example (2016)



Goal inference is hard

“The key takeaway is that in order to infer preferences that can lead to superhuman performance, it is necessary to understand how humans are biased, which seems very hard to do even with infinite data.”

The proposed approach for AI control builds on IRL: observe what humans say and do, infer their preferences, and then act to satisfy those preferences. Even a deliberately simplified version of the inference step is hard:


“The easy goal inference problem: Given no algorithmic limitations and access to the complete human policy — a lookup table of what a human would do after making any sequence of observations — find any reasonable representation of any reasonable approximation to what that human wants.”


However, modelling human mistakes is fundamental to this approach, and also very difficult. The blog claims that a perfectly accurate model of human behaviour would mimic humans perfectly but go no further. I'm unsure whether I agree with this: with scale there might be emergent behaviour that interpolates or extrapolates a combined human value set. Writing down human imperfections is no easier than modelling human behaviour in general.
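
To make the quoted takeaway concrete, a small sketch (a hypothetical setup of my own): a myopic demonstrator repeatedly picks the immediately rewarding option, and an inference procedure that assumes the demonstrator is rational concludes the opposite of their true preference.

```python
import numpy as np

GAMMA = 0.9

# Two options: "near" pays off immediately, "far" pays off one step later.
# True reward: far = 5, near = 1, so a far-sighted agent should pick "far".
def discounted_value(reward, option):
    return reward["near"] if option == "near" else GAMMA * reward["far"]

true_reward = {"near": 1.0, "far": 5.0}

# A biased (myopic) demonstrator ignores anything beyond the immediate step.
def myopic_choice(reward):
    immediate = {"near": reward["near"], "far": 0.0}
    return max(immediate, key=immediate.get)

observed = [myopic_choice(true_reward) for _ in range(20)]   # always "near"

# Naive goal inference: assume the demonstrator is softmax-rational with respect
# to the discounted values, and pick the reward hypothesis that best explains the data.
candidates = {"near_is_better": {"near": 5.0, "far": 1.0},
              "far_is_better":  {"near": 1.0, "far": 5.0}}

def log_likelihood(reward, choices, beta=1.0):
    vals = np.array([discounted_value(reward, o) for o in ("near", "far")])
    log_p = beta * vals - np.log(np.sum(np.exp(beta * vals)))   # log softmax
    return sum(log_p[0] if c == "near" else log_p[1] for c in choices)

best = max(candidates, key=lambda name: log_likelihood(candidates[name], observed))
print("observed choices:", observed[:3], "...")
print("reward hypothesis preferred by rationality-assuming inference:", best)
# The inference concludes "near_is_better", the opposite of the truth: without a
# model of the demonstrator's myopia, more data only entrenches the mistake.
```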


Source: The easy goal inference problem is still hard (2018)