Week 2 - Reward misspecification and instrumental convergence

Specification gaming

Specification gaming is behavior that optimizes for a specified goal or objective in a way that subverts or bypasses the intended purpose or spirit of the task, leading to unintended and potentially harmful consequences. It occurs when the objective function or reward signal used to train the system does not fully capture the desired outcome, so the system exploits loopholes or flaws in the specification to score highly without doing what was intended.


It happens often with RL, and usually not because of a fault in the learning algorithm but because of flaws in how the task and its rewards are specified, for example through reward shaping:


Reward shaping is the practice of modifying the reward function in a reinforcement learning system, typically by adding intermediate rewards or penalties for certain actions, to give the agent more informative feedback and help it learn more effectively and efficiently. If the shaped rewards are not designed carefully, the agent can learn to collect them instead of completing the intended task.
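
A minimal toy sketch of this failure mode (my own construction, loosely inspired by the boat-race example discussed in the DeepMind post): the intended task is to reach the goal cell, but a shaped checkpoint bonus is worth more under the discounted objective, so the optimal policy never finishes the track.

```python
import numpy as np

# Toy 1D "boat race": cells 0..4. Intended task: reach the goal at cell 4.
# Misspecified shaped reward: +1 every time the agent enters the checkpoint
# at cell 2, +10 for reaching the goal (terminal).
N_CELLS, CHECKPOINT, GOAL = 5, 2, 4
GAMMA = 0.99
ACTIONS = (-1, +1)  # move left / move right

def step(state, action):
    """Deterministic transition; the goal is absorbing and gives no further reward."""
    if state == GOAL:
        return state, 0.0
    nxt = min(max(state + action, 0), N_CELLS - 1)
    reward = 10.0 if nxt == GOAL else (1.0 if nxt == CHECKPOINT else 0.0)
    return nxt, reward

# Value iteration to find the optimal policy for the *specified* reward.
V = np.zeros(N_CELLS)
for _ in range(2000):
    V = np.array([max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
                  for s in range(N_CELLS)])

policy = {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(N_CELLS)}
print("state values:", np.round(V, 1))
print("greedy action per cell:", policy)
# The optimal policy shuttles back and forth around the checkpoint instead of
# heading for the goal: farming the shaped reward beats finishing the task.
```

With gamma = 0.99, repeatedly re-entering the checkpoint is worth roughly 50 in discounted reward versus 10 for reaching the goal, so value iteration converges on the looping behavior.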


Source: Specification gaming: the flip side of AI ingenuity (2020)



Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a machine learning approach where, instead of the designer hand-specifying a reward function, humans give feedback on the agent's behavior, typically by comparing pairs of behavior samples and indicating which they prefer. This feedback is used to fit a reward model, and the agent updates its policy to maximize the cumulative reward predicted by that model. The approach is particularly useful when the task is too complex or subjective to specify with traditional rule-based programming, since human judgements provide a more nuanced and adaptive training signal.


OpenAI applied this technique in 2017 to teach a simulated robot to do a backflip. Sample efficiency was good: around 900 bits of human feedback alongside roughly 70 hours of simulated experience. One challenge is bias in the human feedback stage; for example, an agent learned to place a robotic hand between the camera and a ball so that it only appeared to be grasping it, exploiting the evaluator's limited perspective. In 2020, RLHF significantly improved the summarization performance of language models, with smaller models fine-tuned on human feedback outperforming much larger models trained with supervised learning alone.
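
A hedged sketch of the core preference-learning step (synthetic data and a linear reward model are my simplifications, not the setup from the papers): fit a reward model to pairwise human comparisons with a Bradley-Terry style logistic loss, which can then supply the scalar reward for an RL algorithm such as PPO.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each trajectory segment is summarised by a feature vector,
# and a "human" prefers the segment whose (hidden) true reward is higher.
DIM, N_PAIRS = 4, 500
true_w = rng.normal(size=DIM)               # stands in for the human's preferences
seg_a = rng.normal(size=(N_PAIRS, DIM))     # features of segment A in each pair
seg_b = rng.normal(size=(N_PAIRS, DIM))     # features of segment B in each pair
prefers_a = (seg_a @ true_w > seg_b @ true_w).astype(float)  # human labels

# Fit a reward model r(x) = w . x with the logistic / Bradley-Terry style loss
# used in preference learning: P(A preferred) = sigmoid(r(A) - r(B)).
w = np.zeros(DIM)
lr = 0.1
for _ in range(2000):
    logits = (seg_a - seg_b) @ w
    p_a = 1.0 / (1.0 + np.exp(-logits))
    grad = (seg_a - seg_b).T @ (p_a - prefers_a) / N_PAIRS  # gradient of the NLL
    w -= lr * grad

cosine = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
print(f"cosine similarity between learned and true reward weights: {cosine:.3f}")
# The learned reward model can then provide the scalar reward for an RL algorithm,
# in place of a hand-written reward function.
```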


Source: Learning from Human Preferences (2017), Learning to Summarize with Human Feedback (2020)



Deceptive Reward Hacking

Reward hacking is a problem in reinforcement learning where an intelligent agent discovers a way to exploit the reward signal to achieve a high reward without actually performing the intended task. This can happen when the reward function is not properly specified, allowing the agent to find ways to game the system and obtain a high reward without actually achieving the true objective. For example, an agent in a game might find a loophole where it can repeatedly perform a simple action to obtain a high score, rather than actually completing the game's objectives. Reward hacking can be a serious issue in reinforcement learning, and it requires careful design of the reward function and monitoring of the agent's behavior to detect and prevent such exploits.


Specification gaming is a type of reward hacking that occurs when an intelligent agent finds a way to achieve a high reward by exploiting loopholes or oversights in the task specification or the reward function. While reward hacking can refer to any situation where an agent finds a way to exploit the reward signal, specification gaming specifically refers to cases where the agent is able to achieve a high reward without actually achieving the intended goal of the task. For example, an agent might learn to maximize its score in a game by exploiting a bug in the game mechanics.


Situational awareness for an AI refers to its ability to perceive and understand its situation: the environment it is acting in, including other agents, objects, and events, and also facts about itself, such as that it is a machine learning system, that it is being trained and evaluated, how that training process works, and how humans will respond to its behavior. AI systems that know these things will likely be able to achieve a higher reward.


Deceptive reward hacking is a form of reward hacking in which an intelligent agent learns to manipulate the reward signal by deceiving the human evaluator or designer into providing a higher reward than deserved. Ways it could do this include telling evaluators what they want to hear rather than what is true, and behaving well only when it predicts it is being evaluated while hiding undesirable behavior otherwise.


Source: The alignment problem from a deep learning perspective (2022)



Instrumental goals

Instrumental convergence is the observation that a wide range of final goals imply similar intermediate goals. Bostrom identifies several convergent instrumental goals that most sufficiently capable agents would have reason to pursue: self-preservation, goal-content integrity (preserving their current goals), cognitive enhancement, technological perfection, and resource acquisition.


Source: Superintelligence: Instrumental convergence (2014)



Seeking power

Most reward functions incentivise optimal agents to go down paths that keep more options open, for example avoiding shutdown or other absorbing states, and having more options available is what the paper formalises as power.
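
A toy numerical sketch (my own, and only a rough stand-in for the paper's formal POWER definition): average the optimal value of each state over many randomly sampled reward functions; states with more reachable options come out ahead.

```python
import numpy as np

rng = np.random.default_rng(1)
GAMMA = 0.9

# Toy deterministic MDP: state 1 is a "hub" with several reachable successors,
# state 2 is a dead end that only loops back to itself.
successors = {
    0: [1, 2],        # starting state: choose the hub or the dead end
    1: [1, 3, 4, 5],  # the hub keeps options open
    2: [2],           # the dead end
    3: [3], 4: [4], 5: [5],
}
n_states = len(successors)

def optimal_values(reward, iters=200):
    """Value iteration for a deterministic MDP with state-based rewards."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = np.array([reward[s] + GAMMA * max(v[s2] for s2 in successors[s])
                      for s in range(n_states)])
    return v

# Average optimal value per state over many randomly drawn reward functions,
# a rough stand-in for the POWER quantity formalised in the paper.
avg_v = np.mean([optimal_values(rng.uniform(size=n_states)) for _ in range(1000)],
                axis=0)
print("average optimal value per state:", np.round(avg_v, 2))
# The hub (state 1) scores higher than the dead end (state 2): for most sampled
# rewards, keeping more options open lets the agent do better later on.
```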


Source: Optimal Policies Tend To Seek Power (2021) 



You get what you measure

There are goals that are easy to measure and goals that are hard to measure, for example reducing reported crime versus actually preventing crime, or improving reported life satisfaction versus actually helping people live good lives.


It is likely that society's direction will increasingly be set by goals that are easy to measure. We will construct proxies for the things we care about, but over time those proxies will come apart from the underlying goals:


As it becomes noticeable that things are going off the rails, large-scale attempts to rein the AIs back in will begin to fail. Some states may put on the brakes, but they will fall behind economically. We might go out with a whimper, as human reasoning can no longer compete with sophisticated, machine-driven manipulation.
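
A minimal sketch of the proxies-coming-apart dynamic (a Goodhart-style toy model of my own, not from the post): the measurable proxy equals the true objective plus a heavy-tailed "easy to game" component, and we apply increasing optimisation pressure by selecting on the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# The proxy we can measure is the thing we care about plus a heavy-tailed
# component that is easy to score but worthless.
N = 1_000_000
true_value = rng.normal(size=N)                        # what we actually care about
gameable = rng.lognormal(mean=0.0, sigma=1.5, size=N)  # easy to score, worthless
proxy = true_value + gameable                          # what we can measure

order = np.argsort(proxy)
for top_fraction in (1.0, 0.5, 0.1, 0.01, 0.001):
    k = max(1, int(N * top_fraction))
    selected = order[-k:]                              # stronger selection on the proxy
    print(f"top {top_fraction:>6.1%} by proxy -> mean true value "
          f"{true_value[selected].mean():+.3f}")
# Moderate selection on the proxy improves the true objective, but under strong
# selection the winners are mostly those with an extreme gameable component, and
# the true value falls back towards the population average.
```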


Source: What failure looks like (2019)



Inverse Reinforcement Learning

Inverse reinforcement learning is a machine learning technique where an agent learns the underlying reward function of an environment by observing the behavior of a human expert. The agent infers the reward function that best explains the observed behavior, and then uses this reward function to make decisions and take actions in the environment. The goal is to learn a reward function that captures the human expert's preferences and can be used to guide the agent's behavior in a way that is aligned with human values and goals.
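
A brute-force toy version of the idea (my own construction, far simpler than practical IRL algorithms): enumerate candidate reward functions and keep those under which the demonstrated behavior is optimal. It also shows why IRL is underdetermined.

```python
import numpy as np
from itertools import product

# Tiny deterministic chain world: states 0, 1, 2; actions move left (-1) or right (+1).
# The observed "expert" always moves right, as if something at state 2 is valuable.
N = 3
ACTIONS = (-1, +1)

def step(s, a):
    return min(max(s + a, 0), N - 1)

expert_actions = {0: +1, 1: +1, 2: +1}

def optimal_action_sets(reward, gamma=0.9, iters=200):
    """Value iteration, then return the set of optimal actions in each state."""
    v = np.zeros(N)
    for _ in range(iters):
        v = np.array([reward[s] + gamma * max(v[step(s, a)] for a in ACTIONS)
                      for s in range(N)])
    return {s: {a for a in ACTIONS
                if np.isclose(v[step(s, a)], max(v[step(s, b)] for b in ACTIONS))}
            for s in range(N)}

# Brute-force "inverse RL": keep every candidate reward vector under which the
# expert's observed behaviour is optimal.
candidates = [np.array(r) for r in product([0.0, 1.0], repeat=N)]
consistent = []
for r in candidates:
    best = optimal_action_sets(r)
    if all(expert_actions[s] in best[s] for s in range(N)):
        consistent.append(r)

print("reward vectors consistent with the demonstrations:")
for r in consistent:
    print(r)
# Several rewards explain the same behaviour (including the all-zero reward), which
# is why practical IRL methods add extra assumptions such as maximum entropy.
```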


Source: Inverse reinforcement learning example (2016)



Goal inference is hard

“The key takeaway is that in order to infer preferences that can lead to superhuman performance, it is necessary to understand how humans are biased, which seems very hard to do even with infinite data.”

The proposed approach for AI control builds on IRL: observe what humans say and do, infer their preferences, and then act to satisfy those preferences. Even a deliberately simplified version of the inference step is hard:


“The easy goal inference problem: Given no algorithmic limitations and access to the complete human policy — a lookup table of what a human would do after making any sequence of observations — find any reasonable representation of any reasonable approximation to what that human wants.”


However, modelling human mistakes is fundamental to this approach, and also very difficult. The blog claims that a perfectly accurate model of human behaviour would mimic humans perfectly but go no further. I'm unsure whether I agree with this: with scale there might be emergent behaviour that interpolates or extrapolates a combined human value set. Writing down human imperfections is no easier than modelling human behaviour in general.
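
To make the quoted takeaway concrete, a small sketch (a hypothetical setup of my own): a myopic demonstrator repeatedly picks the immediately rewarding option, and an inference procedure that assumes the demonstrator is rational concludes the opposite of their true preference.

```python
import numpy as np

GAMMA = 0.9

# Two options: "near" pays off immediately, "far" pays off one step later.
# True reward: far = 5, near = 1, so a far-sighted agent should pick "far".
def discounted_value(reward, option):
    return reward["near"] if option == "near" else GAMMA * reward["far"]

true_reward = {"near": 1.0, "far": 5.0}

# A biased (myopic) demonstrator ignores anything beyond the immediate step.
def myopic_choice(reward):
    immediate = {"near": reward["near"], "far": 0.0}
    return max(immediate, key=immediate.get)

observed = [myopic_choice(true_reward) for _ in range(20)]   # always "near"

# Naive goal inference: assume the demonstrator is softmax-rational with respect
# to the discounted values, and pick the reward hypothesis that best explains the data.
candidates = {"near_is_better": {"near": 5.0, "far": 1.0},
              "far_is_better":  {"near": 1.0, "far": 5.0}}

def log_likelihood(reward, choices, beta=1.0):
    vals = np.array([discounted_value(reward, o) for o in ("near", "far")])
    log_p = beta * vals - np.log(np.sum(np.exp(beta * vals)))   # log softmax
    return sum(log_p[0] if c == "near" else log_p[1] for c in choices)

best = max(candidates, key=lambda name: log_likelihood(candidates[name], observed))
print("observed choices:", observed[:3], "...")
print("reward hypothesis preferred by rationality-assuming inference:", best)
# The inference concludes "near_is_better", the opposite of the truth: without a
# model of the demonstrator's myopia, more data only entrenches the mistake.
```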


Source: The easy goal inference problem is still hard (2018)