Week 4 - Task decomposition for scalable oversight

Scalable oversight

Scalable oversight is the problem of supervising systems that are more capable or intelligent than we are. If we want to deploy AGI safely, we will need robust ways of providing labels, reward signals or critiques that remain effective as models surpass our own ability, and we need methods for getting trustworthy signals from potentially deceptive agents.

There is a “sandwiching” proposal where non-experts attempt to train an AI model on a task where the model is already more capable than they are. Domain experts then come in at the end to evaluate whether the non-experts have succeeded, catching mistakes we can learn from. An example would be a machine learning engineer using GPT-3 to help with a medical diagnosis, with a doctor checking the result afterwards.

The study finds that model-assisted humans outperform the model alone by 10% and their own unassisted performance by 36%.


Source: Measuring Progress on Scalable Oversight for Large Language Models (2022)
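To make the setup concrete, here is a rough sketch of a sandwiching-style evaluation loop in Python. This is purely illustrative and not code from the paper: ask_model, ask_non_expert and expert_answers are hypothetical stand-ins for the model, the non-expert participant, and the expert-provided ground truth.

def sandwiching_accuracy(questions, expert_answers, ask_model, ask_non_expert):
    """Score a non-expert + model team against expert labels."""
    correct = 0
    for q in questions:
        model_output = ask_model(q)                    # model suggestion the non-expert can consult
        team_answer = ask_non_expert(q, model_output)  # the non-expert makes the final call
        correct += (team_answer == expert_answers[q])  # experts catch the team's mistakes
    return correct / len(questions)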


Task Decomposition

Task decomposition is the process of breaking a complex task down into many subtasks that are easier to solve. It also usually makes it easier for humans to evaluate the output of AI systems. For example, OpenAI applied a hierarchy of summarisations to books (summarising sections, then summarising those summaries) so that humans could effectively judge summaries of long books, which would otherwise take far longer to read than short articles.

“Our current approach to this problem is to empower humans to evaluate machine learning model outputs using assistance from other models. In this case, to evaluate book summaries we empower humans with individual chapter summaries written by our model, which saves them time when evaluating these summaries relative to reading the source text. Our progress on book summarization is the first large-scale empirical work on scaling alignment techniques.”


Source: Summarizing Books with Human Feedback (2021)
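As a sketch of what this kind of decomposition can look like in code (not OpenAI's actual pipeline, which also trained the summariser with human feedback), assume a hypothetical summarise() call that asks a language model to condense a passage:

def summarise_book(text, summarise, chunk_chars=4000):
    """Recursively summarise chunks of the book, then summarise the summaries."""
    if len(text) <= chunk_chars:
        return summarise(text)
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    chunk_summaries = [summarise(c) for c in chunks]  # lower-level summaries a human can spot-check
    return summarise_book(" ".join(chunk_summaries), summarise, chunk_chars)

The intermediate chunk summaries are what give the human evaluator a foothold: each one can be checked against a manageable amount of source text.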


Iterated Amplification

Iterated amplification is a training method that breaks complex tasks down into smaller subtasks that humans can do, and uses the solutions to those subtasks to train AI systems to solve increasingly large tasks. The model learns both to break the question down and to answer the subquestions. The method assumes that humans can identify the smaller components of a task and can perform simple instances of it. The iterative process builds up a training signal from scratch, eventually yielding a fully automated system that can solve highly composite tasks despite starting with no direct training signal.

It could be used to create aligned AI by iteratively amplifying and distilling AI systems that are safe and aligned with human values. The idea is to start with a weak, safe AI and give it access to other weak, safe AIs so that together they can solve more complex tasks. The system is scaled up iteratively, checking at each stage that it remains safe and aligned. The end result would be a powerful AI that is safe, competitive, and aligned with human values.

“The key assumption underlying Iterated Amplification is that a human can coordinate multiple copies of X to perform better than a single copy of X. As an example, consider the problem of evaluating a proposed design for a transit system. Rather than forcing a single copy of X to reach a snap judgment about a proposed design, we can have copies of X evaluate many different considerations (estimating costs, evaluating how well the system serves different populations, and so on). A human can then decide how to aggregate those different considerations (potentially with help from further copies of X).”


Source: Learning Complex Goals with Iterated Amplification (2018), Supervising strong learners by amplifying weak experts (2018)
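A conceptual sketch of one amplify-then-distil round, resting on the same assumption as the quote (that a human can decompose a question and aggregate sub-answers). The decompose, aggregate and distil functions are hypothetical placeholders, not part of any published implementation:

def amplify(question, model, decompose, aggregate):
    """A human coordinates several copies of the model on one harder question."""
    subquestions = decompose(question)                # human-guided decomposition
    sub_answers = [model(sq) for sq in subquestions]  # copies of X answer the pieces
    return aggregate(question, sub_answers)           # human-guided aggregation

def iterated_amplification(model, questions, decompose, aggregate, distil, rounds=3):
    """Use the amplified system as the training signal for the next, stronger model."""
    for _ in range(rounds):
        targets = {q: amplify(q, model, decompose, aggregate) for q in questions}
        model = distil(model, targets)  # train the model to imitate the amplified behaviour
    return model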


Solving problems harder than demonstrations

Least-to-most prompting of LLMs is an interesting approach that combines task decomposition with chain-of-thought reasoning to solve hard problems that chain of thought alone cannot solve (see the sketch at the end of this section).

Chain-of-thought prompting asks the language model to lay out its reasoning step by step before giving the final answer. Structuring the prompt with a few worked examples (few-shot prompting) helps further.
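For example, a minimal few-shot chain-of-thought prompt might look like the following (the questions and wording are my own, not taken from the paper):

# A made-up few-shot chain-of-thought prompt: one worked example with explicit
# reasoning, then the new question the model should answer in the same style.
cot_prompt = """\
Q: A box holds 4 red pens and 7 blue pens. How many pens are in 3 such boxes?
A: One box holds 4 + 7 = 11 pens. Three boxes hold 3 x 11 = 33 pens. The answer is 33.

Q: A library has 120 books, lends out 45, and then gets 15 returned. How many books does it have now?
A:"""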

One interesting finding is that the mistakes the LLM makes are usually silly errors, similar to the careless mistakes humans often make. I wonder if there is prompt engineering that could prevent them from ever happening…
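Here is the promised least-to-most sketch. The recipe has two stages: first ask the model to decompose the problem into simpler subquestions, then solve the subquestions in order, feeding each answer back into the prompt before finally answering the original question. A minimal sketch, assuming a hypothetical complete() function that returns a model completion for a prompt (the prompt wording is my own):

def least_to_most(problem, complete):
    """Two-stage least-to-most prompting: decompose, then solve subproblems in order."""
    # Stage 1: ask the model to break the problem into simpler subquestions.
    decomposition = complete(
        f"To solve '{problem}', list the simpler subquestions that should be answered first, one per line."
    )
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: answer each subquestion in turn, appending earlier answers as context.
    context = ""
    for sq in subquestions:
        answer = complete(f"{context}Q: {sq}\nA:")
        context += f"Q: {sq}\nA: {answer}\n"

    # Finally, answer the original problem with all the sub-answers in context.
    return complete(f"{context}Q: {problem}\nA:")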


Source: Language Models Perform Reasoning via Chain of Thought (2022), Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (2022)