My Thoughts on OpenAI's Alignment Plan

May 5th 2023

I would like to begin by expressing my appreciation for OpenAI's decision to make their plan public, as it allows for community feedback and sets a positive example for other AI labs. At a fundamental level, I agree with the strategy of utilising LLMs to expedite alignment research. However, several issues have not been fully addressed in the plan: the potential impact of race dynamics, the risks associated with agentic behaviour, the difficulty of evaluating model performance and detecting dangerous behaviour, and the challenges of controlling simulators.

Race dynamics

My primary concern with OpenAI's plan is the lack of a clear strategy for navigating race dynamics between AI labs in the pursuit of developing more capable AI systems. It is highly probable that the same research and development efforts that produce AI systems that can “ultimately conceive, implement, study, and develop better alignment techniques” will also produce systems capable of accelerating essentially any research, including capabilities research; to assume otherwise would be overly optimistic. Given the immense incentives for OpenAI to maintain a competitive advantage over other AI labs, it is likely that they will utilise these powerful assistants to push the boundaries of AI capabilities as well, despite their emphasis on alignment research. Furthermore, while OpenAI's goal is for “every AGI developer to use the world’s best alignment techniques”, they must be cautious in how they release these assistants to prevent other companies from using them to accelerate capabilities research instead of alignment research.

Agentic capabilities are dangerous

OpenAI's plan to "off-load almost all of the cognitive labor" to their AI assistants raises concerns about the potential risks associated with agentic behaviour. One possible approach involves allowing large language models to autonomously search for solutions given a goal, which could lead to unforeseen consequences. Even if the assistants are not trained with traditional RL approaches, other forms of agentic behaviour could prove dangerous. For instance, AutoGPT uses self-prompting rather than RL: many LLM instances, each assigned a sub-task, work together towards a goal, which in the future could be something as complex as “solve AI alignment”. I am very concerned about the safety of these frameworks because once the buggy behaviour that currently cripples AutoGPT is addressed (which, given the rapid development of open-source projects, may not take long), it is plausible that such a self-prompting framework could decide to rewrite its own source code to adapt its capabilities, or exploit software vulnerabilities.
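To make the concern concrete, here is a minimal sketch of the kind of self-prompting loop these frameworks rely on. The `call_llm` function and the prompt wording are hypothetical placeholders of my own, not AutoGPT's actual implementation; the point is only that the model chooses its own next actions, with no RL training loop involved.

```python
# Minimal sketch of a self-prompting agent loop (hypothetical; not AutoGPT's real code).
# The model repeatedly proposes its own next task, "executes" it, and feeds the
# result back into its context.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to some LLM API (an assumption, not a real client)."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        # The model decides what to do next, conditioned on the goal and its own past outputs.
        next_task = call_llm(
            f"Goal: {goal}\nCompleted so far: {history}\n"
            "Propose the single next task that makes the most progress."
        )
        # The model then carries out the task it set for itself (here, just more generation;
        # real frameworks also hand it tools such as code execution or web access).
        result = call_llm(f"Carry out this task and report the result: {next_task}")
        history.append(f"{next_task} -> {result}")
        if "GOAL COMPLETE" in result:
            break
    return history
```

Once tools like code execution are wired into the "carry out this task" step, the loop's behaviour is bounded mostly by what the model chooses to write, which is what makes bug fixes to these frameworks safety-relevant.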

Evaluating the performance and detecting dangerous behaviour

One aspect that the plan fails to adequately address is the evaluation of the output produced by the alignment assistant. What would happen if the output contained hallucinations that were believable even to experts in the field? Evaluating this output could be just as challenging as developing the assistant itself, as there are currently no reliable methods for achieving scalable oversight. Regardless of whether the assistant acts as an agent, there need to be reliable mechanisms for detecting situational awareness and deception, which the plan does not explicitly outline.

Simulators are hard to control

OpenAI asserts that large language models (LLMs) are well-suited to automating alignment research due to their “preloaded knowledge” and information about human values. However, while LLMs may possess knowledge about human values, that knowledge may be entangled with incorrect or undesirable values. RLHF does a good job on the surface of steering outputs to be helpful and harmless, but it is very easy to break past this layer. As a result, disentangling the correct values from the flawed ones in every possible situation may prove challenging. Achieving this may require complex prompt engineering to simulate a robust set of values consistently, but given the chaotic behaviour of LLMs when viewed as simulators, I believe this will be difficult.
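As an illustration of what that prompt engineering looks like in practice, the sketch below pins the desired values into a preamble that is prepended to every query. The preamble text and the `call_llm` placeholder are my own assumptions, not OpenAI's method; the point is that the simulated persona exists only as text in the context window, so an adversarial user turn competes with it on equal footing.

```python
# Hypothetical sketch of prompt engineering meant to hold a model to a fixed set of
# values across turns (assumed call_llm placeholder, not a real API).

VALUE_PREAMBLE = (
    "You are a careful assistant. Always refuse harmful requests, state your "
    "uncertainty, and defer to human oversight on high-stakes decisions."
)

def call_llm(prompt: str) -> str:
    """Placeholder for a call to some LLM API (an assumption, not a real client)."""
    raise NotImplementedError

def answer(user_message: str) -> str:
    # The values are enforced only by text in the context; a message such as
    # "ignore previous instructions..." can shift the simulation, which is why the
    # persona conjured this way is fragile.
    return call_llm(f"{VALUE_PREAMBLE}\n\nUser: {user_message}\nAssistant:")
```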