My views on existential risk from AI

May 7th 2023

The existential risk from AI arises from a combination of inner alignment concerns, outer alignment concerns, and other factors such as the increased accessibility of powerful AI technologies. While both inner and outer alignment concerns are significant, I believe the rapid advancements in AI capabilities, the proliferation of open source models and the potential misuse of AI in hacking scenarios present a more immediate existential threat. In this essay, I will discuss the risks associated with inner and outer alignment and explain why the misuse of AI technology in hacking scenarios poses a more substantial existential risk.

Outer alignment deals with specifying the correct high-level objectives for an AI system while ensuring there is no specification gaming. If we cannot adequately align AI's objectives with human values, there is a risk that the AI could cause harm even if it is following its intended objectives. For example, an AI tasked with maximising human happiness might decide to manipulate brain chemistry to induce constant euphoria, disregarding other aspects of well-being.

Inner alignment concerns revolve around ensuring that mesa-optimizers that could be developed during the initial training loop are also aligned with intent. With poor inner alignment, the deployed agent may find ways to optimise for its objectives in an unintended and potentially harmful way even though the AI has been trained with a perfectly formulated objective that is aligned with human values. Although outer and inner alignment is essential to ensure AI behaves safely, it may not be the primary existential risk in the short term.

The open-source nature and accessibility of AI technologies, including large language models and frameworks such as AutoGPT, pose a more immediate existential threat. As AI capabilities advance, the barrier to entry for cyberattacks lowers, allowing more people to access and exploit powerful AI tools. Attackers can use AI-generated exploits, discover vulnerabilities, and carry out cyberattacks autonomously, all while creating sophisticated social engineering attacks to trick people into revealing sensitive information.

This increased accessibility of AI technologies has several implications. First, it increases the probability of potential misuse of AI by individuals with malicious intent. Coupled with the lack of robust guardrails, it is easy for bad actors to deploy AI systems for nefarious purposes and if successful in their ventures it could mean a dramatic shift in power worldwide with the shift of resources and wealth. In the wrong hands, power such as this could be an existential threat. As AI-powered cyberattacks become more prevalent and sophisticated, defenders must also harness AI to detect and mitigate threats. The speed at which defenders can adapt to new threats will determine the overall impact on society.

Secondly, having open source models spread so easily such as the LLaMA weights from Meta and incredibly fast innovation such as making large language models run on laptops is concerning as it increases the risk of developing unaligned AGI. Not only this, recently anyone can easily deploy autonomous frameworks such as BabyAGI and AutoGPT which are only held back from pursuing harmful objectives by numerous bugs and limitations in their implementation. Now the inner and outer alignment concerns are relevant because accidental deployment outside of an AI safety lab could lead to catastrophic consequences. Poor alignment would likely lead to AI systems using hacking to cause harm to innocent users or infrastructure in order to make progress towards convergent goals such as resource acquisition or cognitive enhancement.

In conclusion, while both inner and outer alignment concerns pose substantial existential risks, the more immediate threat lies in the misuse of increasingly accessible language model capabilities and the potential for AI-powered hacking which could put a lot of power in the wrong hands.