Argument for AI x-risk from effective malign agents

This page is incomplete, under active work and may be updated soon.

This is a summary of a commonly cited argument for existential risk from superhuman artificial intelligence (AI): that advanced AI systems will tend to pursue goals whose satisfaction would be devastatingly bad by the lights of any human, and that these AI systems will have the competence to achieve those goals. The argument appears to be suggestive but not watertight.


This seems to be one of the most common arguments for expecting superhuman artificial intelligence to bring about existential risk if built. It has been described or discussed in numerous works, as well as in many private writings and conversations. This is an attempt to lay out a strong form of the argument, based on what appears to be generally intended, and drawing on large parts of the aforementioned discussion.



  1. Superhuman AI: humanity will at some point develop AI systems at least as capable as any human at approximately all tasks, and substantially better at some tasks—call this ‘superhuman AI’.
  2. Inaction: no further special action will be taken to mitigate existential risk from superhuman AI systems. (This argument is about the default scenario without such efforts, because it is intended to inform decisions about applying these efforts, not because such efforts are unlikely.)

I. If superhuman AI is developed, then at least some superhuman AI systems are likely to be goal-directed

(Main article: Will advanced AI be agentic?)

That is, some such systems will systematically choose actions that increase the probability of some states of the world over others [1]. Reasons to expect that some superhuman AI systems will be goal-directed include:

  1. Some goal-directed behavior is likely to be economically valuable to create (that is, valuable in ways that cannot be replicated using only non-goal-directed systems). This appears to hold even for apparently x-risky systems, and it will likely appear to hold more often than it actually does.
  2. Goal-directed entities may tend to arise from machine learning training processes not intending to create them.
  3. Coherence arguments may imply that systems with some goal-directedness will become more strongly goal-directed over time.

II. If goal-directed superhuman AI systems are built, their preferred outcomes will probably be as bad as human extinction

(Main article: Will AI agent values be bad by default?)

Reasons to expect this include:

  1. Finding goals that aren’t extinction-level bad and are relatively useful appears to be hard: we don’t have a description of the goals of any human or group of humans, or a practical way to produce one, and divergences from human goals seem likely to quickly produce goals that are in intense conflict with human goals, because most goals produce convergent incentives for knowledge, zealous global reach, and power-seeking. There are exceptions where we see ways to define goals that produce very constrained behavior, but such behavior is unlikely to be valuable, and so unlikely to be produced.
  2. Finding goals that are extinction-level bad and relatively useful appears to be easy: for example, advanced AI with the sole objective ‘increase revenue’ might be highly valuable for a time, but risks longer-term harms to society if it powerfully accrues resources and power toward this end with no regard for ethics beyond those laws that are still too expensive to break.
  3. Giving a powerful AI system any specific goals appears to be hard, both because we don’t know of any procedure to do it, and because we have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals different from those they were trained on.

Thus we might expect that successfully making advanced AI with acceptable goals will remain out of reach, while AI systems with x-risky goals are pursued, and the AI systems created as a result have different goals again, which are also x-risky and perhaps more costly on other fronts.

Even if the above difficulties were overcome (that is, if advanced AI systems could easily be given desirable goals), controlling the goals of a population of powerful AI systems would present further serious difficulties: different parties can create AI systems with different goals, and selection among systems according to their differing success can shape the resulting distribution of systems. There is no guarantee that safer systems will be cheaper, more profitable, or more readily adopted; more likely the opposite will be true.

III. If most superhuman AI systems have bad goals, the future will very likely be bad

That is, a population of superhuman AI systems would sufficiently control the future to either satisfy their own goals, or at least wrest control over the future from humans and bring about a future that nobody wants. This is supported by at least one of the following being true:

  1. A superhuman AI system would destroy humanity rapidly, via means available either through already having sufficient intelligence to design very powerful technologies and conduct very sophisticated strategies, or through gaining such intelligence in a rapid ‘intelligence explosion’, or self-improvement cycle. Such a system would have reason to destroy humanity because humans are (by assumption here) opposed to the AI’s goals and therefore a threat to their realization. Destruction of humanity would prevent human goals from being satisfied, because those goals are idiosyncratic and not shared by the prevailing AI. This branch may be true if such an AI system were or became radically more capable than humanity in general, or if there were highly destructive technologies available to minds slightly more intelligent than humans in certain respects.
  2. Superhuman AI systems will gradually come to control the future via accruing power and resources, which would be more available to the AI systems than to humans on average, because of the AI systems’ greater intelligence.

Conclusion: The creation of superhuman AI is likely to bring about an extinction-level bad future, by default

If I, II, and III are true, we thus have: assuming humanity develops superhuman AI systems, then by default some such systems will have goals, those goals will be extinction-level bad, and will likely be achieved. Thus if superhuman AI systems are developed, the future will likely be extinction-level bad by default.

It is not clear how likely these premises are. Each appears to have substantial probability, so the conclusion overall appears to have non-negligible probability.
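One rough way to make this step explicit is the chain rule of probability. The symbols below are introduced here only for illustration and do not appear in the original argument: S stands for the two assumptions (superhuman AI is developed, and no further special action is taken), G for "some superhuman AI systems are goal-directed" (I), B for "their goals are extinction-level bad" (II), and C for "the future is extinction-level bad" (III). This is a sketch of the structure, not an estimate of any of the quantities.

```latex
\begin{align*}
P(C \mid S) &\ge P(G \cap B \cap C \mid S) \\
            &= P(G \mid S)\; P(B \mid G, S)\; P(C \mid G, B, S)
\end{align*}
% If each of the three factors is at least p, then P(C \mid S) \ge p^3.
% Purely illustrative arithmetic: p = 1/2 would give a lower bound of 1/8.
```

The bound only shows that three substantial conditional probabilities multiply to something non-negligible; it says nothing about what the individual values actually are, which is the open question raised above.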

Counterarguments and open questions

(Main article: Arguments against AI posing an x-risk, or see pages for individual argument clauses above)

  1. Whether the argument goes through is a quantitative question, not an analytical one. It matters how much poorly motivated AI there is, how its goals differ from ours, how much smarter it is than others, how easily and cheaply the systems can cause destruction, and assuming that goal-directedness is a continuum, how much is needed before we have a problem.
  2. For comparing the competence of advanced AI systems to humans, the relevant comparison is with humans augmented with state-of-the-art technology, including other AI systems, agentic or non-agentic. It is not clear that the AI advantage is knowably large for relevant tasks, given that. (Counterargument to III.1, and possibly III.2)
  3. It is not clear that humans (or technologically augmented humans) are far from the limits of competence in strategically relevant domains, as would be needed for AI systems to be materially superior. How much headroom is available depends on the task: machines cannot outperform humans at Tic-tac-toe (playing within the rules), and even in less contrived tasks, humans might already be reaping most of the value. For instance, if most of the value of forks lies in having a handle with prongs attached to the end, then although humans continue to design slightly better ones, and machines might be able to add marginal value more than twice as fast as human designers, machines cannot perform twice as well at producing forks, because there isn’t enough headroom. For particular strategically relevant tasks, it is to our knowledge an open question how much headroom there is. (Counterargument to III.1 and III.2.)
  4. It is unclear that the ‘goal-directedness’ that is economically valuable, or suggested by other arguments under I, is the same goal-directedness that can be predicted to zealously optimize for an objective. It is possible to create non-zealous yet weakly agentic creatures, such as horses. Horses are also plausibly much more economically valuable than a version of horses that zealously optimized for something. This is not to say that zealous goal-directedness couldn’t be built, but that it may be easily avoidable, or even quite hard to build. Goal-directedness is a vague and poorly understood concept, so plausibly there is a large space of behavior that is economically valuable in the relevant way, but that does not zealously seek a particular universe state.
  5. Some divergence from ideal values probably implies a catastrophic outcome, but it is unclear how large that divergence is, and whether it is quantitatively close to the divergence that is likely between a person’s values and the values of an AI trained to match that person. Humans are already varied in their goals, and perhaps, even with imperfect methods, we can get AI well within that circle of variation. There is also a circle of ‘what can be aligned with incentives in the environment’, which is arguably larger than the circle containing most human values.
  6. Intelligence is helpful for accruing power and resources, all else equal, but many other things are helpful too, and AI systems do not necessarily have those in abundance. The above argument assumes that any difference in intelligence in particular will eventually win out over any differences in other initial resources. We don’t know why this would be the case.
  7. This argument also appears to apply to human groups such as corporations, so we need an explanation of why those are not an existential risk.
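The Tic-tac-toe example in counterargument 3 can be checked directly. The following is a minimal sketch, assuming a plain exhaustive minimax search; the names `LINES`, `winner` and `game_value` are hypothetical and chosen only for this illustration. It computes the value of the game under optimal play by both sides, which is a draw, so no player (machine or otherwise) can do better than a draw against a perfect opponent who plays within the rules.

```python
from functools import lru_cache

# All eight winning lines on a 3x3 board indexed 0..8 (hypothetical helper).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def game_value(board, player):
    """Value for X under optimal play: +1 X wins, 0 draw, -1 O wins.

    `board` is a 9-character string of 'X', 'O' and ' ';
    `player` is the side to move.
    """
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if " " not in board:
        return 0  # board full with no winner: draw
    nxt = "O" if player == "X" else "X"
    values = [
        game_value(board[:i] + player + board[i + 1:], nxt)
        for i, cell in enumerate(board)
        if cell == " "
    ]
    # X maximizes the value, O minimizes it.
    return max(values) if player == "X" else min(values)


if __name__ == "__main__":
    print(game_value(" " * 9, "X"))  # prints 0: optimal play is a draw
```

This only illustrates the zero-headroom extreme; how much headroom exists in strategically relevant tasks remains the open question stated in the list above.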


If this argument is successful, then in conjunction with the view that superhuman AI will be developed, it implies that humanity faces a large risk from artificial intelligence. This is evidence that it is a problem worthy of receiving resources, though this depends on the tractability of improving the situation (which depends in turn on the timing of the problem), and on what other problems exist, none of which we have addressed here.

Primary author

Katja Grace

[1] ‘Goal-directed’ suggests pursuit of one particular outcome, but note that the term here also refers to having preferences over every choice of states of affairs, and acting to increase the chances of higher-ranked outcomes, without a particular focus on the top-ranked one.

This is intended to be weaker than the claim that such systems will be ‘agents’ with consistent utility functions, for instance also including systems such as humans, who appear to be inconsistent (for example, see the Allais Paradox) but still systematically bring about certain outcomes on net, across a range of situations.
Last modified: 2023/09/26 07:22 by katjagrace