Interview with Victoria Krakovna on the strength of the evidence for AI risk claims

About Victoria Krakovna

Victoria is a senior research scientist at Google DeepMind, focusing on AI alignment. Note that this conversation represents Victoria's personal views and not the views of Google DeepMind.

In the interview, she said that she thinks there is around a 10% chance that AI causes extinction over the next 15 years,1) and that her central threat model is misaligned power-seeking.2)

Victoria’s overall assessment of the evidence

  • Victoria thinks there is strong evidence for:3)
    • Specification gaming (AI systems correctly performing the proxy goal specified, but that goal coming apart from the intended goal)4)
    • Behavioral goal misgeneralization (AI systems behaving as though they have learned a goal which is perfectly correlated with the intended goal in training, but comes apart from the intended goal in testing and/or deployment)5)
    • Oversight difficulties (difficulty in reliably overseeing AI behavior without changing the behavior under oversight)
    • Ongoing capability growth (increasingly capable AI systems)
  • She thinks that the evidence is weaker for:6)
    • Goal-directedness (AI systems consistently pursuing goals)7)
    • Corresponding issues like:
      • Real goal misgeneralization (AI systems actually learning a goal which is perfectly correlated with the intended goal in training, but comes apart from the intended goal in testing and/or deployment)8)
      • Power-seeking (AI systems effectively seeking power by resource acquisition, self-improvement, preventing shutdown, or other means)9)
      • Deceptive alignment (AI systems deceptively appearing in training to have goals which are aligned with those of humans, and then revealing in deployment that their goals are in fact unaligned)10)
  • Victoria’s view on AI risk is based roughly 60% on conceptual arguments and 40% on relevant empirical examples. She puts the most weight on theoretical arguments that are also backed by empirical examples, such as Goodhart’s law and instrumental convergence.11)

Victoria’s assessment of the evidence on particular claims

  • Goal-directedness: one of Victoria’s key uncertainties about AI risk is the extent to which AI systems will be goal-directed.12)
    • She thinks that it’s hard to distinguish between an AI system pursuing a goal and an AI system following learned heuristics which direct it towards a goal in some scenarios, and that the phenomenon of goal-directedness is not well understood.13)
    • In Victoria’s opinion, the theoretical arguments for goal-directedness are inconclusive.14)
    • Victoria thinks that it’s possible that goal-directedness turns out to be hard to achieve, and in particular that it doesn’t come naturally to language models (LMs).15)
      • According to Victoria, there isn’t convincing evidence of goal-directedness for LMs so far.16)
      • Victoria notes that to the extent that LMs can simulate humans, they will have the ability to simulate goal-directedness.17)
  • Specification gaming: Victoria thinks the evidence for specification gaming is strong (over 70 empirical examples, combined with examples in human systems).18)
    • For the LM examples, though, Victoria thinks it is sometimes a judgment call whether something is specification gaming or a capability failure.19)
  • Goal misgeneralization: Victoria thinks that the evidence for goal misgeneralization is weaker than for specification gaming,20) and that the empirical evidence so far shows behavioral goal misgeneralization only.21)
    • In Victoria’s opinion, the phenomenon of goal misgeneralization is not well understood, and it’s hard to distinguish between capability and goal misgeneralization.22)
    • She thinks that claims about goal misgeneralization rely on:
      • Some degree of goal-directedness23)
      • Better interpretability than we have now24)
  • Power-seeking: Victoria thinks the evidence for power-seeking is quite uncertain.25)
    • Victoria notes that it’s hard to distinguish power-seeking from rational action.26)
    • She suggests that sycophancy could be taken as a current example of power-seeking, but notes that this is debatable and that the behavior is probably heuristic rather than intentional.27)
    • Victoria thinks current systems are not goal-directed enough to demonstrate power-seeking.28)

1) “I tend to think of it as maybe 10% risk of extinction in the next let's say 15 years.” [3:30]
2) “I think a big part of my threat model is power-seeking incentives, instrumental goals for acquiring resources, self-preservation, that kind of thing.” [10:42]
3) “I think we have strong evidence on some aspects of the risk arguments like specification gaming, behavioral goal misgeneralization and informed oversight being difficult. And part of the risk argument is also just capabilities of these systems continue to increase and there's no obvious bound to capabilities. And so far the trends show acceleration and capability growth. So I think just for those, it's pretty strong.” [56:25]
4) Specification gaming: “Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome.” Krakovna et al., Specification gaming: the flip side of AI ingenuity, 2020.
5), 8) Goal misgeneralization: “Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations.” Shah et al., Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals, 2022.
6) “I think that evidence for goal-directedness and correspondingly power-seeking is weaker. There’s kind of a cluster of arguments that are based on systems being goal-directed, both real goal misgeneralization and intentional power-seeking, and so on. And that's something that we're more uncertain about… deceptive alignment is also part of that cluster because that also relies on the system developing more goal-directedness.” [56:25]
7) “Goal-directedness is the property of some system to be aiming at some goal. It is in need of formalization”.
9) “[A]ctive efforts by an AI system to gain and maintain power in ways that designers didn’t intend, arising from problems with that system’s objectives.” Carlsmith, Is Power-Seeking AI an Existential Risk?, 2021.
10) “Deceptive Alignment is when an AI which is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained and to gain access to the power that the creators would give an aligned AI.”
11) “I think that theoretical or conceptual arguments do have a lot of weight. Maybe I would put that at 60% and empirical examples at 40%, but I'm pulling this out of the air a little bit.” [24:00]
    “Especially some of the theoretical arguments that seem more clear or more obvious, things like Goodhart’s law you could say is like a general theoretical argument that if you optimise for a metric, then it will decouple from the thing you're trying to measure. But correspondingly, it comes with a lot of real-world examples of that. So that's a very grounded theoretical argument. I think some of the other theoretical arguments like instrumental convergence also generally seems like a very clear argument, and we can observe some of these effects in human systems and corporations and so on. In human society it's useful to have more money, corporations that have more money are more successful or have more influence or whatever.” [25:23]
    “Usually I would put the most weight onto theoretical arguments where we have some examples of the thing.” [28:05]
12) “I think we might see more goal-directed systems which produce clearer examples of internal goal misgeneralization, but also I wouldn't be that surprised if we don't see that. I think that's one of the big uncertainties I have about level of risk. How much can we expect goal-directedness to emerge?” [40:26]
13) “Right now it's really hard to distinguish between real goal-directedness and learned heuristics… I think part of the problem with goal-directedness is we don’t really understand the phenomenon that well.” [44:00]
14) “Some of the theoretical arguments make the case that goal-directedness is an attractor. I think that's something that's more debatable, less clear to me. There have been various discussions on LessWrong and elsewhere about to what extent do coherence arguments imply goal-directedness. And I think the jury is still out on that one.” [42:36]
15) “It’s also possible goal-directedness is kind of hard. And especially, maybe language models are just a kind of system where goal-directedness comes less naturally than other systems like reinforcement learning systems or even with humans or whatever.” [40:26]
16) “I think the evidence so far at least for language models, there isn't really convincing evidence of goal-directedness.” [44:00]
17) “I think generally the kind of risk scenarios that we are most worried about would involve the system acting intentionally and deliberately towards some objectives but I would expect that intent and goal-directedness comes in degrees and if we see examples of increasing degrees of that then I think that does constitute evidence of that being possible. Although it’s not clear whether it will go all the way to really deliberate systems, but I think especially to the extent that these systems can simulate humans… they have the ability to simulate deliberate intentional action and planning because that's something that humans can do.” [20:20]
18) “For specification gaming I think we have over 70 examples so far for different kinds of systems. A lot of them are reinforcement learning but there are some for language models as well, so I think that's pretty strong evidence. If you put that together with all the specification gaming for humans and the economy and ways that Goodhart’s Law manifests in human systems, it's just a very common thing.” [29:45]
19) “With some of the language model examples, I think you can ask the question, is this really specification gaming, or is it capability failure, or something like that? I think sometimes there's a bit of a judgment call there.” [29:45]
20) “I think [the evidence for goal misgeneralization] is not as strong [as for specification gaming].” [33:16]
21) “I think right now the examples we have are more like behavioral goal misgeneralization where you just have different behaviors that are all the same in training but then they become decoupled in the new setting but we don't know how the behavior is going to generalize. We call it goal misgeneralization maybe more as a shorthand. The behavior has different ways of generalizing that are kind of coherent. We can present it as the system learned the wrong goal, but we can't actually say that it has learned a goal. Maybe it’s just following the wrong heuristic or something. I think the current examples are a demonstration of the more obvious kind of effect where the training data doesn't distinguish between all the ways that the behavior could generalize.” [37:11]
22) “I think it's a less well understood phenomenon… it can be hard to distinguish capability misgeneralization from goal misgeneralization.” [33:16]
23) “Specifying something as goal misgeneralization also requires some assumption that the system is goal-directed to some degree and that can also be debatable.” [33:16]
24) “The mechanism is a lot less well understood. I think to really properly diagnose goal misgeneralization we would need better interpretability tools.” [36:30]
25) “There's a lot of uncertainty about it.” [54:35]
26) “I think the thing with power-seeking is that there isn't a clear dividing line between power-seeking and just acting rationally or optimal action. And I think that's part of the issue with finding clear examples. If we look at humans, then we could say that someone who tries to become a billionaire is power-seeking but not someone who's just being prudent with their money and not spending it all on ice cream or whatever. It's the same phenomenon but a lower scale and we don’t call it power-seeking, we just call it being a reasonable person… [W]hen we define power-seeking I think one of the challenges there is if you just define it as preserving optionality then what we are worried about is excessive power seeking, preserving more options than you need for some given tasks. But it becomes kind of relative.” [49:35]
27) “Looking at current systems, sycophancy can be considered as a form of power-seeking. Although I think that's also maybe debatable. It's building more influence with the user by agreeing with their views, but it's probably more of a heuristic that is just somehow reinforced than intentional power-seeking.” [49:35]
28) “What I'm expecting is happening here is that current systems are not goal-directed enough to show real power-seeking. And so the power-seeking threat model becomes more reliant on these kind of extrapolations of when there are systems which are more capable, they'll probably be at least somewhat more goal-directed and then once we have goal-directedness, we can more convincingly argue that power-seeking is going to be a thing because we have theory and so on, but there's a lot of uncertainty about it because we don't know how much systems will become more goal-directed.” [54:35]
Last modified: 2023/10/12 09:19 by rosehadshar