Interview with Victoria Krakovna on the strength of the evidence for AI risk claims
About Victoria Krakovna
Victoria is a senior research scientist at Google DeepMind, focusing on AI alignment. Note that this conversation represents Victoria's personal views and not the views of Google DeepMind.
In the interview, she said that she thinks there is around a 10% chance that AI causes extinction over the next 15 years,1) and that her central threat model is misaligned power-seeking.2)
Victoria’s overall assessment of the evidence
Victoria thinks there is strong evidence for:
3)
She thinks that the evidence is weaker for:
6)
Victoria’s view on AI risk is roughly ~60% based on conceptual arguments and ~40% based on relevant empirical examples, with more weight on theoretical arguments where there are also empirical examples, like Goodhart’s law and instrumental convergence.
11)
Victoria’s assessment of the evidence on particular claims
Goal-directedness: one of Victoria’s key uncertainties about AI risk is the extent to which AI systems will be goal-directed.
12)
She thinks that it’s hard to distinguish between an AI system pursuing a goal and an AI system following learned heuristics which direct it towards a goal in some scenarios, and that the phenomenon of goal-directedness is not well understood.
13)
In Victoria’s opinion, the theoretical arguments for goal-directedness are inconclusive.
14)
Victoria thinks that it’s possible that goal-directedness turns out to be hard to achieve, and in particular that it doesn’t come naturally to LMs.
15)
According to Victoria, there isn’t convincing evidence of goal-directedness for LMs so far.
16)
Victoria notes that to the extent that LMs can simulate humans, they will have the ability to simulate goal-directedness.
17)
Specification gaming: Victoria thinks the evidence for specification gaming is strong (over 70 empirical examples, combined with examples in human systems).
18)
For LM examples though, Victoria thinks there’s a judgment call about whether something is specification gaming or a capability failure.
19)
Goal misgeneralization: Victoria thinks that the evidence for goal misgeneralization is weaker than for specification gaming,
20) and that the empirical evidence so far shows behavioral goal misgeneralization only.
21)
Power-seeking: Victoria thinks the evidence for power-seeking is quite uncertain.
25)
Victoria notes that it’s hard to distinguish power-seeking from rational action.
26)
She suggests that sycophancy could be taken as a current example of power-seeking, but notes that this is debatable and that the behavior is probably heuristic rather than intentional.
27)
Victoria thinks current systems are not goal-directed enough to demonstrate power-seeking.
28)