Interview with Jacob Hilton on the strength of the evidence for AI risk claims

About Jacob Hilton

Jacob is a researcher on the theory team at the Alignment Research Center.

Jacob thinks that the probability of AI causing extinction by 2100 is 1-10%.1)

Jacob’s overall assessment of the evidence

  • Jacob is most persuaded by the general argument that AI systems will become very powerful, and puts less weight on specific stories.2)
  • Jacob thinks that the case that AI systems will be very powerful is strong.3) Jacob has several reasons for this, including:4)
    • Analogies with the human brain
    • Theoretical considerations about neural networks being able to learn
    • Empirical evidence that increasing compute increases performance
  • Jacob thinks the evidence for misalignment is much more uncertain.5)

Jacob’s assessment of the evidence on particular claims

  • Goal-directedness (AI systems consistently pursuing goals):6)
    • Jacob thinks that there is a clear trend towards systems acting more autonomously.7)
  • Goal misgeneralization (AI systems learning a goal which is perfectly correlated with the intended goal in training, but which comes apart from the intended goal in testing and/or deployment):8)
    • Jacob thinks that misgeneralization is currently very common,9) but that it’s unclear how this will play out in more powerful systems.10)
    • Jacob notes that currently there aren’t examples of misgeneralization at very high levels of abstraction (like taking over the world to make a cup of coffee), but that this is to be expected as current systems aren’t capable of reasoning at that level in the first place.11)
    • Jacob thinks there is a trend of misgeneralization happening at increasingly high levels of abstraction.12)
    • Jacob thinks that current generalization failures are easy to explain and it’s easy to imagine mitigating them with more fine-tuning data. Jacob expects this to apply to future generalization failures too to some extent,13) although there are reasons to expect that this might not be enough for future models.
  • Power-seeking (AI systems effectively seeking power by resource acquisition, self-improvement, preventing shutdown, or other means):14) Jacob doesn’t think there’s empirical evidence of power-seeking in existing models so far.15)
  • Misuse (humans intentionally using AIs to cause harm):16) Jacob thinks that only around half of existential risk from AI is coming from misalignment, and that other things like misuse are important too.17)
    • Jacob favors a holistic approach to AI, and thinks that threats which don’t seem directly existential are also important to address.
“My cached number is something like five to ten percent [chance that AI causes an existential catastrophe by 2100]. Maybe that’s on the higher end of the numbers. I normally say one to five percent.” [2:38]
“In general I prefer to think about things holistically. I find the arguments that AI is likely to be very powerful relatively compelling, and that alone is enough to mean that AI as a whole is something people should be thinking carefully about, and existential risks are among the things that people should be worried about. Ultimately specific stories about internal planning within AIs are… illustration by example of why things could be scary. But the world is complicated… There are probably going to be powerful systems doing things, probably acting autonomously. It seems very unpredictable where that goes.” [19:35]
“The case that AI is likely to have a big effect, be relatively powerful, I feel relatively sold on. I think there’s a pretty good case for that.” [47:16]
“I think probably an important set of background views is around the scaling hypothesis, how likely AI is to be very powerful in general. If AI isn’t likely to be a very powerful tool, then it’s probably not very likely to pose existential risk. Briefly, that’s a combination of analogies to the human brain in terms of how plausible is it that systems can be as good as humans at things, and then theoretical considerations about neural networks being able to learn things, and empirical evidence, scaling laws and things like that which suggest that as compute becomes more plentiful, neural nets are better able to approximate various functions that we’re interested in approximating. So that’s background views on why I think it’s relatively likely that it ought to be feasible to get a larger network to do most of the sorts of things humans are doing. Maybe I'm 60 or 70% confident that that's feasible on the time scale of multiple decades.” [5:03]
“The arguments about misalignment risk are definitely more uncertain in that they are doing more extrapolation. Both arguments are doing extrapolation. I think the misalignment stuff is sometimes doing a bit more of a difficult extrapolation, because it’s extrapolating these generalization properties which is just notoriously hard to do. I think that means that the case is just much more uncertain, but the case that the stakes are big is very good.” [47:16]
“Goal-directedness is the property of some system to be aiming at some goal. It is in need of formalization.”
“We are already capable of getting AI systems to do simple things relatively autonomously. I don’t think it’s a threshold where now it’s autonomous, now it’s not… I think it’s a spectrum and it’s just very clearly ramping up. We already have things that have a little autonomy but not very much. I think it's just a pretty straightforward trend at this point.” [24:39]
Goal misgeneralization: “Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations.” Shah et al, Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals, 2022.
“But we do have lots of examples of AIs misgeneralizing at lower levels of abstraction.” [28:36]
“These generalization failures at new levels of abstraction are notoriously hard to predict. You have to try and intuit what an extremely large scale neural net will learn from the training data and in which ways it will generalize… I’m relatively persuaded that misgeneralization will continue to happen at higher levels of abstraction, but whether that actually is well described by some of the typical power-seeking stories I’m much less confident and it’s definitely going to be a judgment call.” [28:36]
“The story of you train an AI to fetch a coffee and then it realizes that the only way it can do that is to take over the world is a story about misgeneralization. And it's happening at a very high level of abstraction. You're using this incredibly intelligent system which is reasoning at a very high level about things and it's making the error at that high level… And I think the state of the evidence is… we've never observed a misgeneralization failure at such a high level of abstraction, but that's what we would expect because we don't have AIs that can even reason at that kind of level of abstraction.” [28:36]
“We have this trend where it seems like as you scale things up, misgeneralization failures do happen at high levels of abstraction.” [28:36]
“[Hilton] I think essentially all of the generalization failures we see today are fairly easy to explain. It’s reasonable to expect that they would go away with an order of magnitude or two orders of magnitude more data that was high quality enough… They’re pretty common but it’s pretty understandable why they are there. The amount of fine-tuning data used for RLHF and supervised fine-tuning for chatbots is incredibly small compared to the amount of pre-training data. It’s unsurprising it will still be generalizing more according to its pre-training objective.
[Hadshar] If current misgeneralization is easy to explain or it’s easy to imagine it being solved by more fine-tuning data, why doesn’t that also apply to potential future misgeneralization at higher levels of abstraction?
[Hilton]: I think it does apply. I think simple mitigation measures will go a long way.”[37:32]
“[A]ctive efforts by an AI system to gain and maintain power in ways that designers didn’t intend, arising from problems with that system’s objectives.” Carlsmith, Is Power-Seeking AI an Existential Risk?, 2021.
“I don’t think there’s really empirical evidence [for power-seeking]… To me it’s very uncertain.” [28:36]
See Hendrycks et al, An Overview of Catastrophic AI Risks, 2023, for an introduction to misuse, which they refer to as ‘malicious use’.
“We talked mostly about misalignment stuff. If I had to put a number on it, I’d say 50/50 between that and everything else. I’m pretty sympathetic to the approach that this is a big social challenge. We should address the problems we see in front of us and that'll be the best way of getting through it. And I think that means focusing on all sorts of other things as well. An obvious one which could be existential is interaction with bio, and misuse in other ways… I think disruption due to unemployment from AI is also something people should be focusing on. That's another thing that's hard to see how it becomes directly existential but I'm sure it interacts with many other things in various different ways.” [48:40]
arguments_for_ai_risk/is_ai_an_existential_threat_to_humanity/interviews_on_the_strength_of_the_evidence_for_ai_risk_claims/summary_of_an_interview_on_the_strength_of_the_evidence_for_ai_risk_claims_with_jacob_hilton.txt · Last modified: 2023/10/12 09:19 by rosehadshar