Interview with Jacob Hilton on the strength of the evidence for AI risk claims
About Jacob Hilton
Jacob is a researcher on the theory team at the Alignment Research Center.
Jacob thinks that the probability of AI causing extinction by 2100 is 1-10%.
Jacob’s overall assessment of the evidence
Jacob is most persuaded by the general argument that AI systems will become very powerful, and puts less weight on specific stories.
Jacob thinks that the case that AI systems will be very powerful is strong.
Jacob has several reasons for this, including:
Analogies with the human brain
Theoretical considerations about neural networks being able to learn
Empirical evidence that increasing compute increases performance
Jacob thinks the evidence for misalignment is much more uncertain.
Jacob’s assessment of the evidence on particular claims
Goal-directedness (AI systems consistently pursuing goals)
Jacob thinks that there is a clear trend towards systems acting more autonomously.
Goal misgeneralization (AI systems learning a goal that is perfectly correlated with the intended goal during training, but comes apart from the intended goal in testing and/or deployment):
Jacob thinks that misgeneralization is currently very common, but that it's unclear how this will play out in more powerful systems.
Jacob notes that currently there aren’t examples of misgeneralization at very high levels of abstraction (like taking over the world to make a cup of coffee), but that this is to be expected as current systems aren’t capable of reasoning at that level in the first place.
Jacob thinks there is a trend of misgeneralization happening at increasingly high levels of abstraction.
Jacob thinks that current generalization failures are easy to explain, and that it's easy to imagine mitigating them with more fine-tuning data. He expects this to apply to some extent to future generalization failures as well, although there are reasons to expect that this might not be enough for future models.
Power-seeking (AI systems effectively seeking power by resource acquisition, self-improvement, preventing shutdown, or other means):
Jacob doesn't think there is empirical evidence of power-seeking in existing models so far.
Misuse (humans intentionally using AIs to cause harm):
Jacob thinks that only around half of the existential risk from AI comes from misalignment, and that other things like misuse are also important.