Math for AI Safety

Instructor: Lionel Levine

Mondays and Wednesdays
11:40am - 12:55pm ET
starting August 26, 2024

Classroom: Malott 205

Office Hour: Monday 3-4 in Malott 438

http://pi.math.cornell.edu/~levine/MAIS

Image: What DALL-E thinks this course will look like

Course Description

AI holds great promise and, many believe, great peril. What can mathematicians contribute to ensuring that promise is fulfilled, and peril avoided?

Topics may include: predictive coding, good regulator theorems, Markov decision processes, power-seeking theorems, signaling games, evolution of cooperation, open-source game theory, multi-agent learning, opponent shaping, logical uncertainty, usable information under computational constraints, proper scoring rules, forecast aggregation, Bayesian truth serum, coherence theorems, multi-objective optimization.

Related courses

This course is loosely modeled on the AI Alignment course taught by Roger Grosse at the University of Toronto.

Useful background

Machine learning, game theory, and stochastic processes (at the level of MATH 4740).

Books

Causality, 2nd edition (2009) by Judea Pearl
Deep Learning, by Christopher Bishop with Hugh Bishop
Introduction to AI Safety, Ethics, and Society, by Dan Hendrycks

Papers

My plan is to cover bits and pieces of some of the Causal Incentives Working Group papers, starting with Agent Incentives and Discovering agents.

Seminar

Starting in November, we'll devote each class to one of the following papers: a 45-minute student presentation followed by a 30-minute class discussion of the paper. The presentation can be slides (encouraged!) or blackboard. You should aim to state the paper's main result precisely and put it in context: what hole in human knowledge does this paper fill? The follow-up discussion will poke at the paper to examine its strengths and weaknesses, and identify open questions and research directions that build on it.

Causal Incentives
- Causal machine learning: survey and open problems (Kaddour, Lynch, et al, 2022)
- Modeling AGI safety frameworks with causal influence diagrams (Everitt, Kumar, Krakovna, and Legg, 2019)
- Causality in games (Hammond et al, 2023)
- Robust agents learn causal world models (Richens and Everitt, 2024)
- Intention and instrumental goals (Ward et al, 2024)

Anthropic's interpretability papers (2021-2024)
- A mathematical framework for transformer circuits
- Toy models of superposition
- Towards monosemanticity (sparse autoencoders)
- Scaling monosemanticity (more sparse autoencoders)

Probing and steering language models
- Discovering latent knowledge in language models without supervision (Burns et al, 2022)
- Representation engineering (Zou et al, 2023)
- Activation patching (Heimersheim and Nanda, 2024)

Hidden incentives
- Hidden incentives for auto-induced distributional shift (Krueger, Maharaj, and Leike, 2020)
- Estimating and penalizing induced preference shifts in recommender systems (Carroll et al, 2022)

Cooperation and bounded agents
- Open-source game theory (Critch, Dennis, and Russell, 2022). See also:
  - Robust Cooperation in the Prisoner's Dilemma (Barasz et al, 2014)
  - Parametric bounded Löb's theorem and robust cooperation of bounded agents (Critch, 2016)
- Logical induction (Garrabrant et al, 2016)

Class notes

I'll ask for a volunteer to take notes each class! Instructions for notetakers: Use this LaTeX template, update the date and topic in the filename and in the header, and email me the .tex and .pdf of your notes so I can post them here. If anything in the lecture was confusing, you're encouraged to send me a draft of the notes and ask me questions! Notes are due 1 week after the lecture.

2024 Aug 26 & 28: Math for AI Safety slides

2024 Sep 4: Conditional Independence
2024 Sep 9: d-separation theorem
2024 Sep 11: G-Markov distributions
2024 Sep 16: Causal models: P(A|B) versus P(A|do(B))
2024 Sep 18: Causal models: counterfactuals
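
A minimal sketch of the difference between P(A|B) and P(A|do(B)), assuming a hypothetical three-variable model that is not taken from the lecture notes: a confounder C influences both A and B, and B has no causal effect on A. All probabilities below are made-up illustrative numbers.

```python
from fractions import Fraction as F

# Hypothetical model (an assumption for illustration): C -> A and C -> B, no edge B -> A.
p_c = {0: F(1, 2), 1: F(1, 2)}            # P(C = c)
p_b1_given_c = {0: F(1, 10), 1: F(9, 10)}  # P(B = 1 | C = c)
p_a1_given_c = {0: F(1, 5), 1: F(4, 5)}    # P(A = 1 | C = c)


def p_a1_given_b1():
    """Observational P(A=1 | B=1): seeing B=1 is evidence about C."""
    joint_ab = sum(p_c[c] * p_b1_given_c[c] * p_a1_given_c[c] for c in p_c)
    marginal_b = sum(p_c[c] * p_b1_given_c[c] for c in p_c)
    return joint_ab / marginal_b


def p_a1_do_b1():
    """Interventional P(A=1 | do(B=1)): setting B cuts the edge C -> B,
    so C keeps its prior distribution and A is unaffected."""
    return sum(p_c[c] * p_a1_given_c[c] for c in p_c)


print(p_a1_given_b1())  # 37/50
print(p_a1_do_b1())     # 1/2
```

In this toy model, observing B=1 raises the probability of A from 1/2 to 37/50 purely through the confounder, while intervening on B leaves it at 1/2.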

2024 Sep 23: Experiments with OpenAI's reasoning model o1. Speculations on long chain of thought leading to a trapped prior.

2024 Sep 25: Influence diagrams, value of information, response incentive, and value of control as defined in the Agent Incentives paper by Everitt et al.

2024 Sep 30 & Oct 2: Using causal models to reason about hidden incentives. Examples:
- sycophancy incentive in RLHF
- opinion shaping incentive in content recommendation
- non-myopic incentive in language models

2024 Oct 7: Causal games and mechanised causal diagrams, as defined in the papers Discovering agents (Kenton et al, 2022) and Causality in games (Hammond et al, 2023).

2024 Oct 16: Overview of the research papers we'll cover in November, so you can make an informed choice of which paper to present!

2024 Oct 21: von Neumann-Morgenstern coherence theorem: Agents that do not maximize expected utility are exploitable
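
One standard illustration of exploitability is a money pump against cyclic preferences, which violate the transitivity axiom. The sketch below is an assumption-laden toy, not necessarily the argument given in lecture: the items, the fee, and the function name `run_money_pump` are all hypothetical.

```python
def run_money_pump(prefers, start_item, fee, rounds):
    """Repeatedly offer the agent an item it strictly prefers to its current
    one, charging `fee` per trade. `prefers` is a set of (better, worse) pairs."""
    item, wealth = start_item, 0
    for _ in range(rounds):
        # Find the item the agent prefers to what it currently holds.
        offer = next(better for (better, worse) in prefers if worse == item)
        item = offer    # the agent accepts, since it strictly prefers the offer
        wealth -= fee   # and pays the fee for the privilege
    return item, wealth


# Cyclic (intransitive) preferences: A over B, B over C, C over A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

print(run_money_pump(cyclic, "A", fee=1, rounds=3))  # ('A', -3)
```

After one full cycle the agent holds the item it started with and is three fees poorer; repeating the cycle extracts an unbounded amount.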

2024 Oct 23: Dutch book theorems: Agents with incoherent beliefs are exploitable. How to quantify the incoherence of a set of beliefs. How to aggregate multiple weak predictions into one strong prediction.
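
A minimal sketch of a Dutch book, assuming the simplest possible setup (not necessarily the one used in lecture): the agent states credences for an event A and for its complement, and is willing to buy or sell a $1 bet on each at a price equal to its credence. The function name `agent_payoffs` and the example credences are hypothetical.

```python
from fractions import Fraction as F


def agent_payoffs(p_a, p_not_a):
    """Agent's net payoff in each world when a bookie trades $1 bets on A and
    on not-A at the agent's own prices. If the credences sum to more than 1
    the bookie sells both bets to the agent; otherwise the bookie buys both."""
    side = 1 if p_a + p_not_a > 1 else -1  # +1: agent buys, -1: agent sells
    payoffs = {}
    for a_is_true in (True, False):
        payout_a = 1 if a_is_true else 0
        payout_not_a = 1 - payout_a
        payoffs[a_is_true] = side * ((payout_a - p_a) + (payout_not_a - p_not_a))
    return payoffs


# Incoherent credences (sum to 6/5): the agent loses 1/5 whether or not A occurs.
print(agent_payoffs(F(3, 5), F(3, 5)))
# Coherent credences (sum to 1): no sure loss is available.
print(agent_payoffs(F(3, 5), F(2, 5)))
```

In this toy setup the guaranteed loss equals the distance of p_a + p_not_a from 1, which gives one crude way to quantify incoherence for a pair of complementary beliefs.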

2024 Oct 28 & 30: Prepare for your seminar presentation (Lionel in Cambridge, UK this week)

2024 Nov 4: Common knowledge, Aumann agreement theorem, bounded rationality

2024 Nov 6: Overview of optional student research projects!

Seminar (student presentations of research papers)

2024 Nov 11: Guiding formal theorem provers with informal proofs (Presenter: Baran Zadeoğlu)
2024 Nov 13: Toy models of superposition (Presenter: Jacob Ornelas)
2024 Nov 18: Learning time-scales in two-layer neural networks (Presenter: Haoxuan Fu)
2024 Nov 20: Undetectable watermarks for language models (Presenter: Elijah Blum)
2024 Nov 25: Open problems in causal machine learning (Presenter: Suvadip Sana)
2024 Dec 2: Hidden incentives for auto-induced distributional shift (Presenter: Arkar Oak Soe)
2024 Dec 4: Open-source game theory (Presenter: Matthew Haulmark or Lionel)

Presenter: Set the stage for your talk by crafting 1-3 warmup questions for us to think about beforehand. Email me the questions at least 72 hours before your presentation and I'll pass them on for everyone to think about. The warmup questions should be about background knowledge or context that's useful for understanding the paper you're presenting.

Audience: You'll get the most out of the seminar if you look at the relevant paper beforehand and come with questions about it!

2024 Dec 9 (last class): Student research projects, or debug Lionel's research program

Questions

Email me with questions about the course, or to request a particular topic!