Connecting graduate students in artificial intelligence

Fall 2022 Reading Group

The topic: AGI (Artificial General Intelligence) Safety Fundamentals.

Why it matters: As powerful AI systems take on increasingly impactful tasks, we need to take great care to prevent catastrophic unintended consequences.

Why else you might enjoy participating: Chipotle and/or Cafe Yumm lunches provided.

The time commitment: Meet five times total this term, 1-2pm on alternating Thursdays. To prepare for each meeting, spend ~1 hour on a few short readings.

The logistical details: See the Event Schedule on

Communication: Join the conversation on the AIGSA Discord.


Kindly condensed for us by Richard Ngo from

Artificial general intelligence (Week 2: 10/6)

  1. Four background claims (Soares, 2015) (15 mins)
  2. AGI safety from first principles (Ngo, 2020) (from section 1 to end of 2.1) (20 mins)
  3. More is different for AI (Steinhardt, 2022) (only introduction, second post, third post) (20 mins)

Goals and misalignment (Week 4: 10/20)

  1. Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (15 mins)
  2. Goal misgeneralization in deep reinforcement learning (Langosco et al., 2022) (ending after section 3.3) (25 mins)
    • Those with less background in reinforcement learning can skip the parts of section 2.1 focused on formal definitions.
  3. What misalignment looks like as capabilities scale (Ngo, 2022) (only the section titled "Realistic training processes lead to the development of misaligned goals", including phases 1, 2 and 3) (30 mins)

Threat models and types of solutions (Week 6: 11/3)

  1. What failure looks like (Christiano, 2019) (10 mins)
  2. Intelligence explosion: evidence and import (Muehlhauser and Salamon, 2012) (only pages 10-15) (15 mins)
  3. AGI safety from first principles (Ngo, 2020) (only section 5: Control) (15 mins)
  4. AI alignment landscape (Christiano, 2020) (35 mins)

Learning from humans (Week 8: 11/17)

  1. Read both of the following blog posts, plus the full paper for whichever you found most interesting (if you’re undecided, default to the critiques paper):
    1. Deep RL from human preferences: blog post (Christiano et al., 2017) (10 mins)
    2. AI-written critiques help humans notice flaws: blog post (Saunders et al., 2022) (10 mins)
  2. The easy goal inference problem is still hard (Christiano, 2015) (10 mins)
  3. Factored cognition (Ought, 2019) (introduction and scalability section) (20 mins)

Decomposing tasks for outer alignment; interpretability (Week 10: 12/1)

  1. Summarizing books with human feedback: blog post (Wu et al., 2021) (5 mins)
  2. AI safety via debate (Irving et al., 2018) (ending after section 3) (35 mins)
    • Those without a background in complexity theory can skip section 2.2.
  3. Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)