Connecting graduate students in artificial intelligence

Fourth AI Safety Discussion

KEC 2057

The Fall 2022 AIGSA reading group is about AGI (Artificial General Intelligence) Safety Fundamentals. Join us for the fourth meeting, even if you missed earlier meetings! Lunch will be provided.

This week’s topics are learning from humans and factored cognition. Here’s an introduction from Richard Ngo, the developer of the curriculum:

This week, we look at techniques for training AIs on human data (falling under “learn from teacher” in Christiano’s AI alignment landscape from last week). Next week, we’ll look at some ways to make these techniques more powerful and scalable…

Reinforcement learning from human feedback (RLHF) involves reinforcement learning agents receiving rewards determined by humans’ evaluations of their behavior (often via training a reward model from those human evaluations); this technique is used by Christiano et al. (2017)… In the long term we’d like models to assist us via formats more sophisticated than just modeling rewards. Saunders et al. (2022) take a step towards this by training AIs to generate natural-language feedback.

Finally, Christiano (2015) argues that inferring human values is likely a hard enough problem that the techniques discussed so far won’t be sufficient to reliably solve it - a possibility which motivates the extensions to these techniques discussed in subsequent weeks.
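For anyone who wants a concrete picture before the meeting, here is a minimal sketch of the reward-model step in RLHF described above. It assumes a Bradley-Terry style comparison between two behavior segments, in the spirit of Christiano et al. (2017). The function names and the toy reward values are hypothetical and chosen only for illustration; in practice the per-step rewards would come from a learned neural network rather than hand-written lists.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Modeled probability that a human prefers segment A over segment B.

    Each argument is a list of per-step rewards predicted by a
    (hypothetical) reward model. The probability is
    exp(sum_a) / (exp(sum_a) + exp(sum_b)), written here as a logistic
    function of the difference in summed rewards.
    """
    total_a = sum(rewards_a)
    total_b = sum(rewards_b)
    return 1.0 / (1.0 + math.exp(total_b - total_a))


def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy loss for one human comparison.

    The loss is small when the reward model assigns a higher total
    reward to the segment the human actually preferred.
    """
    p_a = preference_probability(rewards_a, rewards_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)


# Toy example: the reward model scores segment A higher than segment B.
segment_a = [0.5, 1.0, 0.5]
segment_b = [0.0, 0.5, 0.0]
loss_if_agree = preference_loss(segment_a, segment_b, human_prefers_a=True)
loss_if_disagree = preference_loss(segment_a, segment_b, human_prefers_a=False)
```

In this toy case the loss is lower when the human label agrees with the reward model's ranking. Training a reward model amounts to adjusting its parameters to reduce this loss across many human comparisons, and the agent is then trained with reinforcement learning against the learned rewards.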

To prepare, please spend ~1 hour reading these short pieces:

  1. Read both of the following blog posts. If you like, also read the full paper for whichever you found most interesting (if you’re undecided, default to the critiques paper):
    1. Deep RL from human preferences: blog post (Christiano et al., 2017) (10 mins)
    2. AI-written critiques help humans notice flaws: blog post (Saunders et al., 2022) (10 mins)
  2. The easy goal inference problem is still hard (Christiano, 2015) (10 mins)
  3. Factored cognition (Ought, 2019) (introduction and scalability section) (20 mins)