Connecting graduate students in artificial intelligence

Final AI Safety Discussion

KEC 2087

The Fall 2022 AIGSA reading group is about AGI (Artificial General Intelligence) Safety Fundamentals. Join us for the final meeting, even if you missed earlier meetings! Lunch will be provided.

This week’s topic is interpretability. Here’s an introduction from Richard Ngo, the developer of the curriculum:

Our current methods of training capable neural networks give us very little insight into how or why they function. This week we cover the field of interpretability, which aims to change this by developing methods for understanding how neural networks think… Olah et al. (2020) explore how neural circuits build up representations of high-level features out of lower-level features.

Another topic for this week is the hilarity that ensues from serious attempts to solve the alignment problem. Here’s an introduction from Kendrea, the developer of this aberration from the curriculum:

Here’s a blog post I just came across and thought was funny. It’s written by a psychiatrist, and it’s about nonsensical violent fan fiction. At the same time, I genuinely think it nicely ties together some topics we’ve discussed this term, and it’s a great look at what alignment research organizations are working on right now. Hope you enjoy!

To prepare, please spend ~1 hour reading these pieces:

1. Summarizing books with human feedback: blog post (Wu et al., 2021) (5 mins)

2. AI safety via debate (Irving et al., 2018) (ending after section 3) (35 mins). Those without a background in complexity theory can skip section 2.2.