The research problem

AI systems (much) smarter than humans could arrive in the next 10 years.

To manage potential risks these systems could pose, we need to solve a key technical problem: superhuman AI alignment (superalignment). How can we steer and control AI systems much smarter than us?

Reinforcement learning from human feedback (RLHF) has been very useful for aligning today’s models. But it fundamentally relies on humans’ ability to supervise our models.
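To make the dependency concrete: human judgment enters RLHF chiefly through preference labels used to train a reward model. Below is a minimal toy sketch of the standard pairwise (Bradley–Terry style) reward-model objective, not the authors' actual pipeline; the `RewardModel` class, embedding dimension, and synthetic data are illustrative assumptions standing in for real model activations and real human comparisons.

```python
# Toy sketch of the pairwise preference objective commonly used to train
# RLHF reward models: responses humans prefer should score higher than
# rejected ones. Not a production implementation.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (stand-in) response embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic "embeddings" of human-preferred vs. rejected responses.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The quality of the learned reward is bounded by the quality of the human comparisons that produce `chosen` and `rejected`, which is exactly the bottleneck that breaks down once the tasks exceed human evaluators' understanding.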

Humans won’t be able to reliably supervise AI systems much smarter than us. On complex tasks we won’t understand what the AI systems are doing, so we won’t be able to reliably evaluate their behavior.

Consider a future AI system proposing a million lines of extremely complicated code, in a new programming language it devised. Humans won’t be able to reliably tell whether the code is faithfully following instructions, or whether it is safe or dangerous to execute.

Current RLHF techniques might not scale to superintelligence. We will need new methods and scientific breakthroughs to ensure superhuman AI systems reliably follow their operator’s intent.

If we fail to align superhuman AI systems, failures could be much more egregious than with current systems—even catastrophic.

Weak-to-strong generalization

Can we understand and control how strong models generalize from weak supervision?

If humans cannot reliably supervise superhuman AI systems on complex tasks, we will instead need to ensure that models generalize our supervision on easier tasks (which humans can supervise) as desired.

We can study an analogous problem on which we can make empirical progress today: can we supervise a larger (more capable) model with a smaller (less capable) model?

An illustration of the weak-to-strong setup to study superalignment.

Strong pretrained models should have excellent latent capabilities. But can we elicit these latent capabilities with only weak supervision? Can the strong model generalize to correctly solve even difficult problems where the weak supervisor can provide only incomplete or flawed training labels? Deep learning has been remarkably successful thanks to its representation learning and generalization properties. Can we nudge these properties to work in our favor and find methods that improve generalization?
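To make the analogy concrete, here is a toy sketch of the experimental recipe using scikit-learn classifiers rather than the paper's pretrained language models: a deliberately handicapped weak model produces training labels for a stronger student, and the performance gap recovered (PGR) measures how much of the weak-to-strong gap the student closes. The dataset, the feature truncation used to weaken the supervisor, and the specific model classes are illustrative assumptions, not the paper's setup.

```python
# Toy analogue of the weak-to-strong experiment:
#   1. Train a weak supervisor on ground truth.
#   2. Use it to label held-out data ("weak supervision").
#   3. Fine-tune a strong student on those weak labels.
#   4. Compare against a strong ceiling trained on ground truth.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40, n_informative=10, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, train_size=3000, random_state=0)

# Weak supervisor: deliberately limited (sees only 5 of 40 features).
weak = LogisticRegression(max_iter=200).fit(X_sup[:, :5], y_sup)
weak_labels = weak.predict(X_train[:, :5])   # noisy weak supervision

# Strong student trained on weak labels vs. a strong ceiling trained on ground truth.
strong_from_weak = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_train, weak_labels)
strong_ceiling = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_train, y_train)

acc_weak = weak.score(X_test[:, :5], y_test)
acc_w2s = strong_from_weak.score(X_test, y_test)
acc_ceiling = strong_ceiling.score(X_test, y_test)

# Performance gap recovered (PGR): 0 = no better than the weak supervisor,
# 1 = fully matches the strong ceiling.
pgr = (acc_w2s - acc_weak) / (acc_ceiling - acc_weak)
print(f"weak={acc_weak:.3f}  weak-to-strong={acc_w2s:.3f}  ceiling={acc_ceiling:.3f}  PGR={pgr:.2f}")
```

The interesting question is how far the student can climb above its supervisor: a PGR near 1 would mean weak supervision suffices to elicit the strong model's latent capabilities, while a PGR near 0 would mean the student merely imitates its supervisor's errors.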

See our recent paper for initial work on this:

Weak-to-strong generalization

We think this is a huge opportunity to make iterative empirical progress on a core difficulty of aligning superhuman AI systems. We’d be excited to see much more work in this area!