Reward Hacking: A Potential Source of Serious Al Misalignment
Reward Hacking: A Potential Source of Serious Al Misalignment


Author: Anthropic – Duration: 00:51:57
We discuss our new paper, “Natural Emergent Misalignment Due to Reward Hijacking in RL Production.” In this paper, we show for the first time that realistic AI training processes can accidentally produce misaligned models. Specifically, when large language models learn to cheat on software programming tasks, they display other, even more misaligned behaviors as an unintended consequence. These include concerning behaviors such as alignment tampering and sabotage of AI safety research. 00:00 Introduction 00:42 What is this work about? 5:21 How did we conduct our experiment? 14:48 Detecting model misalignment 22:17 Preventing misalignment from reward hacking 37:15 Alternative strategies 42:03 Limitations 44:25 How has this study changed our perspective? 50:31 Takeaways for people interested in conducting AI security research






