<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>What is AI "reward hacking"—and why do we worry about it?</title>
        <link>https://tube.grossholtz.net/videos/watch/66468146-873d-4af0-9957-2af01bcc6552</link>
        <description>We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the first time that realistic AI training processes can accidentally produce misaligned models. Specifically, when large language models learn to cheat on software programming tasks, they go on to display other, even more misaligned behaviors as an unintended consequence. These include concerning behaviors like alignment faking and sabotage of AI safety research. 00:00 Introduction 00:42 What is this work about? 05:21 How did we run our experiment? 14:48 Detecting models' misalignment 22:17 Preventing misalignment from reward hacking 37:15 Alternative strategies 42:03 Limitations 44:25 How has this study changed our views? 50:31 Takeaways for people interested in conducting AI safety research</description>
        <lastBuildDate>Mon, 06 Apr 2026 03:20:27 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>PeerTube - https://tube.grossholtz.net</generator>
        <image>
            <title>What is AI "reward hacking"—and why do we worry about it?</title>
            <url>https://tube.grossholtz.net/client/assets/images/icons/icon-512x512.png</url>
            <link>https://tube.grossholtz.net/videos/watch/66468146-873d-4af0-9957-2af01bcc6552</link>
        </image>
        <copyright>All rights reserved, unless otherwise specified in the terms at https://tube.grossholtz.net/about or in licenses granted by each content's rightholder.</copyright>
        <atom:link href="https://tube.grossholtz.net/feeds/video-comments.xml?videoId=66468146-873d-4af0-9957-2af01bcc6552" rel="self" type="application/rss+xml"/>
    </channel>
</rss>