Reward Learning
Summary
Reward learning is a central area of AI alignment research concerned with methods that let AI systems learn human objectives and preferences rather than rely on hand-specified reward functions. The field spans several approaches, including inverse reinforcement learning from demonstrations, inferring rewards from human feedback, and learning from other kinds of human behavior. Researchers have explored adversarial reward learning, measurement theory-based approaches for bridging the gap between engagement signals and the values users actually care about, and reward-rational (implicit) choice as a unifying formalism that interprets many types of feedback as evidence about the underlying reward. Open challenges include feedback channels that the agent itself can influence or tamper with, designing rewards that generalize across multiple environments, and accommodating users with differing capabilities and limited choice sets. Recent advances include decoupled approval algorithms that avoid tampering incentives, divide-and-conquer approaches that simplify reward design, and methods that interpret human demonstrations relative to the alternatives the person could have chosen. Researchers are also developing techniques for interpreting and auditing learned reward functions to check that they accurately reflect user preferences and remain robust across environments.
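To make the "inferring rewards from human feedback" idea concrete, below is a minimal sketch (not taken from any of the papers listed here) of one common recipe: fitting a linear reward model to pairwise human preferences under a Boltzmann-rational / Bradley-Terry choice model. All function names, hyperparameters, and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def traj_features(trajectory):
    """Sum per-step feature vectors; a trajectory is an array of shape (T, d)."""
    return trajectory.sum(axis=0)

def preference_prob(theta, feats_a, feats_b):
    """P(human prefers A over B) under a Boltzmann-rational choice model."""
    return 1.0 / (1.0 + np.exp(-(theta @ (feats_a - feats_b))))

def fit_reward(preferences, dim, lr=0.1, steps=500):
    """Maximize the log-likelihood of observed pairwise preferences.

    preferences: list of (feats_preferred, feats_rejected) pairs.
    Returns the learned reward weights theta.
    """
    theta = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for feats_pref, feats_rej in preferences:
            p = preference_prob(theta, feats_pref, feats_rej)
            # Gradient of log P(preferred > rejected) with respect to theta.
            grad += (1.0 - p) * (feats_pref - feats_rej)
        theta += lr * grad / len(preferences)
    return theta

# Toy usage: a "true" reward generates noisy preference labels over random trajectories.
true_theta = np.array([1.0, 0.0, -0.5])
pairs = []
for _ in range(200):
    a, b = rng.normal(size=(2, 5, 3))  # two trajectories, 5 steps, 3 features each
    fa, fb = traj_features(a), traj_features(b)
    if rng.random() < preference_prob(true_theta, fa, fb):
        pairs.append((fa, fb))
    else:
        pairs.append((fb, fa))

theta_hat = fit_reward(pairs, dim=3)
print("learned reward weights:", theta_hat)
```

The same likelihood-based template underlies richer variants: swapping the linear model for a neural network, or replacing pairwise comparisons with demonstrations, corrections, or other feedback types as in the reward-rational choice framing.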
Research Papers
- No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
- From Optimizing Engagement to Measuring Value
- Reward-rational (implicit) choice: A unifying formalism for reward learning
- Avoiding Tampering Incentives in Deep RL via Decoupled Approval
- Simplifying Reward Design through Divide-and-Conquer
- I Know What You Meant: Learning Human Objectives by (Under)estimating Their Choice Set
- Learning to Understand Goal Specifications by Modelling Reward
- Understanding Learned Reward Functions