Dissecting Deep RL with High Update Ratios: Combatting Value Divergence

Abstract

We show that deep reinforcement learning can maintain its ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples. Under such large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we dissect the phenomena underlying the primacy bias. We inspect the early stages of training that ought to cause the failure to learn and find that a fundamental challenge is a long-standing acquaintance: value overestimation. Overinflated Q-values are found not only on out-of-distribution but also on in-distribution data and can be traced to unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in part, the prior explanation for sub-optimal learning due to overfitting on early data.
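The unit-ball normalization mentioned in the abstract can be read as projecting a Q-network's penultimate-layer features onto the unit sphere before the final linear layer, so that Q-value magnitudes are controlled by the last layer's weights rather than by unbounded feature growth. Below is a minimal PyTorch sketch of that idea; the layer sizes, names, and overall architecture are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnitBallQNetwork(nn.Module):
    """Q-network whose penultimate features are projected onto the unit ball.

    A minimal sketch of the unit-ball normalization idea; sizes and structure
    are illustrative assumptions rather than the paper's exact architecture.
    """

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        features = self.trunk(torch.cat([obs, action], dim=-1))
        # Project features onto the unit ball (L2 normalization), bounding
        # how quickly Q-value estimates can grow when many gradient updates
        # are taken per environment step.
        features = F.normalize(features, p=2.0, dim=-1)
        return self.head(features)
```

Fixing the feature norm in this way constrains the scale of the critic's outputs, which is one way to curb the value divergence the abstract attributes to high update-to-data ratios.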

Publication
Reinforcement Learning Conference (RLC)

Toronto Intelligent Systems Lab Co-authors

Claas Voelcker
PhD Student

My research focuses on task-aligned and value-aware model learning for reinforcement learning and control: agents that learn world models which are correct where it matters, adapting their losses to the task at hand.

Igor Gilitschenski
Assistant Professor