Exploration versus Exploitation in Reinforcement Learning: a Stochastic Control Approach

Abstract: We study the paper by Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. The paper considers reinforcement learning (RL) in continuous time and studies the problem of achieving the best trade-off between exploration of a black-box environment and exploitation of current knowledge. The authors propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the state dynamics that captures repetitive learning under exploration. The resulting optimization problem is a revitalization of the classical relaxed stochastic control.
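To fix ideas, the entropy-regularized objective described above can be sketched as follows; the notation here is ours and the display is a schematic, not a quotation from the paper. The agent controls a distribution $\pi_t$ over actions $u \in U$ rather than a single action, and a differential-entropy term, weighted by a temperature parameter $\lambda > 0$, rewards randomized (exploratory) policies:

```latex
V(x) \;=\; \sup_{\pi}\; \mathbb{E}\!\left[\int_0^{\infty} e^{-\rho t}
\Big( \tilde{r}(X_t, \pi_t)
\;-\; \lambda \int_{U} \pi_t(u)\,\ln \pi_t(u)\,\mathrm{d}u \Big)\,\mathrm{d}t
\;\Big|\; X_0 = x \right],
```

where $\rho > 0$ is a discount rate and the exploratory state dynamics average the original coefficients over the action distribution, e.g. $\tilde{r}(x,\pi) = \int_{U} r(x,u)\,\pi(u)\,\mathrm{d}u$. Setting $\lambda = 0$ recovers a classical relaxed stochastic control problem, while larger $\lambda$ places more weight on exploration.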