Abstract
This talk investigates continuous-time entropy-regularized reinforcement learning (RL) for mean-field control problems with controlled common noise. We study the continuous-time counterpart of the Q-function in the mean-field model, previously coined the q-function in the single-agent model. It is shown that the controlled common noise gives rise to a double integral term in the exploratory dynamic programming equation, which renders the policy improvement iteration intricate. The policy improvement at each iteration can be characterized by a first-order condition using the notion of the partial linear derivative with respect to the policy. To devise model-free RL algorithms, we introduce the integrated q-function (Iq-function), defined on distributions of both state and action, and show that an optimal policy can be identified as a two-layer fixed point of the soft argmax operator of the Iq-function. The martingale characterization of the value function and the Iq-function is established by exhausting all test policies. This allows us to propose several algorithms, including an Actor-Critic q-learning algorithm in which the policy is updated in the Actor-step via the policy improvement rule induced by the partial linear derivative of the Iq-function, while the value function and the Iq-function are updated simultaneously in the Critic-step via the martingale orthogonality condition. In two examples, within and beyond the LQ-control framework, we implement and compare our algorithms with satisfactory performance.
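To fix ideas, the sketch below illustrates the Actor-Critic pattern described above in its simplest continuous-time q-learning form, on a one-dimensional LQ toy problem with no common noise and no mean-field interaction; it is a minimal schematic under these simplifying assumptions, not the paper's Iq-function algorithm, and all parametrizations, function names, and hyperparameters (J, q, theta, psi, alpha, gamma, etc.) are illustrative choices rather than those of the talk.

```python
# Schematic Actor-Critic q-learning on a 1-d LQ toy problem (no common noise,
# no mean-field term).  The parametrizations below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, dt, sigma, gamma = 1.0, 0.01, 1.0, 0.1      # horizon, time step, volatility, temperature
n_steps = int(T / dt)
alpha = 0.01                                    # learning rate

theta = np.zeros(2)          # value function:  J(t,x) = -x^2/2 + (T-t)*(theta0 + theta1*x^2)
psi = np.array([0.0, 1.0])   # q-function:      q(x,a) = -(a - psi0*x)^2/(2*psi1) - (gamma/2)*log(2*pi*gamma*psi1)

def J(t, x):
    return -0.5 * x**2 + (T - t) * (theta[0] + theta[1] * x**2)

def grad_J(t, x):
    return np.array([T - t, (T - t) * x**2])

def q(x, a):
    return -(a - psi[0] * x)**2 / (2 * psi[1]) - 0.5 * gamma * np.log(2 * np.pi * gamma * psi[1])

def grad_q(x, a):
    d = a - psi[0] * x
    return np.array([d * x / psi[1], d**2 / (2 * psi[1]**2) - gamma / (2 * psi[1])])

for episode in range(2000):
    x, g_theta, g_psi = 0.5, np.zeros(2), np.zeros(2)
    for i in range(n_steps):
        t = i * dt
        # Actor-step: sample from the Gibbs (soft argmax) policy induced by q, here Gaussian.
        a = rng.normal(psi[0] * x, np.sqrt(gamma * psi[1]))
        r = -0.5 * a**2                                   # running reward (control cost)
        x_next = x + a * dt + sigma * np.sqrt(dt) * rng.normal()
        # Increment of the candidate martingale  J(t, X_t) + int_0^t (r - q) ds.
        delta = J(t + dt, x_next) - J(t, x) + (r - q(x, a)) * dt
        # Critic-step: martingale orthogonality, with parameter gradients as test functions.
        g_theta += grad_J(t, x) * delta
        g_psi += grad_q(x, a) * delta
        x = x_next
    theta += alpha * g_theta
    psi += alpha * g_psi
    psi[1] = max(psi[1], 1e-3)                            # keep the variance parameter positive

print("theta =", theta, " psi =", psi)
```

In this simplified setting the Critic accumulates increments of the candidate martingale weighted by the parameter gradients (the chosen test functions), and the Actor is the Gibbs policy induced by the current q-function, mirroring the soft argmax structure mentioned in the abstract; the mean-field algorithm of the talk additionally tracks the state-action distribution through the Iq-function.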