A. Lazaric has been a research scientist at the Facebook AI Research (FAIR) lab since 2017; he was previously a researcher at Inria in the SequeL team. His main research topic is reinforcement learning, with extensive contributions to both the theoretical and algorithmic aspects of RL. In the last ten years he has studied the exploration-exploitation dilemma in both the multi-armed bandit and reinforcement learning frameworks, notably on the problems of regret minimization, best-arm identification, pure exploration, and hierarchical RL.
Tutorial: Exploration-Exploitation in Reinforcement Learning
Reinforcement Learning (RL) studies the problem of sequential decision-making when the environment (i.e., the dynamics and the reward) is initially unknown but can be learned through direct interaction. A crucial step in the learning problem is to properly balance the exploration of the environment, in order to gather useful information, and the exploitation of the learned policy to collect as much reward as possible. Recent theoretical results have shown that approaches based on optimism or posterior sampling (e.g., UCRL and PSRL) successfully solve the exploration-exploitation dilemma and may require exponentially fewer samples than simpler (but very popular) techniques such as epsilon-greedy to converge to near-optimal policies. While the optimism and posterior sampling principles are directly inspired by the multi-armed bandit literature, RL poses specific challenges (e.g., how “local” uncertainty propagates through the Markov dynamics), which require a more sophisticated theoretical analysis. The focus of the tutorial is to provide a formal definition of the exploration-exploitation dilemma, discuss its challenges, and review the main algorithmic principles and their theoretical guarantees for different optimality criteria (notably finite-horizon and average-reward problems). Throughout the tutorial we will discuss open problems and possible future research directions.