NeurIPS 2019
The paper considers regret minimization in infinite-horizon undiscounted MDPs using the idea of an "exploration bonus": the environment is explored by planning over an MDP whose rewards are perturbed by a bonus that scales inversely with the number of times each state-action pair has been visited. Based on this idea, new online algorithms are developed; while their regret guarantees do not improve over previous work, the algorithms are computationally efficient, in contrast to existing methods. The paper received solid support from all three reviewers, who appreciated the technical quality of the work and its advance over previous work (in particular, Bartlett & Tewari '09) in terms of computational tractability.
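The count-based bonus mechanism described above can be illustrated with a minimal sketch (assumptions: NumPy arrays, a bonus constant `c`, and the inverse-count scaling stated in the summary; the paper's actual bonus, constants, and confidence terms differ, and related algorithms often scale the bonus with the inverse square root of the count instead):

```python
import numpy as np

def bonus_perturbed_rewards(R, N, c=1.0):
    """Rewards perturbed by a count-based exploration bonus.

    R: (S, A) array of empirical mean rewards.
    N: (S, A) array of visit counts per state-action pair.
    c: bonus scale (illustrative constant, not from the paper).
    """
    # The bonus shrinks as a state-action pair accumulates visits,
    # so planning on the perturbed rewards steers the agent toward
    # under-explored pairs; max(N, 1) avoids division by zero.
    return R + c / np.maximum(N, 1)
```

Planning (e.g., value iteration) is then run on the single MDP with these perturbed rewards, which is roughly what makes such bonus-based approaches cheaper than methods that must optimize over an entire confidence set of plausible MDPs.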