Distributional Reinforcement Learning with Quantile Regression

Oct 27, 2017

Will Dabney, Mark Rowland, Marc G. Bellemare, Rémi Munos

* 16 pages

We propose a new active learning algorithm for parametric linear regression with random design. We provide finite sample convergence guarantees for general distributions in the misspecified model. This is the first active learner for this setting that provably can improve over passive learning. Unlike other learning settings (such as classification), in regression the passive learning rate of $O(1/\epsilon)$ cannot in general be improved upon. Nonetheless, the so-called `constant' in the rate of convergence, which is characterized by a distribution-dependent risk, can be improved in many cases. For a given distribution, achieving the optimal risk requires prior knowledge of the distribution. Following the stratification technique advocated in Monte-Carlo function integration, our active learner approaches the optimal risk using piecewise constant approximations.

* Neural Information Processing Systems, 2014
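
The stratification idea behind this active learner can be illustrated with a toy sketch: take a piecewise-constant partition of the input domain, use a small pilot sample to gauge the noise level in each stratum, and spend the remaining label budget where the noise is larger (a Neyman-style allocation). The 1-D model, the noise shape, and the budget split below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D model y = 2x + noise, where the noise level grows with x.
def label(x):
    return 2.0 * x + rng.normal(0.0, 0.1 + 2.0 * x)

K, budget, pilot = 4, 400, 10            # strata, total label budget, pilot size
edges = np.linspace(0.0, 1.0, K + 1)     # piecewise-constant partition of [0, 1]

# Pilot phase: estimate the label-noise level inside each stratum.
xs, ys, stds = [], [], []
for k in range(K):
    px = rng.uniform(edges[k], edges[k + 1], pilot)
    py = np.array([label(x) for x in px])
    xs.append(px); ys.append(py); stds.append(py.std())

# Neyman-style allocation: spend the remaining budget where the noise is larger.
stds = np.array(stds)
alloc = ((budget - K * pilot) * stds / stds.sum()).astype(int)
for k in range(K):
    px = rng.uniform(edges[k], edges[k + 1], alloc[k])
    py = np.array([label(x) for x in px])
    xs.append(px); ys.append(py)

slope, intercept = np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)
print(slope)   # close to the true slope 2.0
```

Compared with drawing all 400 labels uniformly, this allocation reduces the variance of the fit in exactly the sense the abstract describes: the convergence rate is unchanged, but the distribution-dependent constant improves.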

Toward Optimal Stratification for Stratified Monte-Carlo Integration

Mar 12, 2013

Alexandra Carpentier, Remi Munos

We consider the problem of adaptive stratified sampling for Monte Carlo integration of a noisy function, given a finite budget n of noisy evaluations of the function. We tackle the problem of adapting to the function both the number of samples in each stratum and the partition itself. More precisely, it is desirable to refine the partition of the domain in regions where the noise on the function, or the variations of the function, are highly heterogeneous. On the other hand, an overly refined stratification is not optimal: the more refined the stratification, the harder it is to adjust the allocation of the samples to it, i.e. to sample more points where the noise or the variations of the function are larger. We provide an algorithm that selects online, among a large class of partitions, the partition achieving the optimal trade-off, and allocates the samples almost optimally on this partition.
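
The allocation side of this trade-off can be sketched in a few lines, assuming the simplest possible fixed partition (the paper's contribution is selecting the partition online; this sketch only shows the sample-allocation step over a given one). The integrand, noise model, and budget below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy integrand on [0, 1]: the true integral of x**2 is 1/3.
def noisy_f(x):
    return x**2 + rng.normal(0.0, 0.05 + x, size=x.shape)

K, pilot, budget = 8, 30, 8000
edges = np.linspace(0.0, 1.0, K + 1)
w = np.diff(edges)                               # stratum probabilities

# Pilot pass: estimate the standard deviation inside each stratum.
pilot_vals = [noisy_f(rng.uniform(edges[k], edges[k + 1], pilot)) for k in range(K)]
stds = np.array([v.std() for v in pilot_vals])

# Neyman allocation: samples proportional to w_k * std_k minimizes the variance
# of the stratified estimate for a fixed partition.
alloc = np.maximum(1, ((budget - K * pilot) * w * stds / (w * stds).sum()).astype(int))

means = []
for k in range(K):
    extra = noisy_f(rng.uniform(edges[k], edges[k + 1], alloc[k]))
    means.append(np.concatenate([pilot_vals[k], extra]).mean())

estimate = float(np.sum(w * np.array(means)))
print(estimate)   # close to 1/3
```

The difficulty the abstract points to is visible here: the pilot estimates `stds` get noisier as K grows, which is why refining the partition indefinitely eventually hurts.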

Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

Oct 19, 2012

Alexandra Carpentier, Rémi Munos

* 23 pages, 3 figures, to appear in NIPS 2012 conference proceedings

* Appears in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007)

Autoregressive Quantile Networks for Generative Modeling

Jun 14, 2018

Georg Ostrovski, Will Dabney, Rémi Munos

* ICML 2018

Stochastic approximation for speeding up LSTD (and LSPI)

Nov 28, 2017

L. A. Prashanth, Nathaniel Korda, Rémi Munos

Fast gradient descent for drifting least squares regression, with application to bandits

Nov 20, 2014

Nathaniel Korda, Prashanth L. A., Rémi Munos

* In Advances in Neural Information Processing Systems 27 (NIPS), 2014

Thompson Sampling for 1-Dimensional Exponential Family Bandits

Jul 12, 2013

Nathaniel Korda, Emilie Kaufmann, Remi Munos


Thompson Sampling: An Asymptotically Optimal Finite Time Analysis

Jul 19, 2012

Emilie Kaufmann, Nathaniel Korda, Rémi Munos

* 15 pages, 2 figures, submitted to ALT (Algorithmic Learning Theory)

Pure Exploration for Multi-Armed Bandit Problems

Jun 09, 2010

Sébastien Bubeck, Rémi Munos, Gilles Stoltz

We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
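
The simple-regret criterion can be made concrete with a minimal sketch: a forecaster that explores uniformly for n rounds and then recommends the empirically best arm. Its simple regret is the gap between the best mean and the mean of the recommended arm. The Bernoulli arm means and the round-robin strategy below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical Bernoulli bandit; arm means are illustrative.
means = np.array([0.3, 0.5, 0.45])

def simple_regret(n_rounds):
    """Explore uniformly, then recommend the empirically best arm."""
    k = len(means)
    pulls = np.zeros(k)
    wins = np.zeros(k)
    for t in range(n_rounds):
        arm = t % k                        # round-robin exploration
        pulls[arm] += 1
        wins[arm] += rng.random() < means[arm]
    recommended = int(np.argmax(wins / pulls))
    return means.max() - means[recommended]

print(simple_regret(3000))   # 0.0 when the best arm is identified
```

Note that this forecaster deliberately incurs large cumulative regret (it keeps pulling bad arms), which is exactly the trade-off the lower bound in the abstract formalizes: small cumulative regret forces large simple regret, and vice versa.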

We address the problem of optimizing a Brownian motion. We consider a (random) realization $W$ of a Brownian motion with input space in $[0,1]$. Given $W$, our goal is to return an $\epsilon$-approximation of its maximum using the smallest possible number of function evaluations, the sample complexity of the algorithm. We provide an algorithm with sample complexity of order $\log^2(1/\epsilon)$. This improves over previous results of Al-Mharmah and Calvin (1996) and Calvin et al. (2017) which provided only polynomial rates. Our algorithm is adaptive (each query depends on previous values) and is an instance of the optimism-in-the-face-of-uncertainty principle.

* Neural Information Processing Systems (NeurIPS 2018)

* 10 pages, 2 figures

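
The optimism-in-the-face-of-uncertainty principle mentioned in the abstract can be sketched as follows. The Brownian path is sampled lazily via Brownian-bridge midpoint refinement, and the algorithm always splits the interval with the highest optimistic bound. The bound max(W(a), W(b)) + c*sqrt(b - a) and the constant c are heuristic stand-ins for the paper's confidence construction.

```python
import heapq
import numpy as np

rng = np.random.default_rng(3)

# Lazily sampled Brownian path: only queried points are stored.
W = {0.0: 0.0, 1.0: rng.normal(0.0, 1.0)}
c = 2.0   # heuristic confidence constant (an assumption, not the paper's choice)

def ucb(a, b):
    """Optimistic upper bound on max W over [a, b], given endpoint values."""
    return max(W[a], W[b]) + c * np.sqrt(b - a)

# Max-heap of intervals keyed by optimistic bound (negated for heapq's min-heap).
heap = [(-ucb(0.0, 1.0), 0.0, 1.0)]
for _ in range(200):
    _, a, b = heapq.heappop(heap)          # most promising interval
    m = 0.5 * (a + b)
    # Brownian bridge: W(m) | W(a), W(b) is Gaussian with std sqrt(b - a) / 2.
    W[m] = rng.normal(0.5 * (W[a] + W[b]), 0.5 * np.sqrt(b - a))
    heapq.heappush(heap, (-ucb(a, m), a, m))
    heapq.heappush(heap, (-ucb(m, b), m, b))

best_t = max(W, key=W.get)
print(best_t, W[best_t])   # location and value of the best query so far
```

Because queries concentrate where the optimistic bound is largest, the intervals near the maximum shrink geometrically, which is the mechanism behind the polylogarithmic sample complexity claimed above.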
A Distributional Perspective on Reinforcement Learning

Jul 21, 2017

Marc G. Bellemare, Will Dabney, Rémi Munos

In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.

* ICML 2017

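
A core operation in the algorithm described above is applying the distributional Bellman backup r + gamma*Z to a discrete approximate distribution and projecting the result back onto a fixed support of atoms. The sketch below shows that projection step for a single transition; the support bounds, atom count, and uniform next-state distribution are illustrative choices, not the paper's Atari settings.

```python
import numpy as np

# Fixed support of N atoms on [V_MIN, V_MAX]; these bounds are illustrative.
V_MIN, V_MAX, N = -10.0, 10.0, 51
z = np.linspace(V_MIN, V_MAX, N)
dz = z[1] - z[0]

def project(next_probs, reward, gamma):
    """Project the shifted/shrunk distribution r + gamma*z back onto z."""
    tz = np.clip(reward + gamma * z, V_MIN, V_MAX)
    b = (tz - V_MIN) / dz                    # fractional atom index of each tz
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros(N)
    for j in range(N):                       # split mass between neighbouring atoms
        out[lo[j]] += next_probs[j] * (hi[j] - b[j])
        out[hi[j]] += next_probs[j] * (b[j] - lo[j])
        if lo[j] == hi[j]:                   # b landed exactly on an atom
            out[lo[j]] += next_probs[j]
    return out

p = np.full(N, 1.0 / N)                      # uniform next-state value distribution
q = project(p, reward=1.0, gamma=0.99)
print(q.sum(), q @ z)   # mass 1; mean shifted by the reward, minus clipping loss
```

The linear-interpolation split preserves the mean of the clipped distribution exactly, so the projected backup stays consistent with the ordinary (expected-value) Bellman operator while retaining the full distributional shape.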
Minimax Regret Bounds for Reinforcement Learning

Jul 01, 2017

Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde{O}(HS \sqrt{AT})$ achieved by the UCRL2 algorithm of Jaksch et al., 2010. The key significance of our new results is that when $T\geq H^3S^3A$ and $SA\geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based "exploration bonuses" that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).

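
The structure of the optimistic modification (backward value iteration with an added exploration bonus, with values clipped at the remaining horizon) can be sketched on a toy tabular MDP. The uniform placeholder model and the simple Hoeffding-style bonus below are illustrative assumptions; the paper's algorithm uses real visit counts and the Bernstein-based bonuses described above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy finite-horizon MDP: S states, A actions, horizon H.
S, A, H = 3, 2, 5
R = rng.uniform(0.0, 1.0, size=(S, A))          # estimated mean rewards

counts = np.ones((S, A, S))                     # placeholder visit counts
P_hat = counts / counts.sum(axis=2, keepdims=True)  # empirical transition model
n = counts.sum(axis=2)                          # visits to each (s, a) pair

# Backward induction: Q_h = r + bonus + P_hat V_{h+1}, with values clipped at
# the number of remaining steps so optimism never exceeds the maximum return.
V = np.zeros(S)
for h in reversed(range(H)):
    bonus = np.sqrt(1.0 / n)                    # Hoeffding-style bonus (illustrative)
    Q = R + bonus + P_hat @ V
    V = np.minimum(Q.max(axis=1), float(H - h))
policy = Q.argmax(axis=1)                       # greedy policy at the first step
print(V, policy)
```

Clipping the optimistic values at the remaining horizon is one of the devices that keeps the optimism from compounding across steps; the paper's sharper scaling in $H$ additionally comes from the variance-aware bonuses.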
Selecting the State-Representation in Reinforcement Learning

Feb 11, 2013

Odalric-Ambrym Maillard, Rémi Munos, Daniil Ryabko

* NIPS 2011, pp. 2627-2635
