Reinforcement Learning

We develop new, efficient policy search algorithms based on information-geometric principles. These principles let us control the greediness of the policy update, so the algorithm can move towards good solutions without letting the variance of the search distribution collapse. Information geometry thereby provides a principled way to specify the exploitation-exploration trade-off in continuous-action reinforcement learning.
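Most of the algorithms listed below instantiate the same information-theoretic update, sketched here in its episodic form. The notation is ours (search distribution π(θ) over policy parameters θ, expected return R(θ), KL bound ε) and is only meant to fix ideas, not to reproduce any individual paper's derivation:

    % Generic KL-bounded policy update (episodic form). The bound \epsilon sets the
    % greediness of the update and keeps the search distribution from collapsing.
    \begin{align*}
      \max_{\pi}\quad  & \int \pi(\theta)\, R(\theta)\, \mathrm{d}\theta \\
      \text{s.t.}\quad & \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{old}}\big) \le \epsilon,
                         \qquad \int \pi(\theta)\, \mathrm{d}\theta = 1.
    \end{align*}
    % The solution re-weights the old search distribution,
    %    \pi(\theta) \propto \pi_{\mathrm{old}}(\theta)\, \exp\big(R(\theta)/\eta\big),
    % where the temperature \eta is the Lagrange multiplier of the KL constraint.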

Survey papers:

  • M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, iss. 1-2, pp. 388-403, 2013.
    [BibTeX] [Abstract] [Download PDF]

    Policy search is a subfield in reinforcement learning which focuses on finding good parameters for a given policy parametrization. It is well suited for robotics as it can cope with high-dimensional state and action spaces, one of the main challenges in robot learning. We review recent successes of both model-free and model-based policy search in robot learning. Model-free policy search is a general approach to learn policies based on sampled trajectories. We classify model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy and present a unified view on existing algorithms. Learning a policy is often easier than learning an accurate forward model, and, hence, model-free methods are more frequently used in practice. However, for each sampled trajectory, it is necessary to interact with the robot, which can be time consuming and challenging in practice. Model-based policy search addresses this problem by first learning a simulator of the robot's dynamics from data. Subsequently, the simulator generates trajectories that are used for policy learning. For both model-free and model-based policy search methods, we review their respective properties and their applicability to robotic systems.

    @article{lirolem28029,
    pages = {388--403},
    title = {A survey on policy search for robotics},
    month = {August},
    publisher = {Now Publishers},
    author = {M. P. Deisenroth and G. Neumann and J. Peters},
    number = {1-2},
    volume = {2},
    journal = {Foundations and Trends in Robotics},
    year = {2013},
    abstract = {Policy search is a subfield in reinforcement learning which focuses on finding good parameters for a given policy parametrization. It is well suited for robotics as it can cope with high-dimensional state and action spaces, one of the main challenges in robot learning. We review recent successes of both model-free and model-based policy search in robot learning. Model-free policy search is a general approach to learn policies based on sampled trajectories. We classify model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy and present a unified view on existing algorithms. Learning a policy is often easier than learning an accurate forward model, and, hence, model-free methods are more frequently used in practice. However, for each sampled trajectory, it is necessary to interact with the robot, which can be time consuming and challenging in practice. Model-based policy search addresses this problem by first learning a simulator of the robot's dynamics from data. Subsequently, the simulator generates trajectories that are used for policy learning. For both model-free and model-based policy search methods, we review their respective properties and their applicability to robotic systems.},
    url = {http://eprints.lincoln.ac.uk/28029/}
    }

  • C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A Survey of Preference-Based Reinforcement Learning Methods,” Journal of Machine Learning Research, vol. 18, iss. 136, pp. 1-46, 2017.
    [BibTeX] [Download PDF]
    @Article{JMLR:v18:16-634,
    Title = {A Survey of Preference-Based Reinforcement Learning Methods},
    Author = {Christian Wirth and Riad Akrour and Gerhard Neumann and Johannes F{\"u}rnkranz},
    Journal = {Journal of Machine Learning Research},
    Year = {2017},
    Number = {136},
    Pages = {1-46},
    Volume = {18},
    Url = {http://jmlr.org/papers/v18/16-634.html}
    }

  • C. Dann, G. Neumann, and J. Peters, “Policy evaluation with temporal differences: a survey and comparison,” Journal of Machine Learning Research, vol. 15, pp. 809-883, 2014.
    [BibTeX] [Abstract] [Download PDF]

    Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.

    (For illustration, a minimal batch-LSTD sketch in Python follows the BibTeX entry below.)

    @article{lirolem25768,
    year = {2014},
    journal = {Journal of Machine Learning Research},
    volume = {15},
    author = {C. Dann and G. Neumann and J. Peters},
    publisher = {Massachusetts Institute of Technology Press (MIT Press) / Microtome Publishing},
    month = {March},
    title = {Policy evaluation with temporal differences: a survey and comparison},
    pages = {809--883},
    abstract = {Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches.
    This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.},
    url = {http://eprints.lincoln.ac.uk/25768/},
    }
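    For reference, a minimal batch-LSTD implementation with linear features is sketched below. The feature map phi, the transition format and the ridge regularizer are assumptions of this example rather than details taken from the survey.

      import numpy as np

      def lstd(transitions, phi, gamma=0.99, reg=1e-6):
          """Batch LSTD: estimate w such that V(s) ~ phi(s) @ w for a fixed policy.

          transitions: list of (s, r, s_next, done) tuples sampled on-policy.
          phi:         feature map, phi(s) -> 1-D numpy array of length k.
          """
          k = len(phi(transitions[0][0]))
          A = reg * np.eye(k)              # small ridge term keeps A invertible
          b = np.zeros(k)
          for s, r, s_next, done in transitions:
              f = phi(s)
              f_next = np.zeros(k) if done else phi(s_next)
              A += np.outer(f, f - gamma * f_next)
              b += r * f
          return np.linalg.solve(A, b)

      # Usage on a toy 5-state chain with one-hot features (transitions collected
      # by running the policy to be evaluated):
      # phi = lambda s: np.eye(5)[s]
      # w = lstd(transitions, phi)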

A non-exhaustive list of papers can be found below.

Papers:

  • Machine Learning Journal 2016: Probabilistic inference for determining options in reinforcement learning

    Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi-Markov decision process (SMDP) setting and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks. (A small illustrative sketch of how options are executed, i.e. gating, sub-policy and termination, follows the BibTeX entry below.)

    • C. Daniel, H. van Hoof, J. Peters, and G. Neumann, “Probabilistic inference for determining options in reinforcement learning,” Machine Learning, vol. 104, iss. 2-3, pp. 337-357, 2016.
      [BibTeX] [Abstract] [Download PDF]

      Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.

      @article{lirolem25739,
      number = {2-3},
      month = {September},
      title = {Probabilistic inference for determining options in reinforcement learning},
      pages = {337--357},
      publisher = {Springer},
      author = {C. Daniel and H. van Hoof and J. Peters and G. Neumann},
      year = {2016},
      journal = {Machine Learning},
      volume = {104},
      url = {http://eprints.lincoln.ac.uk/25739/},
      abstract = {Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.}
      }
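    To make the option framework concrete, the sketch below shows how a learned set of components would be executed: a gating (activation) policy picks a sub-policy, which acts until its termination probability fires. The env interface and all signatures are hypothetical placeholders; the paper's actual contribution, inferring these components from data, is not reproduced here.

      import numpy as np

      rng = np.random.default_rng(0)

      def run_options(env, gating, sub_policies, terminations, n_steps=200):
          """Execute an option-based agent for one episode.

          gating(s)          -> probability vector over options in state s
          sub_policies[o](s) -> action chosen by option o in state s
          terminations[o](s) -> probability that option o terminates in state s
          """
          s = env.reset()
          ret = 0.0
          o = rng.choice(len(sub_policies), p=gating(s))          # initiate first option
          for _ in range(n_steps):
              a = sub_policies[o](s)
              s, r, done = env.step(a)                            # hypothetical env API
              ret += r
              if done:
                  break
              if rng.random() < terminations[o](s):               # option terminates,
                  o = rng.choice(len(sub_policies), p=gating(s))  # gate picks the next one
          return ret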

  • JMLR 2016: Hierarchical Relative Entropy Policy Search

    Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that are strongly structured. Such task structures can be exploited by incorporating hierarchical policies that consist of gating networks and sub-policies. However, this concept has only been partially explored for real world settings, and complete methods, derived from first principles, are needed. Real world settings are challenging due to large and continuous state-action spaces that are prohibitive for exhaustive sampling methods. We define the problem of learning sub-policies in continuous state-action spaces as finding a hierarchical policy that is composed of a high-level gating policy to select the low-level sub-policies for execution by the agent. In order to efficiently share experience with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables, which allows for distribution of the update information between the sub-policies. We present three different variants of our algorithm, designed to be suitable for a wide variety of real world robot learning tasks, and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons. (A small sketch of sampling from such a gated hierarchical policy follows the BibTeX entry below.)

    • C. Daniel, G. Neumann, O. Kroemer, and J. Peters, “Hierarchical relative entropy policy search,” Journal of Machine Learning Research, vol. 17, pp. 1-50, 2016.
      [BibTeX] [Abstract] [Download PDF]

      Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that are strongly structured. Such task structures can be exploited by incorporating hierarchical policies that consist of gating networks and sub-policies. However, this concept has only been partially explored for real world settings and complete methods, derived from first principles, are needed. Real world settings are challenging due to large and continuous state-action spaces that are prohibitive for exhaustive sampling methods. We define the problem of learning sub-policies in continuous state action spaces as finding a hierarchical policy that is composed of a high-level gating policy to select the low-level sub-policies for execution by the agent. In order to efficiently share experience with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables which allows for distribution of the update information between the sub-policies. We present three different variants of our algorithm, designed to be suitable for a wide variety of real world robot learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons.

      @article{lirolem25743,
      volume = {17},
      year = {2016},
      journal = {Journal of Machine Learning Research},
      month = {June},
      title = {Hierarchical relative entropy policy search},
      pages = {1--50},
      author = {C. Daniel and G. Neumann and O. Kroemer and J. Peters},
      publisher = {Massachusetts Institute of Technology Press (MIT Press) / Microtome Publishing},
      abstract = {Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that
      are strongly structured. Such task structures can be exploited by incorporating hierarchical policies
      that consist of gating networks and sub-policies. However, this concept has only been partially explored
      for real world settings and complete methods, derived from first principles, are needed. Real
      world settings are challenging due to large and continuous state-action spaces that are prohibitive
      for exhaustive sampling methods. We define the problem of learning sub-policies in continuous
      state action spaces as finding a hierarchical policy that is composed of a high-level gating policy to
      select the low-level sub-policies for execution by the agent. In order to efficiently share experience
      with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables
      which allows for distribution of the update information between the sub-policies. We present three
      different variants of our algorithm, designed to be suitable for a wide variety of real world robot
      learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several
      simulations and comparisons.},
      url = {http://eprints.lincoln.ac.uk/25743/}
      }
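    As a concrete picture of the hierarchy described above, the sketch below samples from a policy of the form pi(a|s) = sum_o pi(o|s) pi(a|s,o), with a soft-max gate over linear-Gaussian sub-policies. All parameter names and shapes are illustrative assumptions; the HiREPS update itself, which treats the sub-policy assignment as a latent variable, is not shown.

      import numpy as np

      rng = np.random.default_rng(0)

      def sample_hierarchical_action(s, gate_weights, sub_means, sub_stds):
          """Sample (option, action) from a gated hierarchical policy.

          gate_weights: (n_options, d_state)            soft-max gating parameters
          sub_means:    (n_options, d_action, d_state)  linear sub-policy means
          sub_stds:     (n_options, d_action)           exploration noise per sub-policy
          """
          logits = gate_weights @ s
          p_o = np.exp(logits - logits.max())
          p_o /= p_o.sum()                        # pi(o|s): high-level gating policy
          o = rng.choice(len(p_o), p=p_o)         # latent sub-policy index
          mean = sub_means[o] @ s                 # pi(a|s,o): Gaussian sub-policy
          return o, mean + sub_stds[o] * rng.standard_normal(mean.shape)

      # Example: 3 options, 4-dimensional state, 2-dimensional action.
      # s = np.random.randn(4)
      # o, a = sample_hierarchical_action(s, np.random.randn(3, 4),
      #                                   np.random.randn(3, 2, 4), np.ones((3, 2)))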

  • Artificial Intelligence Journal: Model-based contextual policy search for data-efficient generalization of robot skills

    In robotics, lower-level controllers are typically used to make the robot solve a specific task in a fixed context. For example, the lower-level controller can encode a hitting movement while the context defines the target coordinates to hit. However, in many learning problems the context may change between task executions. To adapt the policy to a new context, we utilize a hierarchical approach by learning an upper-level policy that generalizes the lower-level controllers to new contexts. A common approach to learn such upper-level policies is to use policy search. However, the majority of current contextual policy search approaches are model-free and require a high number of interactions with the robot and its environment. Model-based approaches are known to significantly reduce the number of robot experiments; however, current model-based techniques cannot be applied straightforwardly to the problem of learning contextual upper-level policies. They rely on specific parametrizations of the policy and the reward function, which are often unrealistic in the contextual policy search formulation. In this paper, we propose a novel model-based contextual policy search algorithm that is able to generalize lower-level controllers and is data-efficient. Our approach is based on learned probabilistic forward models and information-theoretic policy search. Unlike current algorithms, our method does not require any assumption on the parametrization of the policy or the reward function. We show on complex simulated robotic tasks and in a real robot experiment that the proposed learning framework speeds up the learning process by up to two orders of magnitude in comparison to existing methods, while learning high-quality policies. (An illustrative sketch of a contextual upper-level policy with a reward-weighted update follows the BibTeX entry below.)

    • A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, “Model-based contextual policy search for data-efficient generalization of robot skills,” Artificial Intelligence, vol. 247, pp. 415-439, 2017.
      [BibTeX] [Abstract] [Download PDF]

      In robotics, lower-level controllers are typically used to make the robot solve a specific task in a fixed context. For example, the lower-level controller can encode a hitting movement while the context defines the target coordinates to hit. However, in many learning problems the context may change between task executions. To adapt the policy to a new context, we utilize a hierarchical approach by learning an upper-level policy that generalizes the lower-level controllers to new contexts. A common approach to learn such upper-level policies is to use policy search. However, the majority of current contextual policy search approaches are model-free and require a high number of interactions with the robot and its environment. Model-based approaches are known to significantly reduce the amount of robot experiments, however, current model-based techniques cannot be applied straightforwardly to the problem of learning contextual upper-level policies. They rely on specific parametrizations of the policy and the reward function, which are often unrealistic in the contextual policy search formulation. In this paper, we propose a novel model-based contextual policy search algorithm that is able to generalize lower-level controllers, and is data-efficient. Our approach is based on learned probabilistic forward models and information theoretic policy search. Unlike current algorithms, our method does not require any assumption on the parametrization of the policy or the reward function. We show on complex simulated robotic tasks and in a real robot experiment that the proposed learning framework speeds up the learning process by up to two orders of magnitude in comparison to existing methods, while learning high quality policies.

      @article{lirolem25774,
      year = {2017},
      journal = {Artificial Intelligence},
      volume = {247},
      author = {A. Kupcsik and M. P. Deisenroth and J. Peters and A. P. Loh and P. Vadakkepat and G. Neumann},
      publisher = {Elsevier},
      title = {Model-based contextual policy search for data-efficient generalization of robot skills},
      month = {June},
      pages = {415--439},
      url = {http://eprints.lincoln.ac.uk/25774/},
      abstract = {In robotics, lower-level controllers are typically used to make the robot solve a specific task in a fixed context. For example, the lower-level controller can encode a hitting movement while the context defines the target coordinates to hit. However, in many learning problems the context may change between task executions. To adapt the policy to a new context, we utilize a hierarchical approach by learning an upper-level policy that generalizes the lower-level controllers to new contexts. A common approach to learn such upper-level policies is to use policy search. However, the majority of current contextual policy search approaches are model-free and require a high number of interactions with the robot and its environment. Model-based approaches are known to significantly reduce the amount of robot experiments, however, current model-based techniques cannot be applied straightforwardly to the problem of learning contextual upper-level policies. They rely on specific parametrizations of the policy and the reward function, which are often unrealistic in the contextual policy search formulation. In this paper, we propose a novel model-based contextual policy search algorithm that is able to generalize lower-level controllers, and is data-efficient. Our approach is based on learned probabilistic forward models and information theoretic policy search. Unlike current algorithms, our method does not require any assumption on the parametrization of the policy or the reward function. We show on complex simulated robotic tasks and in a real robot experiment that the proposed learning framework speeds up the learning process by up to two orders of magnitude in comparison to existing methods, while learning high quality policies.}
      }
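    The sketch below shows the kind of upper-level policy that is being generalized: a linear-Gaussian mapping from context to lower-level controller parameters, updated by exponentiated-reward weighted maximum likelihood. This is a generic, model-free illustration under our own assumptions (parameter names, fixed temperature); the paper's data-efficient algorithm additionally learns probabilistic forward models to generate the required samples.

      import numpy as np

      class ContextualUpperLevelPolicy:
          """Linear-Gaussian upper-level policy pi(w | s) = N(W @ s, Sigma)."""

          def __init__(self, d_context, d_params, init_std=1.0):
              self.W = np.zeros((d_params, d_context))
              self.Sigma = init_std ** 2 * np.eye(d_params)

          def sample(self, s, rng):
              """Draw lower-level controller parameters w for context s."""
              return rng.multivariate_normal(self.W @ s, self.Sigma)

          def weighted_ml_update(self, contexts, params, rewards, temperature=1.0):
              """Re-fit W and Sigma by reward-weighted maximum likelihood.

              contexts: (N, d_context), params: (N, d_params), rewards: (N,).
              The soft-max temperature stands in for the Lagrange multiplier of a
              KL bound in information-theoretic policy search.
              """
              d = np.exp((rewards - rewards.max()) / temperature)
              d /= d.sum()
              S = contexts
              A = (S * d[:, None]).T @ S + 1e-6 * np.eye(S.shape[1])
              B = (S * d[:, None]).T @ params
              self.W = np.linalg.solve(A, B).T    # weighted least squares for the mean map
              err = params - contexts @ self.W.T
              self.Sigma = (d[:, None] * err).T @ err + 1e-6 * np.eye(err.shape[1])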

  • NIPS 2015: Model-Based Relative Entropy Stochastic Search

    Stochastic search algorithms are general black-box optimizers. Due to their ease of use and their generality, they have recently also gained a lot of attention in operations research, machine learning and policy search. Yet, these algorithms require a lot of evaluations of the objective, scale poorly with the problem dimension, are affected by highly noisy objective functions and may converge prematurely. To alleviate these problems, we introduce a new surrogate-based stochastic search approach. We learn simple, quadratic surrogate models of the objective function. As the quality of such a quadratic approximation is limited, we do not greedily exploit the learned models; the algorithm could otherwise be misled by an inaccurate optimum introduced by the surrogate. Instead, we use information-theoretic constraints to bound the 'distance' between the new and old data distribution while maximizing the objective function. Additionally, the new method is able to sustain the exploration of the search distribution to avoid premature convergence. We compare our method with state-of-the-art black-box optimization methods on standard uni-modal and multi-modal optimization functions, on simulated planar robot tasks and on a complex robot ball-throwing task. The proposed method considerably outperforms the existing approaches. (A simplified sketch of one surrogate-based update step follows the BibTeX entry below.)

    • A. Abdolmaleki, R. Lioutikov, N. Lau, L. P. Reis, J. Peters, and G. Neumann, “Model-based relative entropy stochastic search,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 153-154.
      [BibTeX] [Abstract] [Download PDF]

      Stochastic search algorithms are general black-box optimizers. Due to their ease of use and their generality, they have recently also gained a lot of attention in operations research, machine learning and policy search. Yet, these algorithms require a lot of evaluations of the objective, scale poorly with the problem dimension, are affected by highly noisy objective functions and may converge prematurely. To alleviate these problems, we introduce a new surrogate-based stochastic search approach. We learn simple, quadratic surrogate models of the objective function. As the quality of such a quadratic approximation is limited, we do not greedily exploit the learned models. The algorithm can be misled by an inaccurate optimum introduced by the surrogate. Instead, we use information theoretic constraints to bound the 'distance' between the new and old data distribution while maximizing the objective function. Additionally the new method is able to sustain the exploration of the search distribution to avoid premature convergence. We compare our method with state-of-the-art black-box optimization methods on standard uni-modal and multi-modal optimization functions, on simulated planar robot tasks and a complex robot ball throwing task. The proposed method considerably outperforms the existing approaches.

      @inproceedings{lirolem25741,
      pages = {153--154},
      title = {Model-based relative entropy stochastic search},
      author = {A. Abdolmaleki and R. Lioutikov and N. Lau and L. Paulo Reis and J. Peters and G. Neumann},
      journal = {GECCO 2016 Companion - Proceedings of the 2016 Genetic and Evolutionary Computation Conference},
      year = {2016},
      booktitle = {Advances in Neural Information Processing Systems (NIPS)},
      url = {http://eprints.lincoln.ac.uk/25741/},
      abstract = {Stochastic search algorithms are general black-box optimizers. Due to their ease
      of use and their generality, they have recently also gained a lot of attention in operations
      research, machine learning and policy search. Yet, these algorithms require
      a lot of evaluations of the objective, scale poorly with the problem dimension, are
      affected by highly noisy objective functions and may converge prematurely. To
      alleviate these problems, we introduce a new surrogate-based stochastic search
      approach. We learn simple, quadratic surrogate models of the objective function.
      As the quality of such a quadratic approximation is limited, we do not greedily exploit
      the learned models. The algorithm can be misled by an inaccurate optimum
      introduced by the surrogate. Instead, we use information theoretic constraints to
      bound the 'distance' between the new and old data distribution while maximizing
      the objective function. Additionally the new method is able to sustain the exploration
      of the search distribution to avoid premature convergence. We compare our
      method with state-of-the-art black-box optimization methods on standard uni-modal
      and multi-modal optimization functions, on simulated planar robot tasks and a
      complex robot ball throwing task. The proposed method considerably outperforms
      the existing approaches.},
      }
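    For intuition, the sketch below performs one surrogate-based update in the spirit of the method: sample from a Gaussian search distribution, fit a quadratic surrogate of the objective by ridge regression, and blend the surrogate into the Gaussian's natural parameters. In the actual algorithm the multipliers eta (KL bound) and omega (entropy bound) are obtained by solving a convex dual; here they are fixed constants, so treat this as an illustrative approximation rather than the published algorithm.

      import numpy as np

      rng = np.random.default_rng(0)

      def quad_features(theta):
          """Features of a quadratic model f(theta) = theta^T A theta + b^T theta + c."""
          return np.concatenate([np.outer(theta, theta).ravel(), theta, [1.0]])

      def surrogate_search_step(objective, mean, cov, n_samples=100, eta=25.0, omega=1.0):
          """One MORE-style update of a Gaussian search distribution N(mean, cov)."""
          d = len(mean)
          thetas = rng.multivariate_normal(mean, cov, size=n_samples)
          returns = np.array([objective(t) for t in thetas])

          # Fit the quadratic surrogate by ridge regression.
          Phi = np.stack([quad_features(t) for t in thetas])
          w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(Phi.shape[1]), Phi.T @ returns)
          A = w[:d * d].reshape(d, d)
          A = 0.5 * (A + A.T)                     # symmetrize the estimated curvature
          b = w[d * d:d * d + d]

          # Natural parameters of the old Gaussian.
          F_old = np.linalg.inv(cov)
          f_old = F_old @ mean

          # Blend old distribution and surrogate; eta trades off closeness to the old
          # distribution, omega sustains the exploration (entropy) of the search distribution.
          F_new = (eta * F_old - 2.0 * A) / (eta + omega)
          f_new = (eta * f_old + b) / (eta + omega)
          cov_new = np.linalg.inv(F_new)
          return cov_new @ f_new, cov_new

      # Usage on a toy quadratic objective with optimum at (1, 1):
      # mean, cov = np.zeros(2), np.eye(2)
      # for _ in range(20):
      #     mean, cov = surrogate_search_step(lambda t: -np.sum((t - 1.0) ** 2), mean, cov)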