Reinforcement Learning

We develop new, efficient policy search algorithms based on information-geometric principles. These principles let us control the greediness of the policy update, so that the search commits to good solutions without letting the variance of the search distribution collapse prematurely. Information geometry thus provides a principled way to specify the exploration-exploitation trade-off in continuous-action reinforcement learning. A small illustrative sketch of this idea follows below.
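
To make the idea concrete, here is a minimal, generic sketch of an episodic, KL-bounded (REPS-style) update of a Gaussian search distribution. It illustrates the principle only and is not the implementation used in the papers below; the dual-based temperature search, the KL bound epsilon, and the toy objective are assumptions of this sketch.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def reps_weights(returns, epsilon=1.0):
        """Episodic, REPS-style weighting: the temperature eta is found by minimizing
        the dual, so that the KL divergence between the weighted and the uniform
        sample distribution stays close to the bound epsilon."""
        R = returns - returns.max()                      # shift for numerical stability
        dual = lambda eta: eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))
        eta = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded").x
        w = np.exp(R / eta)
        return w / w.sum()

    def update_search_distribution(samples, returns, epsilon=1.0):
        """Weighted maximum-likelihood update of a Gaussian search distribution."""
        w = reps_weights(returns, epsilon)
        mean = w @ samples
        diff = samples - mean
        cov = (diff * w[:, None]).T @ diff
        return mean, cov

    # usage: sample parameters, evaluate their returns, update the Gaussian
    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(2), np.eye(2), size=50)
    returns = -np.sum((samples - 1.0) ** 2, axis=1)      # toy objective
    mean, cov = update_search_distribution(samples, returns, epsilon=0.5)

The bound epsilon plays the role of the greediness parameter: a small epsilon keeps the update conservative and preserves exploration, while a large epsilon moves aggressively towards the best samples.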

Survey papers:

  • M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, iss. 1-2, pp. 388-403, 2013.

    Policy search is a subfield in reinforcement learning which focuses on finding good parameters for a given policy parametrization. It is well suited for robotics as it can cope with high-dimensional state and action spaces, one of the main challenges in robot learning. We review recent successes of both model-free and model-based policy search in robot learning. Model-free policy search is a general approach to learn policies based on sampled trajectories. We classify model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy and present a unified view on existing algorithms. Learning a policy is often easier than learning an accurate forward model, and, hence, model-free methods are more frequently used in practice. However, for each sampled trajectory, it is necessary to interact with the robot, which can be time consuming and challenging in practice. Model-based policy search addresses this problem by first learning a simulator of the robot's dynamics from data. Subsequently, the simulator generates trajectories that are used for policy learning. For both model-free and model-based policy search methods, we review their respective properties and their applicability to robotic systems.

    @article{lirolem28029,
    author = {M. P. Deisenroth and G. Neumann and J. Peters},
    title = {A survey on policy search for robotics},
    journal = {Foundations and Trends in Robotics},
    volume = {2},
    number = {1-2},
    pages = {388--403},
    month = {August},
    year = {2013},
    publisher = {Now Publishers},
    url = {http://eprints.lincoln.ac.uk/28029/}
    }

  • C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A Survey of Preference-Based Reinforcement Learning Methods,” Journal of Machine Learning Research, vol. 18, iss. 136, pp. 1-46, 2017.

    @Article{JMLR:v18:16-634,
    Title = {A Survey of Preference-Based Reinforcement Learning Methods},
    Author = {Christian Wirth and Riad Akrour and Gerhard Neumann and Johannes F{\"u}rnkranz},
    Journal = {Journal of Machine Learning Research},
    Year = {2017},
    Number = {136},
    Pages = {1-46},
    Volume = {18},
    Url = {http://jmlr.org/papers/v18/16-634.html}
    }

  • C. Dann, G. Neumann, and J. Peters, “Policy evaluation with temporal differences: a survey and comparison,” Journal of Machine Learning Research, vol. 15, pp. 809-883, 2014.

    Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.

    @article{lirolem25768,
    author = {C. Dann and G. Neumann and J. Peters},
    title = {Policy evaluation with temporal differences: a survey and comparison},
    journal = {Journal of Machine Learning Research},
    volume = {15},
    pages = {809--883},
    month = {March},
    year = {2014},
    publisher = {Massachusetts Institute of Technology Press (MIT Press) / Microtome Publishing},
    url = {http://eprints.lincoln.ac.uk/25768/}
    }
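
    As a minimal illustration of the family of methods compared in this survey, the sketch below implements tabular TD(0) policy evaluation; the toy chain task, step size and discount are invented for the example, and the LSTD/GTD-style variants discussed in the paper differ in how they use the same temporal-difference error.

      import numpy as np

      def td0_evaluation(transitions, n_states, alpha=0.1, gamma=0.99):
          """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
          V = np.zeros(n_states)
          for s, r, s_next, done in transitions:
              target = r if done else r + gamma * V[s_next]
              V[s] += alpha * (target - V[s])
          return V

      # usage on a toy 3-state chain 0 -> 1 -> 2 (terminal), reward 1 on reaching state 2
      transitions = [(0, 0.0, 1, False), (1, 1.0, 2, True)] * 200
      print(td0_evaluation(transitions, n_states=3))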

A non-exhaustive list of papers can be found below.

Papers:

  • IROS 2017: Hybrid control trajectory optimization under uncertainty

    Trajectory optimization is a fundamental problem in robotics. While optimization of continuous control trajectories is well developed, many applications require both discrete and continuous, i.e. hybrid controls. Finding an optimal sequence of hybrid controls is challenging due to the exponential explosion of discrete control combinations. Our method, based on Differential Dynamic Programming (DDP), circumvents this problem by incorporating discrete actions inside DDP: we first optimize continuous mixtures of discrete actions, and, subsequently force the mixtures into fully discrete actions. Moreover, we show how our approach can be extended to partially observable Markov decision processes (POMDPs) for trajectory planning under uncertainty. We validate the approach in a car driving problem where the robot has to switch discrete gears and in a box pushing application where the robot can switch the side of the box to push. The pose and the friction parameters of the pushed box are initially unknown and only indirectly observable.

    • J. Pajarinen, V. Kyrki, M. Koval, S. Srinivasa, J. Peters, and G. Neumann, “Hybrid control trajectory optimization under uncertainty,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.

      @inproceedings{lirolem28257,
      author = {J. Pajarinen and V. Kyrki and M. Koval and S. Srinivasa and J. Peters and G. Neumann},
      title = {Hybrid control trajectory optimization under uncertainty},
      booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
      month = {September},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/28257/}
      }
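
    The following toy sketch illustrates only the relaxation idea described above: a discrete control (a "gear") is replaced by a continuous softmax mixture, optimized jointly with the continuous control, and finally forced back to a discrete choice. The one-dimensional dynamics, gains, cost and the finite-difference optimizer are invented for illustration and are not the DDP- or POMDP-based machinery of the paper.

      import numpy as np

      # A discrete "gear" choice is relaxed into a softmax mixture, optimized jointly
      # with a continuous control, and finally rounded back to the best discrete option.
      GAINS = np.array([0.5, 2.0])                        # assumed effect of the two gears

      def rollout_cost(params, x0=5.0, horizon=5):
          u, logits = params[0], params[1:]
          w = np.exp(logits) / np.exp(logits).sum()       # continuous mixture over gears
          x, cost = x0, 0.0
          for _ in range(horizon):
              x = x - (w @ GAINS) * u                     # relaxed hybrid dynamics
              cost += x ** 2 + 0.1 * u ** 2
          return cost

      def optimize(params, steps=2000, lr=1e-3, eps=1e-5):
          for _ in range(steps):                          # finite-difference gradient descent
              grad = np.array([(rollout_cost(params + eps * e) - rollout_cost(params - eps * e))
                               / (2 * eps) for e in np.eye(len(params))])
              params = params - lr * grad
          return params

      params = optimize(np.array([0.0, 0.0, 0.0]))
      gear = int(np.argmax(params[1:]))                   # force the mixture back to a discrete gear
      print("continuous control:", params[0], "chosen gear:", gear)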

  • IJCAI 2017: Contextual CMA-ES

    Many stochastic search algorithms are designed to optimize a fixed objective function to learn a task, i.e., if the objective function changes slightly, for example, due to a change in the situation or context of the task, relearning is required to adapt to the new context. For instance, if we want to learn a kicking movement for a soccer robot, we have to relearn the movement for different ball locations. Such relearning is undesired as it is highly inefficient and many applications require a fast adaptation to a new context/situation. Therefore, we investigate contextual stochastic search algorithms that can learn multiple, similar tasks simultaneously. Current contextual stochastic search methods are based on policy search algorithms and suffer from premature convergence and the need for parameter tuning. In this paper, we extend the well-known CMA-ES algorithm to the contextual setting and illustrate its performance on several contextual tasks. Our new algorithm, called contextual CMA-ES, leverages contextual learning while it preserves all the features of standard CMA-ES such as stability, avoidance of premature convergence, step size control and a minimal amount of parameter tuning.

    • A. Abdolmaleki, B. Price, N. Lau, P. Reis, and G. Neumann, “Contextual CMA-ES,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017.

      @inproceedings{lirolem28141,
      author = {A. Abdolmaleki and B. Price and N. Lau and P. Reis and G. Neumann},
      title = {Contextual CMA-ES},
      booktitle = {International Joint Conference on Artificial Intelligence (IJCAI)},
      month = {August},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/28141/}
      }
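
    A rough sketch of the contextual idea: the mean of the Gaussian search distribution becomes a linear function of the context and is fitted by weighted regression on the sampled returns. The soft exponential weighting, the toy objective and all names are assumptions of this sketch; CMA-ES's rank-based weights, evolution path and step-size control are omitted.

      import numpy as np

      rng = np.random.default_rng(0)
      dim_theta, dim_ctx, n = 3, 2, 100
      W = np.zeros((dim_theta, dim_ctx + 1))              # linear context-to-mean map (with bias)
      cov = np.eye(dim_theta)

      def reward(theta, ctx):
          # invented objective whose optimum depends linearly on the context
          target = np.array([ctx[0], ctx[1], 1.0])
          return -np.sum((theta - target) ** 2)

      for _ in range(30):                                 # contextual search loop
          ctx = rng.uniform(-1, 1, size=(n, dim_ctx))
          feats = np.hstack([ctx, np.ones((n, 1))])
          thetas = feats @ W.T + rng.multivariate_normal(np.zeros(dim_theta), cov, size=n)
          R = np.array([reward(t, c) for t, c in zip(thetas, ctx)])
          w = np.exp((R - R.max()) / max(R.std(), 1e-8))  # crude stand-in for rank-based weights
          w /= w.sum()
          A = feats * w[:, None]                          # weighted regression for the mean map
          W = np.linalg.solve(feats.T @ A + 1e-6 * np.eye(dim_ctx + 1), A.T @ thetas).T
          diff = thetas - feats @ W.T
          cov = (diff * w[:, None]).T @ diff + 1e-6 * np.eye(dim_theta)

      print("learned context-to-mean map:\n", W)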

  • JMLR 2017: Non-parametric Policy Search with Limited Information Loss.

    Learning complex control policies from non-linear and redundant sensory input is an important challenge for reinforcement learning algorithms. Non-parametric methods that approximate value functions or transition models can address this problem, by adapting to the complexity of the dataset. Yet, many current non-parametric approaches rely on unstable greedy maximization of approximate value functions, which might lead to poor convergence or oscillations in the policy update. A more robust policy update can be obtained by limiting the information loss between successive state-action distributions. In this paper, we develop a policy search algorithm with policy updates that are both robust and non-parametric. Our method can learn non-parametric control policies for infinite horizon continuous Markov decision processes with non-linear and redundant sensory representations.
    We investigate how we can use approximations of the kernel function to reduce the time requirements of the demanding non-parametric computations. In our experiments, we show the strong performance of the proposed method, and how it can be approximated efficiently. Finally, we show that our algorithm can learn a real-robot underpowered swing-up task directly from image data.

    • H. van Hoof, G. Neumann, and J. Peters, “Non-parametric policy search with limited information loss,” Journal of Machine Learning Research, 2018.

      @article{lirolem28020,
      author = {Herke van Hoof and Gerhard Neumann and Jan Peters},
      title = {Non-parametric policy search with limited information loss},
      journal = {Journal of Machine Learning Research},
      month = {December},
      year = {2018},
      url = {http://eprints.lincoln.ac.uk/28020/}
      }
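
    The kernel-approximation step mentioned above can be illustrated with a standard random-Fourier-feature construction for an RBF kernel, which replaces the expensive kernel matrix by an explicit low-dimensional feature map. This is a generic example of the technique, not necessarily the specific approximation used in the paper; the bandwidth and feature count are arbitrary.

      import numpy as np

      def random_fourier_features(X, n_features=2000, bandwidth=1.0, seed=0):
          """Approximate the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
          by an explicit random-feature map, so that Z @ Z.T approximates the kernel matrix."""
          rng = np.random.default_rng(seed)
          W = rng.normal(scale=1.0 / bandwidth, size=(X.shape[1], n_features))
          b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
          return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

      # quick check of the approximation quality on a few random points
      X = np.random.default_rng(1).normal(size=(5, 3))
      Z = random_fourier_features(X)
      K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
      print(np.abs(Z @ Z.T - K_exact).max())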

  • ICML 2017: Local Bayesian Optimization

    Bayesian optimization is renowned for its sample efficiency but its application to higher dimensional tasks is impeded by its focus on global optimization. To scale to higher dimensional problems, we leverage the sample efficiency of Bayesian optimization in a local context. The optimization of the acquisition function is restricted to the vicinity of a Gaussian search distribution which is moved towards high value areas of the objective. The proposed information-theoretic update of the search distribution results in a Bayesian interpretation of local stochastic search: the search distribution encodes prior knowledge on the optimum’s location and is weighted at each iteration by the likelihood of this location’s optimality. We demonstrate the effectiveness of our algorithm on several benchmark objective functions as well as a continuous robotic task in which an informative prior is obtained by imitation learning.

    • R. Akrour, D. Sorokin, J. Peters, and G. Neumann, “Local Bayesian optimization of motor skills,” in International Conference on Machine Learning (ICML), 2017.

      @inproceedings{lirolem27902,
      author = {R. Akrour and D. Sorokin and J. Peters and G. Neumann},
      title = {Local Bayesian optimization of motor skills},
      booktitle = {International Conference on Machine Learning (ICML)},
      month = {August},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/27902/}
      }
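
    Below is a small numpy-only sketch of the "local" ingredient: a GP surrogate whose acquisition function is only evaluated on candidates drawn from a Gaussian search distribution, which is then moved towards the best observations. The moment-matching update of that distribution is a crude stand-in for the information-theoretic update derived in the paper, and the 1-D objective, UCB acquisition and all constants are invented.

      import numpy as np

      rng = np.random.default_rng(0)

      def f(x):                                           # invented 1-D objective to be maximized
          return -np.sin(3.0 * x) - x ** 2 + 0.7 * x

      def rbf(A, B, ls=0.3):
          return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

      def gp_posterior(X_tr, y_tr, X_te, noise=1e-6):
          K = rbf(X_tr, X_tr) + noise * np.eye(len(X_tr))
          Ks = rbf(X_tr, X_te)
          mu = Ks.T @ np.linalg.solve(K, y_tr)
          var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
          return mu, np.sqrt(np.maximum(var, 1e-12))

      mean, std = 0.0, 1.0                                # local Gaussian search distribution
      X = rng.normal(mean, std, 5)
      y = f(X)
      for _ in range(15):
          cand = rng.normal(mean, std, 200)               # candidates only near the search distribution
          mu, sigma = gp_posterior(X, y, cand)
          x_new = cand[np.argmax(mu + 2.0 * sigma)]       # UCB acquisition, optimized locally
          X, y = np.append(X, x_new), np.append(y, f(x_new))
          top = X[np.argsort(y)[-5:]]                     # crude stand-in for the paper's
          mean, std = top.mean(), max(top.std(), 0.05)    # information-theoretic distribution update
      print("best x:", X[np.argmax(y)], "f(x):", y.max())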

  • GECCO 2017: Deriving and Improving CMA-ES with Information-Geometric Trust Regions

    CMA-ES is one of the most popular stochastic search algorithms. It performs favourably in many tasks without the need of extensive parameter tuning. The algorithm has many beneficial properties, including automatic step-size adaptation, efficient covariance updates that incorporate the current samples as well as the evolution path, and its invariance properties. Its update rules are composed of well-established heuristics where the theoretical foundations of some of these rules are also well understood. In this paper we fully derive all CMA-ES update rules within the framework of expectation-maximisation-based stochastic search algorithms using information-geometric trust regions. We show that the use of the trust region results in similar updates to CMA-ES for the mean and the covariance matrix while it allows for the derivation of an improved update rule for the step-size. Our new algorithm, Trust-Region Covariance Matrix Adaptation Evolution Strategy (TR-CMA-ES), is fully derived from first order optimization principles and performs favourably compared to the standard CMA-ES algorithm.

    • A. Abdolmaleki, B. Price, N. Lau, L. P. Reis, and G. Neumann, “Deriving and improving CMA-ES with Information geometric trust regions,” in The Genetic and Evolutionary Computation Conference (GECCO 2017), 2017.

      @inproceedings{lirolem27056,
      author = {Abbas Abdolmaleki and Bob Price and Nuno Lau and Luis Paulo Reis and Gerhard Neumann},
      title = {Deriving and improving CMA-ES with Information geometric trust regions},
      booktitle = {The Genetic and Evolutionary Computation Conference (GECCO 2017)},
      month = {July},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/27056/}
      }
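
    The trust-region ingredient can be sketched as follows: move the Gaussian search distribution towards the weighted maximum-likelihood estimate, but only as far as a KL bound allows. The bisection on the interpolation factor is an illustrative shortcut; the paper derives the constrained update (including the improved step-size rule) within an EM framework rather than by line search.

      import numpy as np

      def kl_gaussian(m0, S0, m1, S1):
          """KL( N(m1, S1) || N(m0, S0) ) between two multivariate Gaussians."""
          d = len(m0)
          S0_inv = np.linalg.inv(S0)
          diff = m0 - m1
          return 0.5 * (np.trace(S0_inv @ S1) + diff @ S0_inv @ diff - d
                        + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

      def trust_region_update(mean, cov, samples, weights, epsilon=0.1):
          """Move towards the weighted maximum-likelihood estimate, but only as far
          as the KL trust region of size epsilon allows (bisection on the step)."""
          w = weights / weights.sum()
          ml_mean = w @ samples
          diff = samples - ml_mean
          ml_cov = (diff * w[:, None]).T @ diff + 1e-9 * np.eye(len(mean))
          lo, hi = 0.0, 1.0
          for _ in range(30):
              a = 0.5 * (lo + hi)
              m = (1 - a) * mean + a * ml_mean
              S = (1 - a) * cov + a * ml_cov
              lo, hi = (a, hi) if kl_gaussian(mean, cov, m, S) <= epsilon else (lo, a)
          return (1 - lo) * mean + lo * ml_mean, (1 - lo) * cov + lo * ml_cov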

  • ICAPS 2017: State-regularized policy search for linearized dynamical systems

    Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of feedback controllers by taking advantage of local approximations of model dynamics and cost functions. Stability of the policy update is a major issue for these methods, rendering them hard to apply to highly nonlinear systems. Recent approaches combine classical Stochastic Optimal Control methods with information-theoretic bounds to control the step-size of the policy update and could even be used to train nonlinear deep control policies. These methods bound the relative entropy between the new and the old policy to ensure a stable policy update. However, despite the bound in policy space, the state distributions of two consecutive policies can still differ significantly, rendering the used local approximate models invalid. To alleviate this issue we propose enforcing a relative entropy constraint not only on the policy update, but also on the update of the state distribution, around which the dynamics and cost are being approximated. We present a derivation of the closed-form policy update and show that our approach outperforms related methods on two nonlinear and highly dynamic simulated systems.

    • H. Abdulsamad, O. Arenz, J. Peters, and G. Neumann, “State-regularized policy search for linearized dynamical systems,” in Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), 2017.

      @inproceedings{lirolem27055,
      author = {Hany Abdulsamad and Oleg Arenz and Jan Peters and Gerhard Neumann},
      title = {State-regularized policy search for linearized dynamical systems},
      booktitle = {Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS)},
      month = {June},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/27055/}
      }
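
    A toy sketch of the additional constraint proposed above: for linear dynamics and linear feedback controllers, propagate the Gaussian state distribution induced by the old and by the new controller and only accept updates whose state-distribution KL stays below a bound. The dynamics, gains and threshold are invented, and the paper's closed-form constrained update is not reproduced here.

      import numpy as np

      A = np.array([[1.0, 0.1], [0.0, 1.0]])              # assumed linear dynamics
      B = np.array([[0.0], [0.1]])

      def state_distribution(K, mu0, S0, noise=1e-3, horizon=20):
          """Gaussian state marginal after `horizon` steps under u = K x."""
          mu, S = mu0.copy(), S0.copy()
          A_cl = A + B @ K                                # closed-loop dynamics
          for _ in range(horizon):
              mu = A_cl @ mu
              S = A_cl @ S @ A_cl.T + noise * np.eye(2)
          return mu, S

      def kl_gauss(m1, S1, m0, S0):
          d = len(m0)
          S0_inv = np.linalg.inv(S0)
          diff = m0 - m1
          return 0.5 * (np.trace(S0_inv @ S1) + diff @ S0_inv @ diff - d
                        + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

      mu0, S0 = np.array([1.0, 0.0]), 0.01 * np.eye(2)
      K_old = np.array([[-0.5, -0.5]])
      K_new = np.array([[-0.9, -0.7]])                    # candidate policy update
      state_kl = kl_gauss(*state_distribution(K_new, mu0, S0), *state_distribution(K_old, mu0, S0))
      print("KL between induced state distributions:", state_kl)
      print("accept update:", state_kl <= 0.5)            # the additional state-space constraint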

  • ISER 2016: Experiments with hierarchical reinforcement learning of multiple grasping policies

    Robotic grasping has attracted considerable interest, but it still remains a challenging task. The data-driven approach is a promising solution to the robotic grasping problem; this approach leverages a grasp dataset and generalizes grasps for various objects. However, these methods often depend on the quality of the given datasets, which are not trivial to obtain with sufficient quality. Although reinforcement learning approaches have been recently used to achieve autonomous collection of grasp datasets, the existing algorithms are often limited to specific grasp types. In this paper, we present a framework for hierarchical reinforcement learning of grasping policies. In our framework, the lower-level hierarchy learns multiple grasp types, and the upper-level hierarchy learns a policy to select from the learned grasp types according to a point cloud of a new object. Through experiments, we validate that our approach learns grasping by constructing the grasp dataset autonomously. The experimental results show that our approach learns multiple grasping policies and generalizes the learned grasps by using local point cloud information.

    • T. Osa, J. Peters, and G. Neumann, “Experiments with hierarchical reinforcement learning of multiple grasping policies,” in Proceedings of the International Symposium on Experimental Robotics (ISER), 2016.

      @inproceedings{lirolem26735,
      author = {T. Osa and J. Peters and G. Neumann},
      title = {Experiments with hierarchical reinforcement learning of multiple grasping policies},
      booktitle = {Proceedings of the International Symposium on Experimental Robotics (ISER)},
      month = {April},
      year = {2016},
      url = {http://eprints.lincoln.ac.uk/26735/}
      }
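
    A very loose sketch of the upper level of such a hierarchy: one success model per learned grasp type, with the grasp type chosen greedily (plus exploration) from features of the observed point cloud. The feature vector, the linear success models and the simulated outcomes are all invented, and the lower-level grasp learning is not modelled at all.

      import numpy as np

      rng = np.random.default_rng(0)
      n_grasp_types, n_feats = 3, 8                       # illustrative numbers only
      W = np.zeros((n_grasp_types, n_feats))              # one linear success model per grasp type

      def select_grasp(features, eps=0.1):
          """Upper-level policy: pick the grasp type with the highest predicted success."""
          if rng.random() < eps:
              return int(rng.integers(n_grasp_types))
          return int(np.argmax(W @ features))

      def update(grasp_type, features, success, lr=0.05):
          """Online regression of the success signal reported by the lower-level grasp."""
          err = success - W[grasp_type] @ features
          W[grasp_type] += lr * err * features

      # usage with made-up point-cloud descriptors and simulated grasp outcomes
      for _ in range(500):
          feats = rng.normal(size=n_feats)                # stand-in for local point-cloud features
          g = select_grasp(feats)
          success = float(rng.random() < 1.0 / (1.0 + np.exp(-feats[g])))
          update(g, feats, success)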


  • ICRA 2017: Empowered Skills

    Robot Reinforcement Learning (RL) algorithms return a policy that maximizes a global cumulative reward signal but typically do not create diverse behaviors. Hence, the policy will typically only capture a single solution of a task. However, many motor tasks have a large variety of solutions and the knowledge about these solutions can have several advantages. For example, in an adversarial setting such as robot table tennis, the lack of diversity renders the behavior predictable and hence easy to counter for the opponent. In an interactive setting such as learning from human feedback, an emphasis on diversity gives the human more opportunity for guiding the robot and helps avoid it getting stuck in local optima of the task. In order to increase diversity of the learned behaviors, we leverage prior work on intrinsic motivation and empowerment. We derive a new intrinsic motivation signal by enriching the description of a task with an outcome space, representing interesting aspects of a sensorimotor stream. For example, in table tennis, the outcome space could be given by the return position and return ball speed. The intrinsic motivation is now given by the diversity of future outcomes, a concept also known as empowerment. We derive a new policy search algorithm that maximizes a trade-off between the extrinsic reward and this intrinsic motivation criterion. Experiments on a planar reaching task and simulated robot table tennis demonstrate that our algorithm can learn a diverse set of behaviors within the area of interest of the tasks.

    • A. Gabriel, R. Akrour, J. Peters, and G. Neumann, “Empowered skills,” in International Conference on Robotics and Automation (ICRA), 2017.

      @inproceedings{lirolem26736,
      author = {A. Gabriel and R. Akrour and J. Peters and G. Neumann},
      title = {Empowered skills},
      booktitle = {International Conference on Robotics and Automation (ICRA)},
      month = {May},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/26736/}
      }
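
    The reward/diversity trade-off can be sketched by augmenting the sample weights of an episodic policy search (as in the sketch at the top of this page) with an outcome-diversity bonus. The nearest-neighbour bonus below is only a crude proxy for the empowerment signal derived in the paper; beta and eta are illustrative.

      import numpy as np

      def diversity_bonus(outcomes, k=5):
          """Crude proxy for an empowerment-style signal: the mean distance of each
          sample's outcome to its k nearest neighbours (diverse outcomes score higher)."""
          d = np.linalg.norm(outcomes[:, None, :] - outcomes[None, :, :], axis=-1)
          return np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)

      def empowered_weights(rewards, outcomes, beta=1.0, eta=1.0):
          """Trade off extrinsic reward against outcome diversity in the policy update."""
          score = rewards + beta * diversity_bonus(outcomes)
          w = np.exp((score - score.max()) / eta)
          return w / w.sum()

    These weights would then replace the purely reward-based weights in an episodic update; beta controls how strongly diverse outcomes (e.g. return position and speed in table tennis) are preferred over pure reward.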

  • ICRA 2017: Layered Direct Policy Search for Learning Hierarchical Skills

    Solutions to real world robotic tasks often require complex behaviors in high dimensional continuous state and action spaces. Reinforcement Learning (RL) is aimed at learning such behaviors but often fails for lack of scalability. To address this issue, Hierarchical RL (HRL) algorithms leverage hierarchical policies to exploit the structure of a task. However, many HRL algorithms rely on task specific knowledge such as a set of predefined sub-policies or sub-goals. In this paper we propose a new HRL algorithm based on information theoretic principles to autonomously uncover a diverse set of sub-policies and their activation policies. Moreover, the learning process mirrors the policy's structure and is thus also hierarchical, consisting of a set of independent optimization problems. The hierarchical structure of the learning process allows us to control the learning rate of the sub-policies and the gating individually and add specific information theoretic constraints to each layer to ensure the diversification of the sub-policies. We evaluate our algorithm on two high dimensional continuous tasks and experimentally demonstrate its ability to autonomously discover a rich set of sub-policies.

    • F. End, R. Akrour, J. Peters, and G. Neumann, “Layered direct policy search for learning hierarchical skills,” in International Conference on Robotics and Automation (ICRA), 2017.

      @inproceedings{lirolem26737,
      author = {F. End and R. Akrour and J. Peters and G. Neumann},
      title = {Layered direct policy search for learning hierarchical skills},
      booktitle = {International Conference on Robotics and Automation (ICRA)},
      month = {May},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/26737/}
      }
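
    An EM-flavoured sketch of the layered idea: a mixture (gating plus Gaussian sub-policies) is updated from reward-weighted responsibilities, with separate temperatures for the gating and for the sub-policies to mimic per-layer step control. The fixed temperatures stand in for the paper's per-layer information-theoretic constraints, and the diversity constraints are omitted.

      import numpy as np
      from scipy.stats import multivariate_normal

      def hierarchical_update(samples, rewards, gating, means, covs, eta_gate=5.0, eta_sub=1.0):
          """One EM-style update of a mixture search distribution: the gating and the
          sub-policies get separate temperatures, mimicking per-layer step control."""
          K = len(means)
          resp = np.array([gating[k] * multivariate_normal.pdf(samples, means[k], covs[k])
                           for k in range(K)]).T          # (n, K) responsibilities
          resp /= resp.sum(axis=1, keepdims=True)
          shifted = rewards - rewards.max()
          w_gate = resp * np.exp(shifted / eta_gate)[:, None]
          w_sub = resp * np.exp(shifted / eta_sub)[:, None]
          new_gating = w_gate.sum(axis=0) / w_gate.sum()
          new_means, new_covs = [], []
          for k in range(K):
              wk = w_sub[:, k] / w_sub[:, k].sum()
              m = wk @ samples
              diff = samples - m
              new_means.append(m)
              new_covs.append((diff * wk[:, None]).T @ diff + 1e-6 * np.eye(samples.shape[1]))
          return new_gating, new_means, new_covs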

  • ICML 2016: Model-Free Trajectory Optimization for Reinforcement Learning

    Many of the recent Trajectory Optimization algorithms alternate between local approximation of the dynamics and conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy.
    In this article, we propose a new model-free algorithm that backpropagates a local quadratic time-dependent Q-Function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics demonstrating improved performance in comparison to related Trajectory Optimization algorithms linearizing the dynamics.

    • R. Akrour, A. Abdolmaleki, H. Abdulsamad, and G. Neumann, “Model-free trajectory optimization for reinforcement learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 4342-4352.

      @inproceedings{lirolem25747,
      author = {R. Akrour and A. Abdolmaleki and H. Abdulsamad and G. Neumann},
      title = {Model-free trajectory optimization for reinforcement learning},
      booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
      volume = {6},
      pages = {4342--4352},
      month = {June},
      year = {2016},
      url = {http://eprints.lincoln.ac.uk/25747/}
      }
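
    The backward pass described above can be sketched as follows: regress a local quadratic Q-function at each time step from sampled rollouts, using the reward plus the value of the already-fitted later Q-function as the target, and read off the control in closed form from the quadratic's coefficients. The greedy maximizer below ignores the KL constraint that the paper uses to temper the update, and the single-state/single-control parametrization is an assumption of this sketch.

      import numpy as np

      def quad_features(x, u):
          return np.stack([np.ones_like(x), x, u, x * x, u * u, x * u], axis=-1)

      def fit_q_backwards(X, U, R, reg=1e-6):
          """Fit a local quadratic Q_t(x, u) at every time step by regressing
          r_t + V_{t+1} onto quadratic features, going backwards in time.
          X, U, R contain sampled rollouts and have shape (n_rollouts, horizon)."""
          n, T = X.shape
          coeffs, next_v = [None] * T, np.zeros(n)
          for t in reversed(range(T)):
              Phi = quad_features(X[:, t], U[:, t])
              target = R[:, t] + next_v                   # r_t plus value of the later Q-function
              c = np.linalg.solve(Phi.T @ Phi + reg * np.eye(6), Phi.T @ target)
              coeffs[t] = c
              u_star = -(c[2] + c[5] * X[:, t]) / (2 * c[4])   # closed-form maximizer (assumes c[4] < 0)
              next_v = quad_features(X[:, t], u_star) @ c      # becomes the target one step earlier
          return coeffs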