Research

Research Fields

Robots that have to operate in real-world environments need to perform a huge variety of skills at a high level of dexterity. Preprogramming these skills for unpredictable environments appears infeasible. We therefore investigate computational learning algorithms that allow artificial agents to autonomously learn new skills from interaction with the environment, with humans or with other agents. We believe that such autonomously learning agents will have a great impact on many areas of everyday life, including service robots that help in the household or with care of the elderly, manufacturing, agricultural robotics and the disposal of dangerous materials such as nuclear waste.

An autonomously learning agent has to acquire a rich set of behaviors to achieve a variety of goals. It has to learn autonomously how to explore its environment and which features are important for making a decision. It has to identify relevant behaviors and determine when to learn new ones. Furthermore, the robot needs to learn which goals are relevant and how to re-use behaviors to achieve new goals. It needs to be easily teachable by non-expert humans and to collaborate with them. Moreover, in many applications, several robotic agents need to be coordinated.

Our research concentrates on the following sub-fields of machine learning:

Selected Papers

  • RAL & IROS 2017: Probabilistic prioritization of movement primitives

    Movement prioritization is a common approach to combine controllers of different tasks for redundant robots, where each task is assigned a priority. The priorities of the tasks are often hand-tuned or the result of an optimization, but seldom learned from data. This paper combines Bayesian task prioritization with probabilistic movement primitives to prioritize full motion sequences that are learned from demonstrations. Probabilistic movement primitives (ProMPs) can encode distributions of movements over full motion sequences and provide control laws to exactly follow these distributions. The probabilistic formulation allows for a natural application of Bayesian task prioritization. We extend the ProMP controllers with an additional feedback component that accounts for inaccuracies in following the distribution and allows for a more robust prioritization of primitives. We demonstrate how the task priorities can be obtained from imitation learning and how different primitives can be combined to solve even unseen task combinations. Due to the prioritization, our approach can efficiently learn a combination of tasks without requiring individual models per task combination. Further, our approach can adapt an existing primitive library by prioritizing additional controllers, for example for implementing obstacle avoidance. Hence, retraining the whole library is avoided in many cases. We evaluate our approach on reaching movements under constraints with redundant simulated planar robots and two physical robot platforms, the humanoid robot “iCub” and a KUKA LWR robot arm. (A minimal numerical sketch of the precision-weighted fusion behind this kind of prioritization follows the BibTeX entry below.)

    • A. Paraschos, R. Lioutikov, J. Peters, and G. Neumann, “Probabilistic prioritization of movement primitives,” IEEE Robotics and Automation Letters, vol. PP, iss. 99, 2017.
      [BibTeX] [Abstract] [Download PDF]


      @article{lirolem27901,
      booktitle = {Proceedings of the International Conference on Intelligent Robot Systems and IEEE Robotics and Automation Letters (RA-L)},
      number = {99},
      author = {Alexandros Paraschos and Rudolf Lioutikov and Jan Peters and Gerhard Neumann},
      publisher = {IEEE},
      title = {Probabilistic prioritization of movement primitives},
      month = {July},
      journal = {IEEE Robotics and Automation Letters},
      year = {2017},
      volume = {PP},
      url = {http://eprints.lincoln.ac.uk/27901/},
      abstract = {Movement prioritization is a common approach
      to combine controllers of different tasks for redundant robots,
      where each task is assigned a priority. The priorities of the
      tasks are often hand-tuned or the result of an optimization,
      but seldomly learned from data. This paper combines Bayesian
      task prioritization with probabilistic movement primitives to
      prioritize full motion sequences that are learned from demonstrations.
      Probabilistic movement primitives (ProMPs) can
      encode distributions of movements over full motion sequences
      and provide control laws to exactly follow these distributions.
      The probabilistic formulation allows for a natural application of
      Bayesian task prioritization. We extend the ProMP controllers
      with an additional feedback component that accounts for inaccuracies
      in following the distribution and allows for a more
      robust prioritization of primitives. We demonstrate how the
      task priorities can be obtained from imitation learning and
      how different primitives can be combined to solve even unseen
      task-combinations. Due to the prioritization, our approach can
      efficiently learn a combination of tasks without requiring individual
      models per task combination. Further, our approach can
      adapt an existing primitive library by prioritizing additional
      controllers, for example, for implementing obstacle avoidance.
      Hence, the need of retraining the whole library is avoided in
      many cases. We evaluate our approach on reaching movements
      under constraints with redundant simulated planar robots and
      two physical robot platforms, the humanoid robot “iCub” and
      a KUKA LWR robot arm.},
      }
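
    The core operation behind this kind of Bayesian prioritization can be illustrated with a precision-weighted product of Gaussians: each task contributes a Gaussian target whose precision is scaled by its priority, and the fused command follows in closed form. The NumPy sketch below is only a minimal illustration under this assumption; the function name combine_tasks, the priority values and all numbers are invented for the example and are not taken from the paper.

      import numpy as np

      def combine_tasks(means, covs, priorities):
          """Fuse per-task Gaussian targets; a higher priority scales up that task's precision."""
          dim = means[0].shape[0]
          precision = np.zeros((dim, dim))
          info = np.zeros(dim)
          for mu, sigma, alpha in zip(means, covs, priorities):
              lam = alpha * np.linalg.inv(sigma)   # priority-scaled task precision
              precision += lam
              info += lam @ mu
          cov = np.linalg.inv(precision)
          return cov @ info, cov                   # fused mean and covariance

      # Two conflicting 2-D task targets; the high-priority task dominates the result.
      mu, cov = combine_tasks(
          means=[np.array([0.5, 0.2]), np.array([0.1, 0.4])],
          covs=[0.01 * np.eye(2), 0.05 * np.eye(2)],
          priorities=[2.0, 0.5],
      )
      print(mu, np.diag(cov))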

  • IJRR 2017: Phase estimation for fast action recognition and trajectory generation in human–robot collaboration

    This paper proposes a method to achieve fast and fluid human–robot interaction by estimating the progress of the movement of the human. The method allows the progress, also referred to as the phase of the movement, to be estimated even when observations of the human are partial and occluded; a problem typically found when using motion capture systems in cluttered environments. By leveraging the framework of Interaction Probabilistic Movement Primitives, phase estimation makes it possible to classify the human action, and to generate a corresponding robot trajectory before the human finishes his/her movement. The method is therefore suited for semi-autonomous robots acting as assistants and coworkers. Since observations may be sparse, our method is based on computing the probability of different phase candidates to find the phase that best aligns the Interaction Probabilistic Movement Primitives with the current observations. The method is fundamentally different from approaches based on Dynamic Time Warping that must rely on a consistent stream of measurements at runtime. The resulting framework can achieve phase estimation, action recognition and robot trajectory coordination using a single probabilistic representation. We evaluated the method using a seven-degree-of-freedom lightweight robot arm equipped with a five-finger hand in single and multi-task collaborative experiments. We compare the accuracy achieved by phase estimation with our previous method based on dynamic time warping. (A toy sketch of picking the most likely phase candidate from sparse observations follows the BibTeX entry below.)

    • G. Maeda, M. Ewerton, G. Neumann, R. Lioutikov, and J. Peters, “Phase estimation for fast action recognition and trajectory generation in human–robot collaboration,” The International Journal of Robotics Research, vol. 36, iss. 13-14, pp. 1579-1594, 2017.
      [BibTeX] [Abstract] [Download PDF]


      @article{lirolem26734,
      title = {Phase estimation for fast action recognition and trajectory generation in human–robot collaboration},
      month = {December},
      pages = {1579--1594},
      author = {Guilherme Maeda and Marco Ewerton and Gerhard Neumann and Rudolf Lioutikov and Jan Peters},
      publisher = {SAGE},
      number = {13-14},
      volume = {36},
      year = {2017},
      journal = {The International Journal of Robotics Research},
      url = {http://eprints.lincoln.ac.uk/26734/},
      abstract = {This paper proposes a method to achieve fast and fluid human–robot interaction by estimating the progress of the movement of the human. The method allows the progress, also referred to as the phase of the movement, to be estimated even when observations of the human are partial and occluded; a problem typically found when using motion capture systems in cluttered environments. By leveraging on the framework of Interaction Probabilistic Movement Primitives, phase estimation makes it possible to classify the human action, and to generate a corresponding robot trajectory before the human finishes his/her movement. The method is therefore suited for semi-autonomous robots acting as assistants and coworkers. Since observations may be sparse, our method is based on computing the probability of different phase candidates to find the phase that best aligns the Interaction Probabilistic Movement Primitives with the current observations. The method is fundamentally different from approaches based on Dynamic Time Warping that must rely on a consistent stream of measurements at runtime. The resulting framework can achieve phase estimation, action recognition and robot trajectory coordination using a single probabilistic representation. We evaluated the method using a seven-degree-of-freedom lightweight robot arm equipped with a five-finger hand in single and multi-task collaborative experiments. We compare the accuracy achieved by phase estimation with our previous method based on dynamic time warping.}
      }
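
    The phase-candidate search described above can be sketched in a few lines: given a learned mean trajectory over phase and Gaussian observation noise, each candidate movement duration maps the sparse observation times to phases, and the candidate with the highest observation likelihood is selected. Everything in the snippet (the toy mean_traj primitive, the noise level, the candidate grid) is an assumption made for illustration, not the paper's implementation.

      import numpy as np

      def mean_traj(z):
          """Learned mean trajectory of the primitive as a function of phase z in [0, 1] (toy)."""
          return np.sin(np.pi * z)

      def log_lik(obs_t, obs_y, duration, noise_std=0.05):
          """Log-likelihood of sparse observations if the movement lasted `duration` seconds."""
          z = np.clip(obs_t / duration, 0.0, 1.0)        # map clock time to phase
          resid = obs_y - mean_traj(z)
          return -0.5 * np.sum((resid / noise_std) ** 2)

      # Sparse, partial observations of a movement that actually lasts about 2 s.
      rng = np.random.default_rng(0)
      obs_t = np.array([0.2, 0.5, 0.9])
      obs_y = mean_traj(obs_t / 2.0) + 0.02 * rng.normal(size=3)

      candidates = np.linspace(0.5, 4.0, 36)             # candidate durations (phase rates)
      best = candidates[np.argmax([log_lik(obs_t, obs_y, T) for T in candidates])]
      print("estimated duration:", best)                 # should be close to 2.0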

  • RAL 2017: Guiding trajectory optimization by demonstrated distributions

    Trajectory optimization is an essential tool for motion planning of robotic manipulators under multiple constraints. Optimization-based methods can explicitly optimize a trajectory by leveraging prior knowledge of the system and have been used in various applications such as collision avoidance. However, these methods often require a hand-coded cost function in order to achieve the desired behavior. Specifying such a cost function for a complex desired behavior, e.g., disentangling a rope, is a nontrivial task that is often even infeasible. Learning from demonstration (LfD) methods offer an alternative way to program robot motion. LfD methods are less dependent on analytical models and instead learn the behavior of experts implicitly from the demonstrated trajectories. However, the problem of adapting the demonstrations to new situations, e.g., avoiding newly introduced obstacles, has not been fully investigated in the literature. In this paper, we present a motion planning framework that combines the advantages of optimization-based and demonstration-based methods. We learn a distribution of trajectories demonstrated by human experts and use it to guide the trajectory optimization process. The resulting trajectory maintains the demonstrated behaviors, which are essential to performing the task successfully, while adapting the trajectory to avoid obstacles. In simulated experiments and with a real robotic system, we verify that our approach optimizes the trajectory to avoid obstacles and encodes the demonstrated behavior in the resulting trajectory. (A small sketch of trading a demonstration-likelihood term against an obstacle penalty follows the BibTeX entry below.)

    • T. Osa, A. G. M. Esfahani, R. Stolkin, R. Lioutikov, J. Peters, and G. Neumann, “Guiding trajectory optimization by demonstrated distributions,” IEEE Robotics and Automation Letters (RA-L), vol. 2, iss. 2, pp. 819-826, 2017.
      [BibTeX] [Abstract] [Download PDF]


      @article{lirolem26731,
      number = {2},
      publisher = {IEEE},
      author = {Takayuki Osa and Amir M. Ghalamzan Esfahani and Rustam Stolkin and Rudolf Lioutikov and Jan Peters and Gerhard Neumann},
      month = {January},
      title = {Guiding trajectory optimization by demonstrated distributions},
      pages = {819--826},
      year = {2017},
      journal = {IEEE Robotics and Automation Letters (RA-L)},
      volume = {2},
      abstract = {Trajectory optimization is an essential tool for motion
      planning under multiple constraints of robotic manipulators.
      Optimization-based methods can explicitly optimize a trajectory
      by leveraging prior knowledge of the system and have been used
      in various applications such as collision avoidance. However, these
      methods often require a hand-coded cost function in order to
      achieve the desired behavior. Specifying such cost function for
      a complex desired behavior, e.g., disentangling a rope, is a nontrivial
      task that is often even infeasible. Learning from demonstration
      (LfD) methods offer an alternative way to program robot
      motion. LfD methods are less dependent on analytical models
      and instead learn the behavior of experts implicitly from the
      demonstrated trajectories. However, the problem of adapting the
      demonstrations to new situations, e.g., avoiding newly introduced
      obstacles, has not been fully investigated in the literature. In this
      paper, we present a motion planning framework that combines
      the advantages of optimization-based and demonstration-based
      methods. We learn a distribution of trajectories demonstrated by
      human experts and use it to guide the trajectory optimization
      process. The resulting trajectory maintains the demonstrated
      behaviors, which are essential to performing the task successfully,
      while adapting the trajectory to avoid obstacles. In simulated
      experiments and with a real robotic system, we verify that our
      approach optimizes the trajectory to avoid obstacles and encodes
      the demonstrated behavior in the resulting trajectory},
      url = {http://eprints.lincoln.ac.uk/26731/},
      }
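
    A crude way to picture the guided optimization is a cost with two terms: the negative log-likelihood of the trajectory under a Gaussian fitted to the demonstrations, plus a penalty near obstacles. The 1-D gradient-descent sketch below uses made-up numbers (a diagonal demonstration covariance, a point obstacle, a fixed step size) and is not the optimizer used in the paper.

      import numpy as np

      T = 30
      demo_mean = np.linspace(0.0, 1.0, T)          # demonstrated waypoint means (1-D toy trajectory)
      demo_var = np.full(T, 0.02)                   # per-waypoint variance estimated from demos
      obstacle, radius = 0.5, 0.1                   # a newly introduced point obstacle

      def cost_grad(xi):
          """Gradient of: demonstration negative log-likelihood + smooth obstacle penalty."""
          g = (xi - demo_mean) / demo_var                       # pulls waypoints toward the demos
          d = xi - obstacle
          g += -45.0 * d * np.exp(-0.5 * (d / radius) ** 2)     # pushes waypoints away from the obstacle
          return g

      xi = demo_mean.copy()                         # initialize with the demonstrated mean
      for _ in range(400):                          # plain gradient descent
          xi -= 0.005 * cost_grad(xi)
      print(xi)                                     # waypoints near 0.5 detour; the rest follow the demos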


  • Auro 2017: Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks

    This paper proposes an interaction learning method for collaborative and assistive robots based on movement primitives. The method allows for both action recognition and human–robot movement coordination. It uses imitation learning to construct a mixture model of human–robot interaction primitives. This probabilistic model allows the assistive trajectory of the robot to be inferred from human observations. The method scales with the number of tasks and can learn nonlinear correlations between the trajectories that describe the human–robot interaction. We evaluated the method experimentally with a lightweight robot arm in a variety of assistive scenarios, including the coordinated handover of a bottle to a human and the collaborative assembly of a toolbox. Potential applications of the method are personal caregiver robots, control of intelligent prosthetic devices, and robot coworkers in factories. (An illustrative sketch of the underlying Gaussian conditioning step follows the BibTeX entry below.)

    • G. J. Maeda, G. Neumann, M. Ewerton, R. Lioutikov, O. Kroemer, and J. Peters, “Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks,” Autonomous Robots, vol. 41, iss. 3, pp. 593-612, 2017.
      [BibTeX] [Abstract] [Download PDF]


      @article{lirolem25744,
      pages = {593--612},
      month = {March},
      title = {Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks},
      publisher = {Springer},
      author = {G. J. Maeda and G. Neumann and M. Ewerton and R. Lioutikov and O. Kroemer and J. Peters},
      number = {3},
      volume = {41},
      journal = {Autonomous Robots},
      note = {Special Issue on Assistive and Rehabilitation Robotics},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/25744/},
      abstract = {This paper proposes an interaction learning method for collaborative and assistive robots based on movement primitives. The method allows for both action recognition and human–robot movement coordination. It uses imitation learning to construct a mixture model of human–robot interaction primitives. This probabilistic model allows the assistive trajectory of the robot to be inferred from human observations. The method is scalable in relation to the number of tasks and can learn nonlinear correlations between the trajectories that describe the human–robot interaction. We evaluated the method experimentally with a lightweight robot arm in a variety of assistive scenarios, including the coordinated handover of a bottle to a human, and the collaborative assembly of a toolbox. Potential applications of the method are personal caregiver robots, control of intelligent prosthetic devices, and robot coworkers in factories.},
      }
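
    At the heart of interaction primitives is plain Gaussian conditioning: a joint distribution over human and robot trajectory parameters is learned from demonstrations, and the robot part is inferred by conditioning on the observed human part (the paper additionally mixes several such models over tasks). The snippet below shows only that conditioning step on synthetic data; the dimensions and the synthetic correlation are illustrative assumptions.

      import numpy as np

      rng = np.random.default_rng(0)
      demos = rng.normal(size=(20, 6))              # 20 demos: 3 human + 3 robot parameters
      demos[:, 3:] = 0.8 * demos[:, :3] + 0.1 * rng.normal(size=(20, 3))  # correlated robot part

      mu = demos.mean(axis=0)
      Sigma = np.cov(demos, rowvar=False)           # joint Gaussian over [human, robot]

      h = np.array([0.3, -0.5, 1.0])                # observed human parameters
      S_hh, S_rh = Sigma[:3, :3], Sigma[3:, :3]
      robot_mean = mu[3:] + S_rh @ np.linalg.solve(S_hh, h - mu[:3])   # conditional mean
      print("inferred robot parameters:", robot_mean)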

  • ISER 2016: Experiments with hierarchical reinforcement learning of multiple grasping policies

    Robotic grasping has attracted considerable interest, but it remains a challenging task. The data-driven approach is a promising solution to the robotic grasping problem: it leverages a grasp dataset and generalizes grasps for various objects. However, these methods often depend on the quality of the given datasets, and datasets of sufficient quality are not trivial to obtain. Although reinforcement learning approaches have recently been used to achieve autonomous collection of grasp datasets, the existing algorithms are often limited to specific grasp types. In this paper, we present a framework for hierarchical reinforcement learning of grasping policies. In our framework, the lower-level hierarchy learns multiple grasp types, and the upper-level hierarchy learns a policy to select from the learned grasp types according to a point cloud of a new object. Through experiments, we validate that our approach learns grasping by constructing the grasp dataset autonomously. The experimental results show that our approach learns multiple grasping policies and generalizes the learned grasps by using local point cloud information. (A schematic sketch of such a two-level grasp-type selection policy follows the BibTeX entry below.)

    • T. Osa, J. Peters, and G. Neumann, “Experiments with hierarchical reinforcement learning of multiple grasping policies,” in Proceedings of the International Symposium on Experimental Robotics (ISER), 2016.
      [BibTeX] [Abstract] [Download PDF]


      @inproceedings{lirolem26735,
      author = {T. Osa and J. Peters and G. Neumann},
      title = {Experiments with hierarchical reinforcement learning of multiple grasping policies},
      month = {April},
      booktitle = {Proceedings of the International Symposium on Experimental Robotics (ISER)},
      year = {2016},
      url = {http://eprints.lincoln.ac.uk/26735/},
      abstract = {Robotic grasping has attracted considerable interest, but it
      still remains a challenging task. The data-driven approach is a promising
      solution to the robotic grasping problem; this approach leverages a
      grasp dataset and generalizes grasps for various objects. However, these
      methods often depend on the quality of the given datasets, which are not
      trivial to obtain with sufficient quality. Although reinforcement learning
      approaches have been recently used to achieve autonomous collection
      of grasp datasets, the existing algorithms are often limited to specific
      grasp types. In this paper, we present a framework for hierarchical reinforcement
      learning of grasping policies. In our framework, the lower-level
      hierarchy learns multiple grasp types, and the upper-level hierarchy
      learns a policy to select from the learned grasp types according to a point
      cloud of a new object. Through experiments, we validate that our approach
      learns grasping by constructing the grasp dataset autonomously.
      The experimental results show that our approach learns multiple grasping
      policies and generalizes the learned grasps by using local point cloud
      information.}
      }
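
    The division of labor described above can be pictured as a small two-level policy: the upper level scores grasp types from a point-cloud feature vector and picks one, and the lower level would then execute the corresponding learned grasp. The sketch below (names such as select_grasp, the epsilon-greedy rule and the linear value model are all illustrative assumptions) only mimics that structure, not the paper's learning algorithm.

      import numpy as np

      GRASP_TYPES = ["power", "pinch", "lateral"]
      rng = np.random.default_rng(1)
      W = rng.normal(scale=0.1, size=(len(GRASP_TYPES), 4))   # upper-level value weights

      def select_grasp(features, eps=0.1):
          """Epsilon-greedy selection over grasp-type values for a point-cloud feature vector."""
          if rng.random() < eps:
              return int(rng.integers(len(GRASP_TYPES)))
          return int(np.argmax(W @ features))

      def update(features, grasp_idx, reward, lr=0.05):
          """Move the chosen grasp type's value estimate toward the observed grasp outcome."""
          W[grasp_idx] += lr * (reward - W[grasp_idx] @ features) * features

      features = np.array([0.2, 0.7, 0.1, 1.0])                # e.g. a local shape descriptor
      g = select_grasp(features)
      update(features, g, reward=1.0)                          # success signal from the executed grasp
      print("chose", GRASP_TYPES[g])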


  • ICRA 2017: Empowered Skills

    Robot Reinforcement Learning (RL) algorithms return a policy that maximizes a global cumulative reward signal but typically do not create diverse behaviors. Hence, the policy will typically only capture a single solution of a task. However, many motor tasks have a large variety of solutions and the knowledge about these solutions can have several advantages. For example, in an adversarial setting such as robot table tennis, the lack of diversity renders the behavior predictable and hence easy for the opponent to counter. In an interactive setting such as learning from human feedback, an emphasis on diversity gives the human more opportunity to guide the robot and to keep it from getting stuck in local optima of the task. In order to increase the diversity of the learned behaviors, we leverage prior work on intrinsic motivation and empowerment. We derive a new intrinsic motivation signal by enriching the description of a task with an outcome space, representing interesting aspects of a sensorimotor stream. For example, in table tennis, the outcome space could be given by the return position and return ball speed. The intrinsic motivation is now given by the diversity of future outcomes, a concept also known as empowerment. We derive a new policy search algorithm that maximizes a trade-off between the extrinsic reward and this intrinsic motivation criterion. Experiments on a planar reaching task and simulated robot table tennis demonstrate that our algorithm can learn a diverse set of behaviors within the area of interest of the tasks. (A rough sketch of scoring behaviors by reward plus an outcome-diversity bonus follows the BibTeX entry below.)

    • A. Gabriel, R. Akrour, J. Peters, and G. Neumann, “Empowered skills,” in International Conference on Robotics and Automation (ICRA), 2017.
      [BibTeX] [Abstract] [Download PDF]


      @inproceedings{lirolem26736,
      title = {Empowered skills},
      month = {May},
      author = {A. Gabriel and R. Akrour and J. Peters and G. Neumann},
      year = {2017},
      booktitle = {International Conference on Robotics and Automation (ICRA)},
      abstract = {Robot Reinforcement Learning (RL) algorithms
      return a policy that maximizes a global cumulative reward
      signal but typically do not create diverse behaviors. Hence, the
      policy will typically only capture a single solution of a task.
      However, many motor tasks have a large variety of solutions
      and the knowledge about these solutions can have several
      advantages. For example, in an adversarial setting such as
      robot table tennis, the lack of diversity renders the behavior
      predictable and hence easy to counter for the opponent. In an
      interactive setting such as learning from human feedback, an
      emphasis on diversity gives the human more opportunity for
      guiding the robot and to avoid the latter to be stuck in local
      optima of the task. In order to increase diversity of the learned
      behaviors, we leverage prior work on intrinsic motivation and
      empowerment. We derive a new intrinsic motivation signal by
      enriching the description of a task with an outcome space,
      representing interesting aspects of a sensorimotor stream. For
      example, in table tennis, the outcome space could be given
      by the return position and return ball speed. The intrinsic
      motivation is now given by the diversity of future outcomes,
      a concept also known as empowerment. We derive a new
      policy search algorithm that maximizes a trade-off between
      the extrinsic reward and this intrinsic motivation criterion.
      Experiments on a planar reaching task and simulated robot
      table tennis demonstrate that our algorithm can learn a diverse
      set of behaviors within the area of interest of the tasks.},
      url = {http://eprints.lincoln.ac.uk/26736/},
      }
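
    The trade-off between extrinsic reward and outcome diversity can be caricatured with a simple episodic search in which each sampled behavior is scored by its task reward plus a bonus for producing an unusual outcome. The sketch below uses a CEM-style update and a distance-to-mean novelty bonus as stand-ins for the paper's information-theoretic empowerment term; it is an illustration of the objective, not the derived algorithm.

      import numpy as np

      rng = np.random.default_rng(2)

      def rollout(theta):
          """Toy task: reward for reaching (1, 1); the 'outcome' is a feature of the behavior."""
          reward = -np.sum((theta - 1.0) ** 2)
          outcome = np.array([theta[0], theta.sum()])
          return reward, outcome

      mu, std, beta = np.zeros(2), np.ones(2), 0.5      # beta trades reward against diversity
      for _ in range(50):
          thetas = mu + std * rng.normal(size=(32, 2))
          results = [rollout(t) for t in thetas]
          rewards = np.array([r for r, _ in results])
          outcomes = np.array([o for _, o in results])
          novelty = np.linalg.norm(outcomes - outcomes.mean(axis=0), axis=1)
          scores = rewards + beta * novelty             # extrinsic reward + diversity bonus
          elite = thetas[np.argsort(scores)[-8:]]       # CEM-style re-fit on the best samples
          mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
      print("final mean parameters:", mu)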

  • ICRA 2017: Layered Direct Policy Search for Learning Hierarchical Skills

    Solutions to real-world robotic tasks often require complex behaviors in high-dimensional continuous state and action spaces. Reinforcement Learning (RL) is aimed at learning such behaviors but often fails for lack of scalability. To address this issue, Hierarchical RL (HRL) algorithms leverage hierarchical policies to exploit the structure of a task. However, many HRL algorithms rely on task-specific knowledge such as a set of predefined sub-policies or sub-goals. In this paper we propose a new HRL algorithm based on information-theoretic principles to autonomously uncover a diverse set of sub-policies and their activation policies. Moreover, the learning process mirrors the policy's structure and is thus also hierarchical, consisting of a set of independent optimization problems. The hierarchical structure of the learning process allows us to control the learning rate of the sub-policies and the gating individually and to add specific information-theoretic constraints to each layer to ensure the diversification of the sub-policies. We evaluate our algorithm on two high-dimensional continuous tasks and experimentally demonstrate its ability to autonomously discover a rich set of sub-policies. (A minimal sketch of the gating-plus-sub-policies structure follows the BibTeX entry below.)

    • F. End, R. Akrour, J. Peters, and G. Neumann, “Layered direct policy search for learning hierarchical skills,” in International Conference on Robotics and Automation (ICRA), 2017.
      [BibTeX] [Abstract] [Download PDF]


      @inproceedings{lirolem26737,
      booktitle = {International Conference on Robotics and Automation (ICRA)},
      year = {2017},
      author = {F. End and R. Akrour and J. Peters and G. Neumann},
      month = {May},
      title = {Layered direct policy search for learning hierarchical skills},
      url = {http://eprints.lincoln.ac.uk/26737/},
      abstract = {Solutions to real world robotic tasks often require
      complex behaviors in high dimensional continuous state and
      action spaces. Reinforcement Learning (RL) is aimed at learning
      such behaviors but often fails for lack of scalability. To
      address this issue, Hierarchical RL (HRL) algorithms leverage
      hierarchical policies to exploit the structure of a task. However,
      many HRL algorithms rely on task specific knowledge such
      as a set of predefined sub-policies or sub-goals. In this paper
      we propose a new HRL algorithm based on information
      theoretic principles to autonomously uncover a diverse set
      of sub-policies and their activation policies. Moreover, the
      learning process mirrors the policy's structure and is thus also
      hierarchical, consisting of a set of independent optimization
      problems. The hierarchical structure of the learning process
      allows us to control the learning rate of the sub-policies and
      the gating individually and add specific information theoretic
      constraints to each layer to ensure the diversification of the sub-policies.
      We evaluate our algorithm on two high dimensional
      continuous tasks and experimentally demonstrate its ability to
      autonomously discover a rich set of sub-policies.},
      }
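
    The hierarchical policy that the algorithm optimizes consists of a gating distribution over options and one Gaussian sub-policy per option. The sketch below only shows this structure and how parameter vectors are sampled from it; the per-layer KL and entropy constraints that drive diversification in the paper are not implemented here, and all names and sizes are illustrative.

      import numpy as np

      rng = np.random.default_rng(3)

      class HierarchicalPolicy:
          """Gating distribution p(o) over options plus one Gaussian sub-policy p(theta | o) each."""

          def __init__(self, n_options, dim):
              self.gate = np.full(n_options, 1.0 / n_options)
              self.means = rng.normal(size=(n_options, dim))
              self.stds = np.ones((n_options, dim))

          def sample(self):
              o = rng.choice(len(self.gate), p=self.gate)          # upper layer: pick a sub-policy
              theta = self.means[o] + self.stds[o] * rng.normal(size=self.means.shape[1])
              return o, theta                                      # lower layer: sample its parameters

      pi = HierarchicalPolicy(n_options=3, dim=2)
      option, params = pi.sample()
      print("option", option, "parameters", params)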

  • ICML 2016: Model-Free Trajectory Optimization for Reinforcement Learning

    Many of the recent Trajectory Optimization algorithms alternate between local approximation of the dynamics and a conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local quadratic time-dependent Q-function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, demonstrating improved performance in comparison to related Trajectory Optimization algorithms that linearize the dynamics. (A short sketch of a KL-bounded closed-form Gaussian policy update against a quadratic Q-function follows the BibTeX entry below.)

    • R. Akrour, A. Abdolmaleki, H. Abdulsamad, and G. Neumann, “Model-free trajectory optimization for reinforcement learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 4342-4352.
      [BibTeX] [Abstract] [Download PDF]


      @inproceedings{lirolem25747,
      volume = {6},
      year = {2016},
      journal = {33rd International Conference on Machine Learning, ICML 2016},
      author = {R. Akrour and A. Abdolmaleki and H. Abdulsamad and G. Neumann},
      title = {Model-free trajectory optimization for reinforcement learning},
      month = {June},
      pages = {4342--4352},
      booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
      abstract = {Many of the recent Trajectory Optimization algorithms
      alternate between local approximation
      of the dynamics and conservative policy update.
      However, linearly approximating the dynamics
      in order to derive the new policy can bias the update
      and prevent convergence to the optimal policy.
      In this article, we propose a new model-free
      algorithm that backpropagates a local quadratic
      time-dependent Q-Function, allowing the derivation
      of the policy update in closed form. Our policy
      update ensures exact KL-constraint satisfaction
      without simplifying assumptions on the system
      dynamics demonstrating improved performance
      in comparison to related Trajectory Optimization
      algorithms linearizing the dynamics.},
      url = {http://eprints.lincoln.ac.uk/25747/}
      }
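
    The key step is a closed-form, KL-bounded update of a Gaussian policy against a local quadratic Q-function: the new policy is proportional to the old one times exp(Q/eta), and the temperature eta is chosen so that the KL divergence to the old policy meets a bound. The sketch below shows this single-step update for one time step with made-up Q-function coefficients; it is a simplified illustration, not the paper's full time-dependent backward pass.

      import numpy as np

      def kl_gauss(mu0, S0, mu1, S1):
          """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
          d = len(mu0)
          S1inv = np.linalg.inv(S1)
          return 0.5 * (np.trace(S1inv @ S0) + (mu1 - mu0) @ S1inv @ (mu1 - mu0)
                        - d + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

      def update(mu_q, S_q, A, b, eta):
          """New Gaussian proportional to q(a) * exp(Q(a) / eta) with Q(a) = -0.5 a'A a + a'b."""
          P_new = np.linalg.inv(S_q) + A / eta
          S_new = np.linalg.inv(P_new)
          mu_new = S_new @ (np.linalg.inv(S_q) @ mu_q + b / eta)
          return mu_new, S_new

      mu_q, S_q = np.zeros(2), np.eye(2)                                 # old policy
      A, b = np.array([[2.0, 0.0], [0.0, 1.0]]), np.array([1.0, -0.5])   # local quadratic Q
      epsilon = 0.1                                                      # KL step-size bound

      lo, hi = 1e-3, 1e3                                   # bisection on the temperature eta
      for _ in range(60):
          eta = np.sqrt(lo * hi)
          mu_n, S_n = update(mu_q, S_q, A, b, eta)
          if kl_gauss(mu_n, S_n, mu_q, S_q) > epsilon:     # update too greedy: increase eta
              lo = eta
          else:
              hi = eta
      print("new mean:", mu_n, "KL to old policy:", kl_gauss(mu_n, S_n, mu_q, S_q))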

  • Machine Learning Journal 2016: Probabilistic inference for determining options in reinforcement learning

    Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi-Markov decision process (SMDP) setting and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks. (A toy EM sketch that infers latent option activations from data follows the BibTeX entry below.)

    • C. Daniel, H. van Hoof, J. Peters, and G. Neumann, “Probabilistic inference for determining options in reinforcement learning,” Machine Learning, vol. 104, iss. 2-3, pp. 337-357, 2016.
      [BibTeX] [Abstract] [Download PDF]


      @article{lirolem25739,
      number = {2-3},
      month = {September},
      title = {Probabilistic inference for determining options in reinforcement learning},
      pages = {337--357},
      publisher = {Springer},
      author = {C. Daniel and H. van Hoof and J. Peters and G. Neumann},
      year = {2016},
      journal = {Machine Learning},
      volume = {104},
      url = {http://eprints.lincoln.ac.uk/25739/},
      abstract = {Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.}
      }
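
    Treating the active option as a latent variable means the option responsibilities can be inferred with EM. The sketch below does this for the simplest possible case, a mixture of two Gaussian sub-policies over a one-dimensional action with synthetic data; the paper's model additionally infers initiation and termination probabilities and works on full SMDPs, none of which is captured here.

      import numpy as np

      rng = np.random.default_rng(4)
      actions = np.concatenate([rng.normal(-1.0, 0.3, 200), rng.normal(1.5, 0.4, 300)])

      w = np.array([0.5, 0.5])            # option activation priors
      mu = np.array([-0.5, 0.5])          # sub-policy means
      sd = np.array([1.0, 1.0])           # sub-policy standard deviations

      for _ in range(50):
          # E-step: responsibility of each option for each observed action
          lik = w * np.exp(-0.5 * ((actions[:, None] - mu) / sd) ** 2) / sd
          resp = lik / lik.sum(axis=1, keepdims=True)
          # M-step: re-fit each sub-policy and its activation prior from the weighted data
          n = resp.sum(axis=0)
          w = n / n.sum()
          mu = (resp * actions[:, None]).sum(axis=0) / n
          sd = np.sqrt((resp * (actions[:, None] - mu) ** 2).sum(axis=0) / n)
      print("option priors", w, "means", mu, "stds", sd)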

  • JMLR 2016: Hierarchical Relative Entropy Policy Search

    Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that are strongly structured. Such task structures can be exploited by incorporating hierarchical policies that consist of gating networks and sub-policies. However, this concept has only been partially explored for real-world settings, and complete methods, derived from first principles, are needed. Real-world settings are challenging due to large and continuous state-action spaces that are prohibitive for exhaustive sampling methods. We define the problem of learning sub-policies in continuous state-action spaces as finding a hierarchical policy that is composed of a high-level gating policy to select the low-level sub-policies for execution by the agent. In order to efficiently share experience with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables, which allows for distribution of the update information between the sub-policies. We present three different variants of our algorithm, designed to be suitable for a wide variety of real-world robot learning tasks, and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons. (A compact sketch of responsibility-weighted, shared updates for gating and sub-policies follows the BibTeX entry below.)

    • C. Daniel, G. Neumann, O. Kroemer, and J. Peters, “Hierarchical relative entropy policy search,” Journal of Machine Learning Research, vol. 17, pp. 1-50, 2016.
      [BibTeX] [Abstract] [Download PDF]


      @article{lirolem25743,
      volume = {17},
      year = {2016},
      journal = {Journal of Machine Learning Research},
      month = {June},
      title = {Hierarchical relative entropy policy search},
      pages = {1--50},
      author = {C. Daniel and G. Neumann and O. Kroemer and J. Peters},
      publisher = {Massachusetts Institute of Technology Press (MIT Press) / Microtome Publishing},
      abstract = {Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that
      are strongly structured. Such task structures can be exploited by incorporating hierarchical policies
      that consist of gating networks and sub-policies. However, this concept has only been partially explored
      for real world settings and complete methods, derived from first principles, are needed. Real
      world settings are challenging due to large and continuous state-action spaces that are prohibitive
      for exhaustive sampling methods. We define the problem of learning sub-policies in continuous
      state action spaces as finding a hierarchical policy that is composed of a high-level gating policy to
      select the low-level sub-policies for execution by the agent. In order to efficiently share experience
      with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables
      which allows for distribution of the update information between the sub-policies. We present three
      different variants of our algorithm, designed to be suitable for a wide variety of real world robot
      learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several
      simulations and comparisons.},
      url = {http://eprints.lincoln.ac.uk/25743/}
      }
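
    Inter-policy learning can be illustrated by letting every sample update every sub-policy, weighted by the option responsibility times an exponentiated reward, before re-fitting the gating and the sub-policies from the weighted data. The toy parameter-space example below (a bimodal reward, a fixed temperature eta, weighted maximum-likelihood refits) is a rough sketch of this idea, not the HiREPS update equations.

      import numpy as np

      rng = np.random.default_rng(5)

      def reward(theta):
          """Toy bimodal task: two equally good solutions at (2, 2) and (-2, -2)."""
          return -min(np.sum((theta - 2.0) ** 2), np.sum((theta + 2.0) ** 2))

      gate = np.array([0.5, 0.5])                   # gating policy over two sub-policies
      means = rng.normal(size=(2, 2))
      stds = np.ones((2, 2))
      eta = 1.0                                     # temperature of the reward weighting

      for _ in range(100):
          opts = rng.choice(2, size=64, p=gate)
          thetas = means[opts] + stds[opts] * rng.normal(size=(64, 2))
          R = np.array([reward(t) for t in thetas])
          # responsibilities p(o | theta) under the current mixture policy
          lik = gate * np.prod(np.exp(-0.5 * ((thetas[:, None, :] - means) / stds) ** 2) / stds, axis=2)
          resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-300)
          w = resp * np.exp((R - R.max()) / eta)[:, None]      # reward-weighted responsibilities
          n = w.sum(axis=0) + 1e-9
          gate = n / n.sum()                                   # weighted re-fit of gating ...
          means = (w[:, :, None] * thetas[:, None, :]).sum(axis=0) / n[:, None]
          stds = np.sqrt((w[:, :, None] * (thetas[:, None, :] - means) ** 2).sum(axis=0) / n[:, None]) + 1e-3
      print("gating", gate, "sub-policy means", means)         # ... and of each sub-policy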