Publications

2018

  • H. van Hoof, G. Neumann, and J. Peters, “Non-parametric policy search with limited information loss,” Journal of Machine Learning Research, 2018.
    [BibTeX] [Abstract] [Download PDF]

    Learning complex control policies from non-linear and redundant sensory input is an important challenge for reinforcement learning algorithms. Non-parametric methods that approximate value functions or transition models can address this problem, by adapting to the complexity of the dataset. Yet, many current non-parametric approaches rely on unstable greedy maximization of approximate value functions, which might lead to poor convergence or oscillations in the policy update. A more robust policy update can be obtained by limiting the information loss between successive state-action distributions. In this paper, we develop a policy search algorithm with policy updates that are both robust and non-parametric. Our method can learn non-parametric control policies for infinite horizon continuous Markov decision processes with non-linear and redundant sensory representations. We investigate how we can use approximations of the kernel function to reduce the time requirements of the demanding non-parametric computations. In our experiments, we show the strong performance of the proposed method, and how it can be approximated efficiently. Finally, we show that our algorithm can learn a real-robot underpowered swing-up task directly from image data.

    @article{lirolem28020,
    author = {Herke van Hoof and Gerhard Neumann and Jan Peters},
    year = {2018},
    title = {Non-parametric policy search with limited information loss},
    publisher = {Journal of Machine Learning Research},
    journal = {Journal of Machine Learning Research},
    month = {December},
    url = {http://eprints.lincoln.ac.uk/28020/},
    abstract = {Learning complex control policies from non-linear and redundant sensory input is an important
    challenge for reinforcement learning algorithms. Non-parametric methods that
    approximate value functions or transition models can address this problem, by adapting
    to the complexity of the dataset. Yet, many current non-parametric approaches rely on
    unstable greedy maximization of approximate value functions, which might lead to poor
    convergence or oscillations in the policy update. A more robust policy update can be obtained
    by limiting the information loss between successive state-action distributions. In this
    paper, we develop a policy search algorithm with policy updates that are both robust and
    non-parametric. Our method can learn non-parametric control policies for infinite horizon
    continuous Markov decision processes with non-linear and redundant sensory representations.
    We investigate how we can use approximations of the kernel function to reduce the
    time requirements of the demanding non-parametric computations. In our experiments, we
    show the strong performance of the proposed method, and how it can be approximated efficiently.
    Finally, we show that our algorithm can learn a real-robot underpowered swing-up
    task directly from image data.}
    }
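
    The "limited information loss" above can be read as a Kullback-Leibler bound between successive state-action distributions; schematically, in assumed notation rather than the paper's own,

        \max_{\pi}\; \mathbb{E}_{(s,a)\sim p_\pi}\big[\, r(s,a) \,\big]
        \quad \text{s.t.} \quad
        \mathrm{KL}\big(\, p_\pi(s,a) \,\|\, q(s,a) \,\big) \le \epsilon,

    where q is the state-action distribution induced by the previous policy and epsilon bounds the information loss of each update.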

  • A. Paraschos, C. Daniel, J. Peters, and G. Neumann, “Using probabilistic movement primitives in robotics,” Autonomous Robots, vol. 42, iss. 3, pp. 529-551, 2018.
    [BibTeX] [Abstract] [Download PDF]

    Movement Primitives are a well-established paradigm for modular movement representation and generation. They provide a data-driven representation of movements and support generalization to novel situations, temporal modulation, sequencing of primitives and controllers for executing the primitive on physical systems. However, while many MP frameworks exhibit some of these properties, there is a need for a unified framework that implements all of them in a principled way. In this paper, we show that this goal can be achieved by using a probabilistic representation. Our approach models trajectory distributions learned from stochastic movements. Probabilistic operations, such as conditioning, can be used to achieve generalization to novel situations or to combine and blend movements in a principled way. We derive a stochastic feedback controller that reproduces the encoded variability of the movement and the coupling of the degrees of freedom of the robot. We evaluate and compare our approach on several simulated and real robot scenarios.

    @article{lirolem27883,
    pages = {529--551},
    month = {March},
    journal = {Autonomous Robots},
    title = {Using probabilistic movement primitives in robotics},
    year = {2018},
    author = {Alexandros Paraschos and Christian Daniel and Jan Peters and Gerhard Neumann},
    number = {3},
    volume = {42},
    publisher = {Springer Verlag},
    abstract = {Movement Primitives are a well-established
    paradigm for modular movement representation and
    generation. They provide a data-driven representation
    of movements and support generalization to novel situations,
    temporal modulation, sequencing of primitives
    and controllers for executing the primitive on physical
    systems. However, while many MP frameworks exhibit
    some of these properties, there is a need for a unified
    framework that implements all of them in a principled
    way. In this paper, we show that this goal can be
    achieved by using a probabilistic representation. Our
    approach models trajectory distributions learned from
    stochastic movements. Probabilistic operations, such as
    conditioning can be used to achieve generalization to
    novel situations or to combine and blend movements in
    a principled way. We derive a stochastic feedback controller
    that reproduces the encoded variability of the
    movement and the coupling of the degrees of freedom
    of the robot. We evaluate and compare our approach
    on several simulated and real robot scenarios.},
    url = {http://eprints.lincoln.ac.uk/27883/},
    }
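
    The probabilistic conditioning mentioned in the abstract amounts to standard Gaussian conditioning of a distribution over primitive weights on an observed via-point. A minimal numpy sketch under that assumption (the names mu_w, Sigma_w, Psi_t and Sigma_y are illustrative, not the paper's code):

        import numpy as np

        def condition_on_via_point(mu_w, Sigma_w, Psi_t, y_t, Sigma_y):
            # Condition w ~ N(mu_w, Sigma_w) on an observation y_t = Psi_t @ w + noise,
            # noise ~ N(0, Sigma_y), as used to generalize a primitive to a new target.
            S = Sigma_y + Psi_t @ Sigma_w @ Psi_t.T        # innovation covariance
            K = Sigma_w @ Psi_t.T @ np.linalg.inv(S)       # gain
            mu_new = mu_w + K @ (y_t - Psi_t @ mu_w)       # conditioned mean
            Sigma_new = Sigma_w - K @ Psi_t @ Sigma_w      # conditioned covariance
            return mu_new, Sigma_new

        # toy usage: 10 basis weights, condition on a one-dimensional via-point
        mu_w, Sigma_w = np.zeros(10), np.eye(10)
        Psi_t = np.random.randn(1, 10)                     # basis activations at time t
        mu_c, Sigma_c = condition_on_via_point(mu_w, Sigma_w, Psi_t,
                                               y_t=np.array([0.5]),
                                               Sigma_y=1e-4 * np.eye(1))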

2017

  • A. Abdolmaleki, B. Price, N. Lau, L. P. Reis, and G. Neumann, “Deriving and improving CMA-ES with Information geometric trust regions,” in The Genetic and Evolutionary Computation Conference (GECCO 2017), 2017.
    [BibTeX] [Abstract] [Download PDF]

    CMA-ES is one of the most popular stochastic search algorithms. It performs favourably in many tasks without the need for extensive parameter tuning. The algorithm has many beneficial properties, including automatic step-size adaptation, efficient covariance updates that incorporate the current samples as well as the evolution path, and its invariance properties. Its update rules are composed of well-established heuristics where the theoretical foundations of some of these rules are also well understood. In this paper we will fully derive all CMA-ES update rules within the framework of expectation-maximisation-based stochastic search algorithms using information-geometric trust regions. We show that the use of the trust region results in similar updates to CMA-ES for the mean and the covariance matrix while it allows for the derivation of an improved update rule for the step-size. Our new algorithm, Trust-Region Covariance Matrix Adaptation Evolution Strategy (TR-CMA-ES), is fully derived from first order optimization principles and performs favourably compared to the standard CMA-ES algorithm.

    @inproceedings{lirolem27056,
    booktitle = {The Genetic and Evolutionary Computation Conference (GECCO 2017)},
    month = {July},
    author = {Abbas Abdolmaleki and Bob Price and Nuno Lau and Luis Paulo Reis and Gerhard Neumann},
    year = {2017},
    title = {Deriving and improving CMA-ES with Information geometric trust regions},
    url = {http://eprints.lincoln.ac.uk/27056/},
    abstract = {CMA-ES is one of the most popular stochastic search algorithms.
    It performs favourably in many tasks without the need of extensive
    parameter tuning. The algorithm has many beneficial properties,
    including automatic step-size adaptation, efficient covariance updates
    that incorporates the current samples as well as the evolution
    path and its invariance properties. Its update rules are composed
    of well established heuristics where the theoretical foundations of
    some of these rules are also well understood. In this paper we
    will fully derive all CMA-ES update rules within the framework of
    expectation-maximisation-based stochastic search algorithms using
    information-geometric trust regions. We show that the use of the trust
    region results in similar updates to CMA-ES for the mean and the
    covariance matrix while it allows for the derivation of an improved
    update rule for the step-size. Our new algorithm, Trust-Region Covariance
    Matrix Adaptation Evolution Strategy (TR-CMA-ES) is
    fully derived from first order optimization principles and performs
    favourably in compare to standard CMA-ES algorithm.}
    }
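
    The information-geometric trust region referred to above can be sketched, in assumed notation, as maximising expected fitness over the Gaussian search distribution subject to a KL constraint to the previous distribution:

        \max_{\pi = \mathcal{N}(m,\, \sigma^2 C)}\; \mathbb{E}_{x \sim \pi}\big[\, f(x) \,\big]
        \quad \text{s.t.} \quad
        \mathrm{KL}\big(\, \pi \,\|\, \pi_{\text{old}} \,\big) \le \epsilon,

    from which updates for the mean, covariance matrix and step-size can be derived.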

  • A. Abdolmaleki, B. Price, N. Lau, P. Reis, and G. Neumann, “Contextual CMA-ES,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Many stochastic search algorithms are designed to optimize a fixed objective function to learn a task, i.e., if the objective function changes slightly, for example, due to a change in the situation or context of the task, relearning is required to adapt to the new context. For instance, if we want to learn a kicking movement for a soccer robot, we have to relearn the movement for different ball locations. Such relearning is undesired as it is highly inefficient and many applications require a fast adaptation to a new context/situation. Therefore, we investigate contextual stochastic search algorithms that can learn multiple, similar tasks simultaneously. Current contextual stochastic search methods are based on policy search algorithms and suffer from premature convergence and the need for parameter tuning. In this paper, we extend the well-known CMA-ES algorithm to the contextual setting and illustrate its performance on several contextual tasks. Our new algorithm, called contextual CMA-ES, leverages contextual learning while preserving all the features of standard CMA-ES such as stability, avoidance of premature convergence, step size control and a minimal amount of parameter tuning.

    @inproceedings{lirolem28141,
    title = {Contextual CMA-ES},
    year = {2017},
    author = {A. Abdolmaleki and B. Price and N. Lau and P. Reis and G. Neumann},
    month = {August},
    booktitle = {International Joint Conference on Artificial Intelligence (IJCAI)},
    abstract = {Many stochastic search algorithms are designed to optimize a fixed objective function to learn a task, i.e., if the objective function changes slightly, for example, due to a change in the situation or context of the task, relearning is required to adapt to the new context. For instance, if we want to learn a kicking movement for a soccer robot, we have to relearn the movement for different ball locations. Such relearning is undesired as it is highly inefficient and many applications require a fast adaptation to a new context/situation. Therefore, we investigate contextual stochastic search algorithms
    that can learn multiple, similar tasks simultaneously. Current contextual stochastic search methods are based on policy search algorithms and suffer from premature convergence and the need for parameter tuning. In this paper, we extend the well known CMA-ES algorithm to the contextual setting and illustrate its performance on several contextual
    tasks. Our new algorithm, called contextual CMAES, leverages from contextual learning while it preserves all the features of standard CMA-ES such as stability, avoidance of premature convergence, step size control and a minimal amount of parameter tuning.},
    url = {http://eprints.lincoln.ac.uk/28141/}
    }
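
    One common way to make such a search distribution contextual, given here only as an illustrative assumption about the general setup rather than the paper's exact parametrisation, is to let the mean of the Gaussian depend on context features while sharing the covariance:

        x \sim \mathcal{N}\big(\, W^{\top} \varphi(s),\; \sigma^2 C \,\big),

    where s is the context (for example, the ball location in the kicking example), \varphi(s) are context features, and W is adapted jointly with the step-size \sigma and covariance C.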

  • H. Abdulsamad, O. Arenz, J. Peters, and G. Neumann, “State-regularized policy search for linearized dynamical systems,” in Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of feedback controllers by taking advantage of local approximations of model dynamics and cost functions. Stability of the policy update is a major issue for these methods, rendering them hard to apply to highly nonlinear systems. Recent approaches combine classical Stochastic Optimal Control methods with information-theoretic bounds to control the step-size of the policy update and could even be used to train nonlinear deep control policies. These methods bound the relative entropy between the new and the old policy to ensure a stable policy update. However, despite the bound in policy space, the state distributions of two consecutive policies can still differ significantly, rendering the used local approximate models invalid. To alleviate this issue we propose enforcing a relative entropy constraint not only on the policy update, but also on the update of the state distribution, around which the dynamics and cost are being approximated. We present a derivation of the closed-form policy update and show that our approach outperforms related methods on two nonlinear and highly dynamic simulated systems.

    @inproceedings{lirolem27055,
    author = {Hany Abdulsamad and Oleg Arenz and Jan Peters and Gerhard Neumann},
    year = {2017},
    title = {State-regularized policy search for linearized dynamical systems},
    month = {June},
    booktitle = {Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS)},
    abstract = {Trajectory-Centric Reinforcement Learning and Trajectory
    Optimization methods optimize a sequence of feedback controllers
    by taking advantage of local approximations of
    model dynamics and cost functions. Stability of the policy update
    is a major issue for these methods, rendering them hard
    to apply for highly nonlinear systems. Recent approaches
    combine classical Stochastic Optimal Control methods with
    information-theoretic bounds to control the step-size of the
    policy update and could even be used to train nonlinear deep
    control policies. These methods bound the relative entropy
    between the new and the old policy to ensure a stable policy
    update. However, despite the bound in policy space, the
    state distributions of two consecutive policies can still differ
    significantly, rendering the used local approximate models invalid.
    To alleviate this issue we propose enforcing a relative
    entropy constraint not only on the policy update, but also on
    the update of the state distribution, around which the dynamics
    and cost are being approximated. We present a derivation
    of the closed-form policy update and show that our approach
    outperforms related methods on two nonlinear and highly dynamic
    simulated systems.},
    url = {http://eprints.lincoln.ac.uk/27055/}
    }
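
    The idea of regularising the state distribution in addition to the policy can be written schematically (assumed notation) as a policy update with two relative-entropy constraints:

        \max_{\pi}\; \mathbb{E}_{\pi}\big[\, \text{return} \,\big]
        \quad \text{s.t.} \quad
        \mathrm{KL}\big(\, \pi \,\|\, \pi_{\text{old}} \,\big) \le \epsilon_{\pi},
        \qquad
        \mathrm{KL}\big(\, \mu_{\pi}(s) \,\|\, \mu_{\pi_{\text{old}}}(s) \,\big) \le \epsilon_{\mu},

    where \mu_{\pi} is the state distribution induced by \pi, i.e. the distribution around which the local dynamics and cost approximations are built.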

  • R. Akrour, D. Sorokin, J. Peters, and G. Neumann, “Local Bayesian optimization of motor skills,” in International Conference on Machine Learning (ICML), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Bayesian optimization is renowned for its sample efficiency but its application to higher dimensional tasks is impeded by its focus on global optimization. To scale to higher dimensional problems, we leverage the sample efficiency of Bayesian optimization in a local context. The optimization of the acquisition function is restricted to the vicinity of a Gaussian search distribution which is moved towards high value areas of the objective. The proposed information-theoretic update of the search distribution results in a Bayesian interpretation of local stochastic search: the search distribution encodes prior knowledge on the optimum's location and is weighted at each iteration by the likelihood of this location's optimality. We demonstrate the effectiveness of our algorithm on several benchmark objective functions as well as a continuous robotic task in which an informative prior is obtained by imitation learning.

    @inproceedings{lirolem27902,
    author = {R. Akrour and D. Sorokin and J. Peters and G. Neumann},
    year = {2017},
    title = {Local Bayesian optimization of motor skills},
    month = {August},
    booktitle = {International Conference on Machine Learning (ICML)},
    url = {http://eprints.lincoln.ac.uk/27902/},
    abstract = {Bayesian optimization is renowned for its sample
    efficiency but its application to higher dimensional
    tasks is impeded by its focus on global
    optimization. To scale to higher dimensional
    problems, we leverage the sample efficiency of
    Bayesian optimization in a local context. The
    optimization of the acquisition function is restricted
    to the vicinity of a Gaussian search distribution
    which is moved towards high value areas
    of the objective. The proposed information-theoretic
    update of the search distribution results
    in a Bayesian interpretation of local stochastic
    search: the search distribution encodes prior
    knowledge on the optimum's location and is
    weighted at each iteration by the likelihood of
    this location's optimality. We demonstrate the
    effectiveness of our algorithm on several benchmark
    objective functions as well as a continuous
    robotic task in which an informative prior is obtained
    by imitation learning.}
    }
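
    A minimal sketch of the local search loop described above, with a placeholder acquisition function standing in for the GP-based acquisition and a simple elite-weighted update standing in for the information-theoretic one (all names and constants are illustrative assumptions):

        import numpy as np

        def local_bo_step(mu, Sigma, acquisition, n_candidates=100, elite_frac=0.2):
            # Optimize the acquisition only in the vicinity of the Gaussian search
            # distribution N(mu, Sigma), then move the distribution towards high-value areas.
            cand = np.random.multivariate_normal(mu, Sigma, size=n_candidates)
            scores = np.array([acquisition(x) for x in cand])
            elite = cand[np.argsort(scores)[-int(elite_frac * n_candidates):]]
            mu_new = elite.mean(axis=0)                             # shift the mean
            Sigma_new = np.cov(elite.T) + 1e-6 * np.eye(len(mu))    # adapt the local region
            return mu_new, Sigma_new

        # toy usage with a stand-in acquisition (negative distance to an assumed optimum at 1)
        mu, Sigma = np.zeros(2), np.eye(2)
        for _ in range(20):
            mu, Sigma = local_bo_step(mu, Sigma, lambda x: -np.sum((x - 1.0) ** 2))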

  • F. End, R. Akrour, J. Peters, and G. Neumann, “Layered direct policy search for learning hierarchical skills,” in International Conference on Robotics and Automation (ICRA), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Solutions to real world robotic tasks often require complex behaviors in high dimensional continuous state and action spaces. Reinforcement Learning (RL) is aimed at learning such behaviors but often fails for lack of scalability. To address this issue, Hierarchical RL (HRL) algorithms leverage hierarchical policies to exploit the structure of a task. However, many HRL algorithms rely on task specific knowledge such as a set of predefined sub-policies or sub-goals. In this paper we propose a new HRL algorithm based on information theoretic principles to autonomously uncover a diverse set of sub-policies and their activation policies. Moreover, the learning process mirrors the policy's structure and is thus also hierarchical, consisting of a set of independent optimization problems. The hierarchical structure of the learning process allows us to control the learning rate of the sub-policies and the gating individually and add specific information theoretic constraints to each layer to ensure the diversification of the sub-policies. We evaluate our algorithm on two high dimensional continuous tasks and experimentally demonstrate its ability to autonomously discover a rich set of sub-policies.

    @inproceedings{lirolem26737,
    booktitle = {International Conference on Robotics and Automation (ICRA)},
    month = {May},
    author = {F. End and R. Akrour and J. Peters and G. Neumann},
    year = {2017},
    title = {Layered direct policy search for learning hierarchical skills},
    url = {http://eprints.lincoln.ac.uk/26737/},
    abstract = {Solutions to real world robotic tasks often require
    complex behaviors in high dimensional continuous state and
    action spaces. Reinforcement Learning (RL) is aimed at learning
    such behaviors but often fails for lack of scalability. To
    address this issue, Hierarchical RL (HRL) algorithms leverage
    hierarchical policies to exploit the structure of a task. However,
    many HRL algorithms rely on task specific knowledge such
    as a set of predefined sub-policies or sub-goals. In this paper
    we propose a new HRL algorithm based on information
    theoretic principles to autonomously uncover a diverse set
    of sub-policies and their activation policies. Moreover, the
    learning process mirrors the policy's structure and is thus also
    hierarchical, consisting of a set of independent optimization
    problems. The hierarchical structure of the learning process
    allows us to control the learning rate of the sub-policies and
    the gating individually and add specific information theoretic
    constraints to each layer to ensure the diversification of the sub-policies.
    We evaluate our algorithm on two high dimensional
    continuous tasks and experimentally demonstrate its ability to
    autonomously discover a rich set of sub-policies.}
    }

  • F. B. Farraj, T. Osa, N. Pedemonte, J. Peters, G. Neumann, and P. R. Giordano, “A learning-based shared control architecture for interactive task execution,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Shared control is a key technology for various robotic applications in which a robotic system and a human operator are meant to collaborate efficiently. In order to achieve efficient task execution in shared control, it is essential to predict the desired behavior for a given situation or context to simplify the control task for the human operator. To do this prediction, we use Learning from Demonstration (LfD), which is a popular approach for transferring human skills to robots. We encode the demonstrated behavior as trajectory distributions and generalize the learned distributions to new situations. The goal of this paper is to present a shared control framework that uses learned expert distributions to gain more autonomy. Our approach controls the balance between the controller's autonomy and the human preference based on the distributions of the demonstrated trajectories. Moreover, the learned distributions are autonomously refined from collaborative task executions, resulting in a master-slave system with increasing autonomy that requires less user input with an increasing number of task executions. We experimentally validated that our shared control approach enables efficient task executions. Moreover, the conducted experiments demonstrated that the developed system improves its performance through interactive task executions with our shared control.

    @inproceedings{lirolem26738,
    author = {F. B. Farraj and T. Osa and N. Pedemonte and J. Peters and G. Neumann and P. R. Giordano},
    year = {2017},
    title = {A learning-based shared control architecture for interactive task execution},
    publisher = {IEEE},
    month = {May},
    booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
    url = {http://eprints.lincoln.ac.uk/26738/},
    abstract = {Shared control is a key technology for various
    robotic applications in which a robotic system and a human
    operator are meant to collaborate efficiently. In order to achieve
    efficient task execution in shared control, it is essential to
    predict the desired behavior for a given situation or context
    to simplify the control task for the human operator. To do this
    prediction, we use Learning from Demonstration (LfD), which is
    a popular approach for transferring human skills to robots. We
    encode the demonstrated behavior as trajectory distributions
    and generalize the learned distributions to new situations. The
    goal of this paper is to present a shared control framework
    that uses learned expert distributions to gain more autonomy.
    Our approach controls the balance between the controller's
    autonomy and the human preference based on the distributions
    of the demonstrated trajectories. Moreover, the learned
    distributions are autonomously refined from collaborative task
    executions, resulting in a master-slave system with increasing
    autonomy that requires less user input with an increasing
    number of task executions. We experimentally validated that
    our shared control approach enables efficient task executions.
    Moreover, the conducted experiments demonstrated that the
    developed system improves its performances through interactive
    task executions with our shared control.}
    }
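
    One way to read the balance between autonomy and human preference described above is as a variance-dependent blending of commands: where the demonstrated trajectory distribution is confident, the autonomous controller dominates. The sketch below is an illustrative assumption of such a scheme, not the paper's control law:

        import numpy as np

        def blended_command(u_human, mu_demo, var_demo, x, var_max=1.0):
            # Blend the operator's input with an autonomous command that tracks the
            # demonstrated mean; low demonstration variance yields more autonomy.
            u_auto = mu_demo - x                                   # track the demonstrated mean
            alpha = 1.0 - np.clip(var_demo / var_max, 0.0, 1.0)    # autonomy weight in [0, 1]
            return alpha * u_auto + (1.0 - alpha) * u_human

        u = blended_command(u_human=np.array([0.1]), mu_demo=np.array([0.6]),
                            var_demo=np.array([0.05]), x=np.array([0.4]))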

  • A. Gabriel, R. Akrour, J. Peters, and G. Neumann, “Empowered skills,” in International Conference on Robotics and Automation (ICRA), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Robot Reinforcement Learning (RL) algorithms return a policy that maximizes a global cumulative reward signal but typically do not create diverse behaviors. Hence, the policy will typically only capture a single solution of a task. However, many motor tasks have a large variety of solutions and the knowledge about these solutions can have several advantages. For example, in an adversarial setting such as robot table tennis, the lack of diversity renders the behavior predictable and hence easy to counter for the opponent. In an interactive setting such as learning from human feedback, an emphasis on diversity gives the human more opportunity for guiding the robot and helps to avoid the latter getting stuck in local optima of the task. In order to increase diversity of the learned behaviors, we leverage prior work on intrinsic motivation and empowerment. We derive a new intrinsic motivation signal by enriching the description of a task with an outcome space, representing interesting aspects of a sensorimotor stream. For example, in table tennis, the outcome space could be given by the return position and return ball speed. The intrinsic motivation is now given by the diversity of future outcomes, a concept also known as empowerment. We derive a new policy search algorithm that maximizes a trade-off between the extrinsic reward and this intrinsic motivation criterion. Experiments on a planar reaching task and simulated robot table tennis demonstrate that our algorithm can learn a diverse set of behaviors within the area of interest of the tasks.

    @inproceedings{lirolem26736,
    title = {Empowered skills},
    year = {2017},
    author = {A. Gabriel and R. Akrour and J. Peters and G. Neumann},
    month = {May},
    booktitle = {International Conference on Robotics and Automation (ICRA)},
    url = {http://eprints.lincoln.ac.uk/26736/},
    abstract = {Robot Reinforcement Learning (RL) algorithms
    return a policy that maximizes a global cumulative reward
    signal but typically do not create diverse behaviors. Hence, the
    policy will typically only capture a single solution of a task.
    However, many motor tasks have a large variety of solutions
    and the knowledge about these solutions can have several
    advantages. For example, in an adversarial setting such as
    robot table tennis, the lack of diversity renders the behavior
    predictable and hence easy to counter for the opponent. In an
    interactive setting such as learning from human feedback, an
    emphasis on diversity gives the human more opportunity for
    guiding the robot and to avoid the latter to be stuck in local
    optima of the task. In order to increase diversity of the learned
    behaviors, we leverage prior work on intrinsic motivation and
    empowerment. We derive a new intrinsic motivation signal by
    enriching the description of a task with an outcome space,
    representing interesting aspects of a sensorimotor stream. For
    example, in table tennis, the outcome space could be given
    by the return position and return ball speed. The intrinsic
    motivation is now given by the diversity of future outcomes,
    a concept also known as empowerment. We derive a new
    policy search algorithm that maximizes a trade-off between
    the extrinsic reward and this intrinsic motivation criterion.
    Experiments on a planar reaching task and simulated robot
    table tennis demonstrate that our algorithm can learn a diverse
    set of behaviors within the area of interest of the tasks.},
    }
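
    The trade-off described above can be summarised, in assumed notation, as maximising

        J(\theta) = \mathbb{E}_{\pi_\theta}\big[\, R \,\big] \; + \; \beta\, \mathcal{E}(\theta),

    where R is the extrinsic task reward, \mathcal{E} is the empowerment-style intrinsic term measuring the diversity of achievable outcomes (mutual information between actions and outcomes), and \beta controls the trade-off between the two.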

  • G. H. W. Gebhardt, K. Daun, M. Schnaubelt, A. Hendrich, D. Kauth, and G. Neumann, “Learning to assemble objects with a robot swarm,” in Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS 17), 2017, pp. 1547-1549.
    [BibTeX] [Abstract] [Download PDF]

    Large populations of simple robots can solve complex tasks, but controlling them is still a challenging problem, due to limited communication and computation power. Previous work has shown that a human controller can solve such an assembly task. Instead, we investigate how to learn the assembly of multiple objects with a single central controller. We propose splitting the assembly process into two sub-tasks: generating a top-level assembly policy and learning an object movement policy. The assembly policy plans the trajectories for each object and the object movement policy controls the trajectory execution. The resulting system is able to solve assembly tasks with varying object shapes being assembled as shown in multiple simulation scenarios.

    @inproceedings{lirolem28089,
    year = {2017},
    title = {Learning to assemble objects with a robot swarm},
    author = {Gregor H. W. Gebhardt and Kevin Daun and Marius Schnaubelt and Alexander Hendrich and Daniel Kauth and Gerhard Neumann},
    pages = {1547--1549},
    month = {May},
    publisher = {International Foundation for Autonomous Agents and Multiagent Systems},
    booktitle = {Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS 17)},
    note = {Extended abstract},
    url = {http://eprints.lincoln.ac.uk/28089/},
    abstract = {Large populations of simple robots can solve complex tasks, but controlling them is still a challenging problem, due to limited communication and computation power. Previous work has shown that a human controller can solve such an assembly task. Instead, we investigate how to learn the assembly of multiple objects with a single central controller. We propose splitting the assembly process into two sub-tasks -- generating a top-level assembly policy and learning an object movement policy. The assembly policy plans the trajectories for each object and the object movement policy controls the trajectory execution. The resulting system is able to solve assembly tasks with varying object shapes being assembled as shown in multiple simulation scenarios.}
    }

  • G. H. W. Gebhardt, A. Kupcsik, and G. Neumann, “The kernel Kalman rule: efficient nonparametric inference with recursive least squares,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
    [BibTeX] [Abstract] [Download PDF]

    Nonparametric inference techniques provide promising tools for probabilistic reasoning in high-dimensional nonlinear systems. Most of these techniques embed distributions into reproducing kernel Hilbert spaces (RKHS) and rely on the kernel Bayes' rule (KBR) to manipulate the embeddings. However, the computational demands of the KBR scale poorly with the number of samples and the KBR often suffers from numerical instabilities. In this paper, we present the kernel Kalman rule (KKR) as an alternative to the KBR. The derivation of the KKR is based on recursive least squares, inspired by the derivation of the Kalman innovation update. We apply the KKR to filtering tasks where we use RKHS embeddings to represent the belief state, resulting in the kernel Kalman filter (KKF). We show on a nonlinear state estimation task with high dimensional observations that our approach provides a significantly improved estimation accuracy while the computational demands are significantly decreased.

    @inproceedings{lirolem26739,
    booktitle = {Thirty-First AAAI Conference on Artificial Intelligence},
    month = {February},
    author = {G. H. W. Gebhardt and A. Kupcsik and G. Neumann},
    publisher = {AAAI},
    year = {2017},
    title = {The kernel Kalman rule: efficient nonparametric inference with recursive least squares},
    abstract = {Nonparametric inference techniques provide promising tools
    for probabilistic reasoning in high-dimensional nonlinear systems.
    Most of these techniques embed distributions into reproducing
    kernel Hilbert spaces (RKHS) and rely on the kernel
    Bayes' rule (KBR) to manipulate the embeddings. However,
    the computational demands of the KBR scale poorly
    with the number of samples and the KBR often suffers from
    numerical instabilities. In this paper, we present the kernel
    Kalman rule (KKR) as an alternative to the KBR. The derivation
    of the KKR is based on recursive least squares, inspired
    by the derivation of the Kalman innovation update. We apply
    the KKR to filtering tasks where we use RKHS embeddings
    to represent the belief state, resulting in the kernel Kalman filter
    (KKF). We show on a nonlinear state estimation task with
    high dimensional observations that our approach provides a
    significantly improved estimation accuracy while the computational
    demands are significantly decreased.},
    url = {http://eprints.lincoln.ac.uk/26739/},
    }
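
    The Kalman-style structure of the update can be sketched, purely schematically and in assumed notation, as an innovation update of the mean embedding m_t of the belief state:

        m_t^{+} = m_t^{-} + Q_t\,\big(\, \phi(y_t) - C\, m_t^{-} \,\big),

    where \phi(y_t) is the feature embedding of the new observation, C an observation operator and Q_t a gain obtained by recursive least squares; in the paper all operators are estimated from samples in the RKHS.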

  • A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, “Model-based contextual policy search for data-efficient generalization of robot skills,” Artificial Intelligence, vol. 247, pp. 415-439, 2017.
    [BibTeX] [Abstract] [Download PDF]

    In robotics, lower-level controllers are typically used to make the robot solve a specific task in a fixed context. For example, the lower-level controller can encode a hitting movement while the context defines the target coordinates to hit. However, in many learning problems the context may change between task executions. To adapt the policy to a new context, we utilize a hierarchical approach by learning an upper-level policy that generalizes the lower-level controllers to new contexts. A common approach to learn such upper-level policies is to use policy search. However, the majority of current contextual policy search approaches are model-free and require a high number of interactions with the robot and its environment. Model-based approaches are known to significantly reduce the number of robot experiments; however, current model-based techniques cannot be applied straightforwardly to the problem of learning contextual upper-level policies. They rely on specific parametrizations of the policy and the reward function, which are often unrealistic in the contextual policy search formulation. In this paper, we propose a novel model-based contextual policy search algorithm that is able to generalize lower-level controllers, and is data-efficient. Our approach is based on learned probabilistic forward models and information theoretic policy search. Unlike current algorithms, our method does not require any assumption on the parametrization of the policy or the reward function. We show on complex simulated robotic tasks and in a real robot experiment that the proposed learning framework speeds up the learning process by up to two orders of magnitude in comparison to existing methods, while learning high quality policies.

    @article{lirolem25774,
    author = {A. Kupcsik and M. P. Deisenroth and J. Peters and A. P. Loh and P. Vadakkepat and G. Neumann},
    year = {2017},
    title = {Model-based contextual policy search for data-efficient generalization of robot skills},
    journal = {Artificial Intelligence},
    pages = {415--439},
    month = {June},
    publisher = {Elsevier},
    volume = {247},
    abstract = {In robotics, lower-level controllers are typically used to make the robot solve a specific task in a fixed context. For example, the lower-level controller can encode a hitting movement while the context defines the target coordinates to hit. However, in many learning problems the context may change between task executions. To adapt the policy to a new context, we utilize a hierarchical approach by learning an upper-level policy that generalizes the lower-level controllers to new contexts. A common approach to learn such upper-level policies is to use policy search. However, the majority of current contextual policy search approaches are model-free and require a high number of interactions with the robot and its environment. Model-based approaches are known to significantly reduce the amount of robot experiments, however, current model-based techniques cannot be applied straightforwardly to the problem of learning contextual upper-level policies. They rely on specific parametrizations of the policy and the reward function, which are often unrealistic in the contextual policy search formulation. In this paper, we propose a novel model-based contextual policy search algorithm that is able to generalize lower-level controllers, and is data-efficient. Our approach is based on learned probabilistic forward models and information theoretic policy search. Unlike current algorithms, our method does not require any assumption on the parametrization of the policy or the reward function. We show on complex simulated robotic tasks and in a real robot experiment that the proposed learning framework speeds up the learning process by up to two orders of magnitude in comparison to existing methods, while learning high quality policies.},
    url = {http://eprints.lincoln.ac.uk/25774/},
    }

  • R. Lioutikov, G. Neumann, G. Maeda, and J. Peters, “Learning movement primitive libraries through probabilistic segmentation,” International Journal of Robotics Research (IJRR), vol. 36, iss. 8, pp. 879-894, 2017.
    [BibTeX] [Abstract] [Download PDF]

    Movement primitives are a well established approach for encoding and executing movements. While the primitives themselves have been extensively researched, the concept of movement primitive libraries has not received similar attention. Libraries of movement primitives represent the skill set of an agent. Primitives can be queried and sequenced in order to solve specific tasks. The goal of this work is to segment unlabeled demonstrations into a representative set of primitives. Our proposed method differs from current approaches by taking advantage of the often neglected, mutual dependencies between the segments contained in the demonstrations and the primitives to be encoded. By exploiting this mutual dependency, we show that we can improve both the segmentation and the movement primitive library. Based on probabilistic inference our novel approach segments the demonstrations while learning a probabilistic representation of movement primitives. We demonstrate our method on two real robot applications. First, the robot segments sequences of different letters into a library, explaining the observed trajectories. Second, the robot segments demonstrations of a chair assembly task into a movement primitive library. The library is subsequently used to assemble the chair in an order not present in the demonstrations.

    @article{lirolem28021,
    year = {2017},
    title = {Learning movement primitive libraries through probabilistic segmentation},
    number = {8},
    author = {Rudolf Lioutikov and Gerhard Neumann and Guilherme Maeda and Jan Peters},
    month = {July},
    pages = {879--894},
    journal = {International Journal of Robotics Research (IJRR)},
    publisher = {SAGE},
    volume = {36},
    url = {http://eprints.lincoln.ac.uk/28021/},
    abstract = {Movement primitives are a well established approach for encoding and executing movements. While the primitives
    themselves have been extensively researched, the concept of movement primitive libraries has not received similar
    attention. Libraries of movement primitives represent the skill set of an agent. Primitives can be queried and sequenced
    in order to solve specific tasks. The goal of this work is to segment unlabeled demonstrations into a representative
    set of primitives. Our proposed method differs from current approaches by taking advantage of the often neglected,
    mutual dependencies between the segments contained in the demonstrations and the primitives to be encoded. By
    exploiting this mutual dependency, we show that we can improve both the segmentation and the movement primitive
    library. Based on probabilistic inference our novel approach segments the demonstrations while learning a probabilistic
    representation of movement primitives. We demonstrate our method on two real robot applications. First, the robot
    segments sequences of different letters into a library, explaining the observed trajectories. Second, the robot segments
    demonstrations of a chair assembly task into a movement primitive library. The library is subsequently used to assemble the chair in an order not present in the demonstrations.}
    }

  • G. J. Maeda, G. Neumann, M. Ewerton, R. Lioutikov, O. Kroemer, and J. Peters, “Probabilistic movement primitives for coordination of multiple human-robot collaborative tasks,” Autonomous Robots, vol. 41, iss. 3, pp. 593-612, 2017.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes an interaction learning method for collaborative and assistive robots based on movement primitives. The method allows for both action recognition and human-robot movement coordination. It uses imitation learning to construct a mixture model of human-robot interaction primitives. This probabilistic model allows the assistive trajectory of the robot to be inferred from human observations. The method is scalable in relation to the number of tasks and can learn nonlinear correlations between the trajectories that describe the human-robot interaction. We evaluated the method experimentally with a lightweight robot arm in a variety of assistive scenarios, including the coordinated handover of a bottle to a human, and the collaborative assembly of a toolbox. Potential applications of the method are personal caregiver robots, control of intelligent prosthetic devices, and robot coworkers in factories.

    @article{lirolem25744,
    publisher = {Springer},
    note = {Special Issue on Assistive and Rehabilitation Robotics},
    volume = {41},
    number = {3},
    author = {G. J. Maeda and G. Neumann and M. Ewerton and R. Lioutikov and O. Kroemer and J. Peters},
    title = {Probabilistic movement primitives for coordination of multiple human-robot collaborative tasks},
    year = {2017},
    journal = {Autonomous Robots},
    month = {March},
    pages = {593--612},
    abstract = {This paper proposes an interaction learning method for collaborative and assistive robots based on movement primitives. The method allows for both action recognition and human-robot movement coordination. It uses imitation learning to construct a mixture model of human-robot interaction primitives. This probabilistic model allows the assistive trajectory of the robot to be inferred from human observations. The method is scalable in relation to the number of tasks and can learn nonlinear correlations between the trajectories that describe the human-robot interaction. We evaluated the method experimentally with a lightweight robot arm in a variety of assistive scenarios, including the coordinated handover of a bottle to a human, and the collaborative assembly of a toolbox. Potential applications of the method are personal caregiver robots, control of intelligent prosthetic devices, and robot coworkers in factories.},
    url = {http://eprints.lincoln.ac.uk/25744/},
    }

  • G. Maeda, M. Ewerton, G. Neumann, R. Lioutikov, and J. Peters, “Phase estimation for fast action recognition and trajectory generation in human-robot collaboration,” The International Journal of Robotics Research, vol. 36, iss. 13-14, pp. 1579-1594, 2017.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a method to achieve fast and fluid human-robot interaction by estimating the progress of the movement of the human. The method allows the progress, also referred to as the phase of the movement, to be estimated even when observations of the human are partial and occluded; a problem typically found when using motion capture systems in cluttered environments. By leveraging on the framework of Interaction Probabilistic Movement Primitives, phase estimation makes it possible to classify the human action, and to generate a corresponding robot trajectory before the human finishes his/her movement. The method is therefore suited for semi-autonomous robots acting as assistants and coworkers. Since observations may be sparse, our method is based on computing the probability of different phase candidates to find the phase that best aligns the Interaction Probabilistic Movement Primitives with the current observations. The method is fundamentally different from approaches based on Dynamic Time Warping that must rely on a consistent stream of measurements at runtime. The resulting framework can achieve phase estimation, action recognition and robot trajectory coordination using a single probabilistic representation. We evaluated the method using a seven-degree-of-freedom lightweight robot arm equipped with a five-finger hand in single and multi-task collaborative experiments. We compare the accuracy achieved by phase estimation with our previous method based on dynamic time warping.

    @article{lirolem26734,
    publisher = {SAGE},
    volume = {36},
    year = {2017},
    title = {Phase estimation for fast action recognition and trajectory generation in human-robot collaboration},
    number = {13-14},
    author = {Guilherme Maeda and Marco Ewerton and Gerhard Neumann and Rudolf Lioutikov and Jan Peters},
    month = {December},
    pages = {1579--1594},
    journal = {The International Journal of Robotics Research},
    abstract = {This paper proposes a method to achieve fast and fluid human-robot interaction by estimating the progress of the movement of the human. The method allows the progress, also referred to as the phase of the movement, to be estimated even when observations of the human are partial and occluded; a problem typically found when using motion capture systems in cluttered environments. By leveraging on the framework of Interaction Probabilistic Movement Primitives, phase estimation makes it possible to classify the human action, and to generate a corresponding robot trajectory before the human finishes his/her movement. The method is therefore suited for semi-autonomous robots acting as assistants and coworkers. Since observations may be sparse, our method is based on computing the probability of different phase candidates to find the phase that best aligns the Interaction Probabilistic Movement Primitives with the current observations. The method is fundamentally different from approaches based on Dynamic Time Warping that must rely on a consistent stream of measurements at runtime. The resulting framework can achieve phase estimation, action recognition and robot trajectory coordination using a single probabilistic representation. We evaluated the method using a seven-degree-of-freedom lightweight robot arm equipped with a five-finger hand in single and multi-task collaborative experiments. We compare the accuracy achieved by phase estimation with our previous method based on dynamic time warping.},
    url = {http://eprints.lincoln.ac.uk/26734/},
    keywords = {ARRAY(0x56147fc37088)}
    }
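
    A minimal sketch of the phase-candidate search described above: a partial, possibly sparse observation is scored against a time-scaled trajectory distribution for every candidate phase, and the best-aligned candidate is kept. The Gaussian trajectory model and all names here are illustrative assumptions, not the paper's implementation:

        import numpy as np

        def estimate_phase(obs_t, obs_y, mean_fn, var_fn, candidates):
            # obs_t: times of the (sparse) observations, obs_y: observed values.
            # mean_fn/var_fn map a normalised phase in [0, 1] to the learned
            # trajectory mean and variance; candidates are hypothesised durations.
            best, best_ll = None, -np.inf
            for z in candidates:
                phase = np.clip(obs_t / z, 0.0, 1.0)       # align observations to the primitive
                mu, var = mean_fn(phase), var_fn(phase)
                ll = -0.5 * np.sum((obs_y - mu) ** 2 / var + np.log(2 * np.pi * var))
                if ll > best_ll:
                    best, best_ll = z, ll
            return best

        # toy usage with a sinusoidal stand-in for the learned primitive
        z_hat = estimate_phase(obs_t=np.array([0.1, 0.3]), obs_y=np.array([0.31, 0.81]),
                               mean_fn=lambda p: np.sin(np.pi * p),
                               var_fn=lambda p: 0.01 + 0.0 * p,
                               candidates=np.linspace(0.5, 2.0, 16))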

  • T. Osa, A. G. M. Esfahani, R. Stolkin, R. Lioutikov, J. Peters, and G. Neumann, “Guiding trajectory optimization by demonstrated distributions,” IEEE Robotics and Automation Letters (RA-L), vol. 2, iss. 2, pp. 819-826, 2017.
    [BibTeX] [Abstract] [Download PDF]

    Trajectory optimization is an essential tool for motion planning under multiple constraints of robotic manipulators. Optimization-based methods can explicitly optimize a trajectory by leveraging prior knowledge of the system and have been used in various applications such as collision avoidance. However, these methods often require a hand-coded cost function in order to achieve the desired behavior. Specifying such a cost function for a complex desired behavior, e.g., disentangling a rope, is a nontrivial task that is often even infeasible. Learning from demonstration (LfD) methods offer an alternative way to program robot motion. LfD methods are less dependent on analytical models and instead learn the behavior of experts implicitly from the demonstrated trajectories. However, the problem of adapting the demonstrations to new situations, e.g., avoiding newly introduced obstacles, has not been fully investigated in the literature. In this paper, we present a motion planning framework that combines the advantages of optimization-based and demonstration-based methods. We learn a distribution of trajectories demonstrated by human experts and use it to guide the trajectory optimization process. The resulting trajectory maintains the demonstrated behaviors, which are essential to performing the task successfully, while adapting the trajectory to avoid obstacles. In simulated experiments and with a real robotic system, we verify that our approach optimizes the trajectory to avoid obstacles and encodes the demonstrated behavior in the resulting trajectory.

    @article{lirolem26731,
    volume = {2},
    publisher = {IEEE},
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    month = {January},
    pages = {819--826},
    number = {2},
    author = {Takayuki Osa and Amir M. Ghalamzan Esfahani and Rustam Stolkin and Rudolf Lioutikov and Jan Peters and Gerhard Neumann},
    title = {Guiding trajectory optimization by demonstrated distributions},
    year = {2017},
    url = {http://eprints.lincoln.ac.uk/26731/},
    abstract = {Trajectory optimization is an essential tool for motion
    planning under multiple constraints of robotic manipulators.
    Optimization-based methods can explicitly optimize a trajectory
    by leveraging prior knowledge of the system and have been used
    in various applications such as collision avoidance. However, these
    methods often require a hand-coded cost function in order to
    achieve the desired behavior. Specifying such cost function for
    a complex desired behavior, e.g., disentangling a rope, is a nontrivial
    task that is often even infeasible. Learning from demonstration
    (LfD) methods offer an alternative way to program robot
    motion. LfD methods are less dependent on analytical models
    and instead learn the behavior of experts implicitly from the
    demonstrated trajectories. However, the problem of adapting the
    demonstrations to new situations, e.g., avoiding newly introduced
    obstacles, has not been fully investigated in the literature. In this
    paper, we present a motion planning framework that combines
    the advantages of optimization-based and demonstration-based
    methods. We learn a distribution of trajectories demonstrated by
    human experts and use it to guide the trajectory optimization
    process. The resulting trajectory maintains the demonstrated
    behaviors, which are essential to performing the task successfully,
    while adapting the trajectory to avoid obstacles. In simulated
    experiments and with a real robotic system, we verify that our
    approach optimizes the trajectory to avoid obstacles and encodes
    the demonstrated behavior in the resulting trajectory},
    }
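
    The combination of optimisation-based and demonstration-based objectives can be summarised, in assumed notation, as optimising a trajectory \xi under a cost that adds a penalty for deviating from the demonstrated distribution to the usual task and obstacle terms:

        U(\xi) = U_{\text{task}}(\xi) \; + \; \lambda\, \big(\xi - \mu_d\big)^{\top} \Sigma_d^{-1} \big(\xi - \mu_d\big),

    where \mu_d and \Sigma_d describe the trajectory distribution learned from the human demonstrations, so the second term is, up to a constant, its negative log-likelihood.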

  • J. Pajarinen, V. Kyrki, M. Koval, S. Srinivasa, J. Peters, and G. Neumann, “Hybrid control trajectory optimization under uncertainty,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Trajectory optimization is a fundamental problem in robotics. While optimization of continuous control trajectories is well developed, many applications require both discrete and continuous, i.e. hybrid controls. Finding an optimal sequence of hybrid controls is challenging due to the exponential explosion of discrete control combinations. Our method, based on Differential Dynamic Programming (DDP), circumvents this problem by incorporating discrete actions inside DDP: we first optimize continuous mixtures of discrete actions and subsequently force the mixtures into fully discrete actions. Moreover, we show how our approach can be extended to partially observable Markov decision processes (POMDPs) for trajectory planning under uncertainty. We validate the approach in a car driving problem where the robot has to switch discrete gears and in a box pushing application where the robot can switch the side of the box to push. The pose and the friction parameters of the pushed box are initially unknown and only indirectly observable.

    @inproceedings{lirolem28257,
    title = {Hybrid control trajectory optimization under uncertainty},
    year = {2017},
    author = {J. Pajarinen and V. Kyrki and M. Koval and S. Srinivasa and J. Peters and G. Neumann},
    month = {September},
    booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    abstract = {Trajectory optimization is a fundamental problem in robotics. While optimization of continuous control trajectories is well developed, many applications require both discrete and continuous, i.e. hybrid controls. Finding an optimal sequence of hybrid controls is challenging due to the exponential explosion of discrete control combinations. Our method, based on Differential Dynamic Programming (DDP), circumvents this problem by incorporating discrete actions inside DDP: we first optimize continuous mixtures of discrete actions, and, subsequently force the mixtures into fully discrete actions. Moreover, we show how our approach can be extended to partially observable Markov decision processes (POMDPs) for trajectory planning under uncertainty. We validate the approach in a car driving problem where the robot has to switch discrete gears and in a box pushing application where the robot can switch the side of the box to push. The pose and the friction parameters of the pushed box are initially unknown and only indirectly observable.},
    url = {http://eprints.lincoln.ac.uk/28257/},
    keywords = {ARRAY(0x56147fc36ba8)}
    }

  • A. Paraschos, R. Lioutikov, J. Peters, and G. Neumann, “Probabilistic prioritization of movement primitives,” IEEE Robotics and Automation Letters, vol. PP, iss. 99, 2017.
    [BibTeX] [Abstract] [Download PDF]

    Movement prioritization is a common approach to combine controllers of different tasks for redundant robots, where each task is assigned a priority. The priorities of the tasks are often hand-tuned or the result of an optimization, but seldom learned from data. This paper combines Bayesian task prioritization with probabilistic movement primitives to prioritize full motion sequences that are learned from demonstrations. Probabilistic movement primitives (ProMPs) can encode distributions of movements over full motion sequences and provide control laws to exactly follow these distributions. The probabilistic formulation allows for a natural application of Bayesian task prioritization. We extend the ProMP controllers with an additional feedback component that accounts for inaccuracies in following the distribution and allows for a more robust prioritization of primitives. We demonstrate how the task priorities can be obtained from imitation learning and how different primitives can be combined to solve even unseen task-combinations. Due to the prioritization, our approach can efficiently learn a combination of tasks without requiring individual models per task combination. Further, our approach can adapt an existing primitive library by prioritizing additional controllers, for example, for implementing obstacle avoidance. Hence, the need of retraining the whole library is avoided in many cases. We evaluate our approach on reaching movements under constraints with redundant simulated planar robots and two physical robot platforms, the humanoid robot 'iCub' and a KUKA LWR robot arm.

    @article{lirolem27901,
    number = {99},
    author = {Alexandros Paraschos and Rudolf Lioutikov and Jan Peters and Gerhard Neumann},
    title = {Probabilistic prioritization of movement primitives},
    year = {2017},
    journal = {IEEE Robotics and Automation Letters},
    month = {July},
    publisher = {IEEE},
    booktitle = {Proceedings of the International Conference on Intelligent Robots and Systems (IROS) and IEEE Robotics and Automation Letters (RA-L)},
    volume = {PP},
    abstract = {Movement prioritization is a common approach
    to combine controllers of different tasks for redundant robots,
    where each task is assigned a priority. The priorities of the
    tasks are often hand-tuned or the result of an optimization,
    but seldomly learned from data. This paper combines Bayesian
    task prioritization with probabilistic movement primitives to
    prioritize full motion sequences that are learned from demonstrations.
    Probabilistic movement primitives (ProMPs) can
    encode distributions of movements over full motion sequences
    and provide control laws to exactly follow these distributions.
    The probabilistic formulation allows for a natural application of
    Bayesian task prioritization. We extend the ProMP controllers
    with an additional feedback component that accounts inaccuracies
    in following the distribution and allows for a more
    robust prioritization of primitives. We demonstrate how the
    task priorities can be obtained from imitation learning and
    how different primitives can be combined to solve even unseen
    task-combinations. Due to the prioritization, our approach can
    efficiently learn a combination of tasks without requiring individual
    models per task combination. Further, our approach can
    adapt an existing primitive library by prioritizing additional
    controllers, for example, for implementing obstacle avoidance.
    Hence, the need of retraining the whole library is avoided in
    many cases. We evaluate our approach on reaching movements
    under constraints with redundant simulated planar robots and
    two physical robot platforms, the humanoid robot iCub and
    a KUKA LWR robot arm.},
    url = {http://eprints.lincoln.ac.uk/27901/},
    }
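
    One way to picture the Bayesian prioritization described above is as a precision-weighted product of Gaussian task distributions, where a scalar priority scales how strongly each task's distribution is trusted. The sketch below shows only that fusion step on a hypothetical 2-DoF example; it is not the paper's ProMP feedback controller, and all numbers are made up.

    import numpy as np

    def fuse(mus, covs, priorities):
        """Product of Gaussians with each task's precision scaled by its priority."""
        precision = sum(p * np.linalg.inv(S) for p, S in zip(priorities, covs))
        cov = np.linalg.inv(precision)
        mean = cov @ sum(p * np.linalg.inv(S) @ m
                         for p, m, S in zip(priorities, mus, covs))
        return mean, cov

    # Hypothetical tasks: task A cares about joint 1, task B cares about joint 2.
    mu_a, S_a = np.array([0.5, 0.0]), np.diag([0.01, 1.0])    # confident only about joint 1
    mu_b, S_b = np.array([0.0, -0.3]), np.diag([1.0, 0.01])   # confident only about joint 2
    mean, cov = fuse([mu_a, mu_b], [S_a, S_b], priorities=[1.0, 0.5])
    print(mean)   # close to [0.5, -0.3]: each task dominates where it is confident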

  • V. Tangkaratt, H. van Hoof, S. Parisi, G. Neumann, J. Peters, and M. Sugiyama, “Policy search with high-dimensional context variables,” in AAAI Conference on Artificial Intelligence (AAAI), 2017.
    [BibTeX] [Abstract] [Download PDF]

    Direct contextual policy search methods learn to improve policy parameters and simultaneously generalize these parameters to different context or task variables. However, learning from high-dimensional context variables, such as camera images, is still a prominent problem in many real-world tasks. A naive application of unsupervised dimensionality reduction methods to the context variables, such as principal component analysis, is insufficient as task-relevant input may be ignored. In this paper, we propose a contextual policy search method in the model-based relative entropy stochastic search framework with integrated dimensionality reduction. We learn a model of the reward that is locally quadratic in both the policy parameters and the context variables. Furthermore, we perform supervised linear dimensionality reduction on the context variables by nuclear norm regularization. The experimental results show that the proposed method outperforms naive dimensionality reduction via principal component analysis and a state-of-the-art contextual policy search method.

    @inproceedings{lirolem26740,
    publisher = {Association for the Advancement of Artificial Intelligence},
    title = {Policy search with high-dimensional context variables},
    year = {2017},
    author = {V. Tangkaratt and H. van Hoof and S. Parisi and G. Neumann and J. Peters and M. Sugiyama},
    month = {February},
    booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
    url = {http://eprints.lincoln.ac.uk/26740/},
    abstract = {Direct contextual policy search methods learn to improve policy
    parameters and simultaneously generalize these parameters
    to different context or task variables. However, learning
    from high-dimensional context variables, such as camera images,
    is still a prominent problem in many real-world tasks.
    A naive application of unsupervised dimensionality reduction
    methods to the context variables, such as principal component
    analysis, is insufficient as task-relevant input may be ignored.
    In this paper, we propose a contextual policy search method in
    the model-based relative entropy stochastic search framework
    with integrated dimensionality reduction. We learn a model of
    the reward that is locally quadratic in both the policy parameters
    and the context variables. Furthermore, we perform supervised
    linear dimensionality reduction on the context variables
    by nuclear norm regularization. The experimental results
    show that the proposed method outperforms naive dimensionality
    reduction via principal component analysis and
    a state-of-the-art contextual policy search method.}
    }

2016

  • A. Abdolmaleki, N. Lau, L. P. Reis, and G. Neumann, “Contextual stochastic search,” in Genetic and Evolutionary Computation Conference GECCO 2016, 2016, pp. 29-30.
    [BibTeX] [Abstract] [Download PDF]

    Stochastic search algorithms have recently also gained a lot of attention in operations research, machine learning and policy search of robot motor skills due to their ease of use and their generality. Yet, many stochastic search algorithms require relearning if the task changes slightly to adapt the solution to the new situation or the new context. Therefore, we consider the contextual stochastic search setup. Here, we want to find good parameter vectors for multiple related tasks, where each task is described by a continuous context vector. Hence, the objective might change slightly for each parameter vector evaluation. In this research, we investigate contextual stochastic search algorithms that can learn from multiple tasks simultaneously.

    @inproceedings{lirolem25679,
    author = {A. Abdolmaleki and N. Lau and L. Paulo Reis and G. Neumann},
    publisher = {ACM},
    year = {2016},
    title = {Contextual stochastic search},
    month = {July},
    booktitle = {Genetic and Evolutionary Computation Conference GECCO 2016},
    pages = {29--30},
    abstract = {Stochastic search algorithms have recently also gained a lot of attention in operations research, machine learning and policy search of robot motor skills due to their ease of use and their generality. Yet, many stochastic search algorithms require relearning if the task changes slightly to adapt the solution to the new situation or the new context. Therefore we consider the contextual stochastic search setup. Here, we want to find good parameter vectors for multiple related tasks, where each task is described by a continuous context vector. Hence, the objective might change slightly for each parameter vector evaluation. In this research, we investigate the contextual stochastic search algorithms that can learn from multiple tasks simultaneously.},
    url = {http://eprints.lincoln.ac.uk/25679/},
    }

  • A. Abdolmaleki, N. Lau, L. P. Reis, and G. Neumann, “Non-parametric contextual stochastic search,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, 2016, pp. 2643-2648.
    [BibTeX] [Abstract] [Download PDF]

    Stochastic search algorithms are black-box optimizers of an objective function. They have recently gained a lot of attention in operations research, machine learning and policy search of robot motor skills due to their ease of use and their generality. Yet, many stochastic search algorithms require relearning if the task or objective function changes slightly to adapt the solution to the new situation or the new context. In this paper, we consider the contextual stochastic search setup. Here, we want to find multiple good parameter vectors for multiple related tasks, where each task is described by a continuous context vector. Hence, the objective function might change slightly for each parameter vector evaluation of a task or context. Contextual algorithms have been investigated in the field of policy search, however, the search distribution typically uses a parametric model that is linear in some hand-defined context features. Finding good context features is a challenging task, and hence, non-parametric methods are often preferred over their parametric counterparts. In this paper, we propose a non-parametric contextual stochastic search algorithm that can learn a non-parametric search distribution for multiple tasks simultaneously. In contrast to existing methods, our method can also learn a context-dependent covariance matrix that guides the exploration of the search process. We illustrate its performance on several non-linear contextual tasks.

    @inproceedings{lirolem25738,
    author = {A. Abdolmaleki and N. Lau and L.P. Reis and G. Neumann},
    year = {2016},
    title = {Non-parametric contextual stochastic search},
    journal = {IEEE International Conference on Intelligent Robots and Systems},
    month = {October},
    pages = {2643--2648},
    booktitle = {Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on},
    volume = {2016-N},
    url = {http://eprints.lincoln.ac.uk/25738/},
    abstract = {Stochastic search algorithms are black-box optimizer of an objective function. They have recently gained a lot of attention in operations research, machine learning and policy search of robot motor skills due to their ease of use and their generality. Yet, many stochastic search algorithms require relearning if the task or objective function changes slightly to adapt the solution to the new situation or the new context. In this paper, we consider the contextual stochastic search setup. Here, we want to find multiple good parameter vectors for multiple related tasks, where each task is described by a continuous context vector. Hence, the objective function might change slightly for each parameter vector evaluation of a task or context. Contextual algorithms have been investigated in the field of policy search, however, the search distribution typically uses a parametric model that is linear in the some hand-defined context features. Finding good context features is a challenging task, and hence, non-parametric methods are often preferred over their parametric counter-parts. In this paper, we propose a non-parametric contextual stochastic search algorithm that can learn a non-parametric search distribution for multiple tasks simultaneously. In difference to existing methods, our method can also learn a context dependent covariance matrix that guides the exploration of the search process. We illustrate its performance on several non-linear contextual tasks.}
    }

  • A. Abdolmaleki, R. Lioutikov, N. Lau, L. P. Reis, J. Peters, and G. Neumann, “Model-based relative entropy stochastic search,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 153-154.
    [BibTeX] [Abstract] [Download PDF]

    Stochastic search algorithms are general black-box optimizers. Due to their ease of use and their generality, they have recently also gained a lot of attention in operations research, machine learning and policy search. Yet, these algorithms require a lot of evaluations of the objective, scale poorly with the problem dimension, are affected by highly noisy objective functions and may converge prematurely. To alleviate these problems, we introduce a new surrogate-based stochastic search approach. We learn simple, quadratic surrogate models of the objective function. As the quality of such a quadratic approximation is limited, we do not greedily exploit the learned models. The algorithm can be misled by an inaccurate optimum introduced by the surrogate. Instead, we use information theoretic constraints to bound the "distance" between the new and old data distribution while maximizing the objective function. Additionally, the new method is able to sustain the exploration of the search distribution to avoid premature convergence. We compare our method with state-of-the-art black-box optimization methods on standard uni-modal and multi-modal optimization functions, on simulated planar robot tasks and a complex robot ball throwing task. The proposed method considerably outperforms the existing approaches.

    @inproceedings{lirolem25741,
    author = {A. Abdolmaleki and R. Lioutikov and N. Lau and L. Paulo Reis and J. Peters and G. Neumann},
    year = {2016},
    title = {Model-based relative entropy stochastic search},
    journal = {GECCO 2016 Companion - Proceedings of the 2016 Genetic and Evolutionary Computation Conference},
    pages = {153--154},
    booktitle = {Advances in Neural Information Processing Systems (NIPS)},
    url = {http://eprints.lincoln.ac.uk/25741/},
    abstract = {Stochastic search algorithms are general black-box optimizers. Due to their ease
    of use and their generality, they have recently also gained a lot of attention in operations
    research, machine learning and policy search. Yet, these algorithms require
    a lot of evaluations of the objective, scale poorly with the problem dimension, are
    affected by highly noisy objective functions and may converge prematurely. To
    alleviate these problems, we introduce a new surrogate-based stochastic search
    approach. We learn simple, quadratic surrogate models of the objective function.
    As the quality of such a quadratic approximation is limited, we do not greedily exploit
    the learned models. The algorithm can be misled by an inaccurate optimum
    introduced by the surrogate. Instead, we use information theoretic constraints to
    bound the "distance" between the new and old data distribution while maximizing
    the objective function. Additionally the new method is able to sustain the exploration
    of the search distribution to avoid premature convergence. We compare our
    method with state of art black-box optimization methods on standard uni-modal
    and multi-modal optimization functions, on simulated planar robot tasks and a
    complex robot ball throwing task. The proposed method considerably outperforms
    the existing approaches.}
    }
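
    The surrogate-fitting step mentioned in the abstract above can be sketched as ridge regression of sampled returns onto quadratic features of the parameters, R(theta) ~ theta^T A theta + b^T theta + c. The sketch below shows only this regression on a hypothetical noisy objective; the information-theoretic KL and entropy constraints that form the actual MORE update are omitted.

    import numpy as np

    def quad_features(theta):
        """[1, theta, upper-triangular entries of theta theta^T] for one parameter sample."""
        outer = np.outer(theta, theta)
        iu = np.triu_indices(len(theta))
        return np.concatenate(([1.0], theta, outer[iu]))

    def fit_quadratic_surrogate(thetas, returns, ridge=1e-6):
        """Least-squares fit of a quadratic reward surrogate from (parameter, return) pairs."""
        Phi = np.array([quad_features(t) for t in thetas])
        return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(Phi.shape[1]), Phi.T @ returns)

    # Hypothetical objective: a noisy quadratic with its optimum at (1, -2).
    rng = np.random.default_rng(0)
    thetas = rng.normal(size=(200, 2))
    returns = -((thetas - np.array([1.0, -2.0])) ** 2).sum(axis=1) + 0.1 * rng.normal(size=200)
    print(fit_quadratic_surrogate(thetas, returns))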

  • A. Abdolmaleki, N. Lau, L. P. Reis, J. Peters, and G. Neumann, “Contextual policy search for linear and nonlinear generalization of a humanoid walking controller,” Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 83, iss. 3, pp. 393-408, 2016.
    [BibTeX] [Abstract] [Download PDF]

    We investigate learning of flexible robot locomotion controllers, i.e., the controllers should be applicable for multiple contexts, for example different walking speeds, various slopes of the terrain or other physical properties of the robot. In our experiments, contexts are the desired linear walking speed of the gait. Current approaches for learning control parameters of biped locomotion controllers are typically only applicable for a single context. They can be used for a particular context, for example to learn a gait with highest speed, lowest energy consumption or a combination of both. The question of our research is, how can we obtain a flexible walking controller that controls the robot (near) optimally for many different contexts? We achieve the desired flexibility of the controller by applying the recently developed contextual relative entropy policy search (REPS) method, which generalizes the robot walking controller for different contexts, where a context is described by a real valued vector. In this paper we also extend the contextual REPS algorithm to learn a non-linear policy instead of a linear policy over the contexts, which we call RBF-REPS as it uses Radial Basis Functions. In order to validate our method, we perform three simulation experiments including a walking experiment using a simulated NAO humanoid robot. The robot learns a policy to choose the controller parameters for a continuous set of forward walking speeds.

    @article{lirolem25745,
    journal = {Journal of Intelligent and Robotic Systems: Theory and Applications},
    pages = {393--408},
    month = {September},
    author = {Abbas Abdolmaleki and Nuno Lau and Luis Paulo Reis and Jan Peters and Gerhard Neumann},
    number = {3},
    year = {2016},
    title = {Contextual policy search for linear and nonlinear generalization of a humanoid walking controller},
    volume = {83},
    publisher = {Springer},
    abstract = {We investigate learning of flexible robot locomotion controllers, i.e., the controllers should be applicable for multiple contexts, for example different walking speeds, various slopes of the terrain or other physical properties of the robot. In our experiments, contexts are desired walking linear speed of the gait. Current approaches for learning control parameters of biped locomotion controllers are typically only applicable for a single context. They can be used for a particular context, for example to learn a gait with highest speed, lowest energy consumption or a combination of both. The question of our research is, how can we obtain a flexible walking controller that controls the robot (near) optimally for many different contexts? We achieve the desired flexibility of the controller by applying the recently developed contextual relative entropy policy search(REPS) method which generalizes the robot walking controller for different contexts, where a context is described by a real valued vector. In this paper we also extend the contextual REPS algorithm to learn a non-linear policy instead of a linear policy over the contexts which call it RBF-REPS as it uses Radial Basis Functions. In order to validate our method, we perform three simulation experiments including a walking experiment using a simulated NAO humanoid robot. The robot learns a policy to choose the controller parameters for a continuous set of forward walking speeds.},
    url = {http://eprints.lincoln.ac.uk/25745/}
    }
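
    The RBF parameterization referred to above, a policy mean that is linear in radial basis features of the context, can be sketched as follows. The fit below uses plain least squares on hypothetical (walking speed, controller parameter) pairs purely to show the non-linear generalization over contexts; contextual REPS itself would compute this mapping with its information-theoretically weighted update instead.

    import numpy as np

    def rbf_features(context, centers, bandwidth=0.1):
        """Radial basis activations of a scalar context plus a constant bias feature."""
        phi = np.exp(-((context - centers) ** 2) / (2 * bandwidth ** 2))
        return np.append(phi, 1.0)

    centers = np.linspace(0.0, 1.0, 10)          # RBF centers spread over the speed range

    # Hypothetical data: walking speeds and controller parameters that worked well there.
    speeds = np.linspace(0.0, 1.0, 50)
    params = np.stack([np.sin(3 * speeds), speeds ** 2], axis=1)   # two controller parameters

    Phi = np.array([rbf_features(s, centers) for s in speeds])
    W, *_ = np.linalg.lstsq(Phi, params, rcond=None)

    def policy_mean(speed):
        """Generalized controller parameters for a previously unseen walking speed."""
        return rbf_features(speed, centers) @ W

    print(policy_mean(0.37))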

  • R. Akrour, A. Abdolmaleki, H. Abdulsamad, and G. Neumann, “Model-free trajectory optimization for reinforcement learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 4342-4352.
    [BibTeX] [Abstract] [Download PDF]

    Many of the recent Trajectory Optimization algorithms alternate between local approximation of the dynamics and conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local quadratic time-dependent Q-Function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, demonstrating improved performance in comparison to related Trajectory Optimization algorithms that linearize the dynamics.

    @inproceedings{lirolem25747,
    year = {2016},
    title = {Model-free trajectory optimization for reinforcement learning},
    author = {R. Akrour and A. Abdolmaleki and H. Abdulsamad and G. Neumann},
    pages = {4342--4352},
    month = {June},
    journal = {33rd International Conference on Machine Learning, ICML 2016},
    volume = {6},
    booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
    abstract = {Many of the recent Trajectory Optimization algorithms
    alternate between local approximation
    of the dynamics and conservative policy update.
    However, linearly approximating the dynamics
    in order to derive the new policy can bias the update
    and prevent convergence to the optimal policy.
    In this article, we propose a new model-free
    algorithm that backpropagates a local quadratic
    time-dependent Q-Function, allowing the derivation
    of the policy update in closed form. Our policy
    update ensures exact KL-constraint satisfaction
    without simplifying assumptions on the system
    dynamics demonstrating improved performance
    in comparison to related Trajectory Optimization
    algorithms linearizing the dynamics.},
    url = {http://eprints.lincoln.ac.uk/25747/},
    }

  • O. Arenz, H. Abdulsamad, and G. Neumann, “Optimal control and inverse optimal control by distribution matching,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, 2016, pp. 4046-4053.
    [BibTeX] [Abstract] [Download PDF]

    Optimal control is a powerful approach to achieve optimal behavior. However, it typically requires a manual specification of a cost function which often contains several objectives, such as reaching goal positions at different time steps or energy efficiency. Manually trading off these objectives is often difficult and requires a high engineering effort. In this paper, we present a new approach to specify optimal behavior. We directly specify the desired behavior by a distribution over future states or features of the states. For example, the experimenter could choose to reach certain mean positions with given accuracy/variance at specified time steps. Our approach also unifies optimal control and inverse optimal control in one framework. Given a desired state distribution, we estimate a cost function such that the optimal controller matches the desired distribution. If the desired distribution is estimated from expert demonstrations, our approach performs inverse optimal control. We evaluate our approach on several optimal and inverse optimal control tasks on non-linear systems using incremental linearizations similar to differential dynamic programming approaches.

    @inproceedings{lirolem25737,
    booktitle = {Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on},
    volume = {2016-N},
    title = {Optimal control and inverse optimal control by distribution matching},
    year = {2016},
    author = {O. Arenz and H. Abdulsamad and G. Neumann},
    month = {October},
    pages = {4046--4053},
    journal = {IEEE International Conference on Intelligent Robots and Systems},
    abstract = {Optimal control is a powerful approach to achieve optimal behavior. However, it typically requires a manual specification of a cost function which often contains several objectives, such as reaching goal positions at different time steps or energy efficiency. Manually trading-off these objectives is often difficult and requires a high engineering effort. In this paper, we present a new approach to specify optimal behavior. We directly specify the desired behavior by a distribution over future states or features of the states. For example, the experimenter could choose to reach certain mean positions with given accuracy/variance at specified time steps. Our approach also unifies optimal control and inverse optimal control in one framework. Given a desired state distribution, we estimate a cost function such that the optimal controller matches the desired distribution. If the desired distribution is estimated from expert demonstrations, our approach performs inverse optimal control. We evaluate our approach on several optimal and inverse optimal control tasks on non-linear systems using incremental linearizations similar to differential dynamic programming approaches.},
    url = {http://eprints.lincoln.ac.uk/25737/}
    }
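
    As a rough illustration of specifying behavior by a desired state distribution rather than a hand-tuned cost, one can score a controller by the KL divergence between the Gaussian state distribution it achieves and the desired one. The sketch below only evaluates that divergence for made-up numbers; it is not the paper's optimal or inverse optimal control machinery.

    import numpy as np

    def kl_gaussians(mu0, cov0, mu1, cov1):
        """KL( N(mu0, cov0) || N(mu1, cov1) ) for full-covariance Gaussians."""
        d = len(mu0)
        cov1_inv = np.linalg.inv(cov1)
        diff = mu1 - mu0
        return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - d
                      + np.linalg.slogdet(cov1)[1] - np.linalg.slogdet(cov0)[1])

    # Hypothetical example: the desired distribution asks for x = 1 with high precision.
    desired_mu, desired_cov = np.array([1.0, 0.0]), np.diag([0.01, 0.05])
    achieved_mu, achieved_cov = np.array([0.9, 0.1]), np.diag([0.02, 0.05])
    print(kl_gaussians(achieved_mu, achieved_cov, desired_mu, desired_cov))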

  • C. Daniel, H. van Hoof, J. Peters, and G. Neumann, “Probabilistic inference for determining options in reinforcement learning,” Machine Learning, vol. 104, iss. 2-3, pp. 337-357, 2016.
    [BibTeX] [Abstract] [Download PDF]

    Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi-Markov decision process (SMDP) setting and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.

    @article{lirolem25739,
    publisher = {Springer},
    volume = {104},
    title = {Probabilistic inference for determining options in reinforcement learning},
    year = {2016},
    number = {2-3},
    author = {C. Daniel and H. van Hoof and J. Peters and G. Neumann},
    month = {September},
    pages = {337--357},
    journal = {Machine Learning},
    url = {http://eprints.lincoln.ac.uk/25739/},
    abstract = {Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.},
    }

  • C. Daniel, G. Neumann, O. Kroemer, and J. Peters, “Hierarchical relative entropy policy search,” Journal of Machine Learning Research, vol. 17, pp. 1-50, 2016.
    [BibTeX] [Abstract] [Download PDF]

    Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that are strongly structured. Such task structures can be exploited by incorporating hierarchical policies that consist of gating networks and sub-policies. However, this concept has only been partially explored for real world settings and complete methods, derived from first principles, are needed. Real world settings are challenging due to large and continuous state-action spaces that are prohibitive for exhaustive sampling methods. We define the problem of learning sub-policies in continuous state action spaces as finding a hierarchical policy that is composed of a high-level gating policy to select the low-level sub-policies for execution by the agent. In order to efficiently share experience with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables which allows for distribution of the update information between the sub-policies. We present three different variants of our algorithm, designed to be suitable for a wide variety of real world robot learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons.

    @article{lirolem25743,
    publisher = {Massachusetts Institute of Technology Press (MIT Press) / Microtome Publishing},
    volume = {17},
    title = {Hierarchical relative entropy policy search},
    year = {2016},
    author = {C. Daniel and G. Neumann and O. Kroemer and J. Peters},
    pages = {1--50},
    month = {June},
    journal = {Journal of Machine Learning Research},
    abstract = {Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that
    are strongly structured. Such task structures can be exploited by incorporating hierarchical policies
    that consist of gating networks and sub-policies. However, this concept has only been partially explored
    for real world settings and complete methods, derived from first principles, are needed. Real
    world settings are challenging due to large and continuous state-action spaces that are prohibitive
    for exhaustive sampling methods. We define the problem of learning sub-policies in continuous
    state action spaces as finding a hierarchical policy that is composed of a high-level gating policy to
    select the low-level sub-policies for execution by the agent. In order to efficiently share experience
    with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables
    which allows for distribution of the update information between the sub-policies. We present three
    different variants of our algorithm, designed to be suitable for a wide variety of real world robot
    learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several
    simulations and comparisons.},
    url = {http://eprints.lincoln.ac.uk/25743/},
    }

  • M. Ewerton, G. Maeda, G. Neumann, V. Kisner, G. Kollegger, J. Wiemeyer, and J. Peters, “Movement primitives with multiple phase parameters,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on, 2016, pp. 201-206.
    [BibTeX] [Abstract] [Download PDF]

    Movement primitives are concise movement representations that can be learned from human demonstrations, support generalization to novel situations and modulate the speed of execution of movements. The speed modulation mechanisms proposed so far are limited though, allowing only for uniform speed modulation or coupling changes in speed to local measurements of forces, torques or other quantities. Those approaches are not enough when dealing with general velocity constraints. We present a movement primitive formulation that can be used to non-uniformly adapt the speed of execution of a movement in order to satisfy a given constraint, while maintaining similarity in shape to the original trajectory. We present results using a 4-DoF robot arm in a minigolf setup.

    @inproceedings{lirolem25742,
    volume = {2016-J},
    booktitle = {Robotics and Automation (ICRA), 2016 IEEE International Conference on},
    pages = {201--206},
    month = {June},
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    title = {Movement primitives with multiple phase parameters},
    year = {2016},
    author = {M. Ewerton and G. Maeda and G. Neumann and V. Kisner and G. Kollegger and J. Wiemeyer and J. Peters},
    url = {http://eprints.lincoln.ac.uk/25742/},
    abstract = {Movement primitives are concise movement representations that can be learned from human demonstrations, support generalization to novel situations and modulate the speed of execution of movements. The speed modulation mechanisms proposed so far are limited though, allowing only for uniform speed modulation or coupling changes in speed to local measurements of forces, torques or other quantities. Those approaches are not enough when dealing with general velocity constraints. We present a movement primitive formulation that can be used to non-uniformly adapt the speed of execution of a movement in order to satisfy a given constraint, while maintaining similarity in shape to the original trajectory. We present results using a 4-DoF robot arm in a minigolf setup.},
    }

  • V. Modugno, G. Neumann, E. Rueckert, G. Oriolo, J. Peters, and S. Ivaldi, “Learning soft task priorities for control of redundant robots,” in IEEE International Conference on Robotics and Automation (ICRA) 2016, 2016.
    [BibTeX] [Abstract] [Download PDF]

    Movement primitives (MPs) provide a powerful framework for data driven movement generation that has been successfully applied for learning from demonstrations and robot reinforcement learning. In robotics we often want to solve a multitude of different, but related tasks. As the parameters of the primitives are typically high dimensional, a common practice for the generalization of movement primitives to new tasks is to adapt only a small set of control variables, also called meta parameters, of the primitive. Yet, for most MP representations, the encoding of these control variables is precoded in the representation and can not be adapted to the considered tasks. In this paper, we want to learn the encoding of task-specific control variables also from data instead of relying on fixed meta-parameter representations. We use hierarchical Bayesian models (HBMs) to estimate a low dimensional latent variable model for probabilistic movement primitives (ProMPs), which is a recent movement primitive representation. We show on two real robot datasets that ProMPs based on HBMs outperform standard ProMPs in terms of generalization and learning from a small amount of data and also allows for an intuitive analysis of the movement. We also extend our HBM by a mixture model, such that we can model different movement types in the same dataset.

    @inproceedings{lirolem25639,
    author = {V. Modugno and Gerhard Neumann and E. Rueckert and G. Oriolo and J. Peters and S. Ivaldi},
    year = {2016},
    title = {Learning soft task priorities for control of redundant robots},
    month = {May},
    booktitle = {IEEE International Conference on Robotics and Automation (ICRA) 2016},
    abstract = {Movement primitives (MPs) provide a powerful
    framework for data driven movement generation that has been
    successfully applied for learning from demonstrations and robot
    reinforcement learning. In robotics we often want to solve a
    multitude of different, but related tasks. As the parameters
    of the primitives are typically high dimensional, a common
    practice for the generalization of movement primitives to new
    tasks is to adapt only a small set of control variables, also
    called meta parameters, of the primitive. Yet, for most MP
    representations, the encoding of these control variables is precoded
    in the representation and can not be adapted to the
    considered tasks. In this paper, we want to learn the encoding of
    task-specific control variables also from data instead of relying
    on fixed meta-parameter representations. We use hierarchical
    Bayesian models (HBMs) to estimate a low dimensional latent
    variable model for probabilistic movement primitives (ProMPs),
    which is a recent movement primitive representation. We show
    on two real robot datasets that ProMPs based on HBMs
    outperform standard ProMPs in terms of generalization and
    learning from a small amount of data and also allows for an
    intuitive analysis of the movement. We also extend our HBM by
    a mixture model, such that we can model different movement
    types in the same dataset.},
    url = {http://eprints.lincoln.ac.uk/25639/},
    }

  • T. Osa, J. Peters, and G. Neumann, “Experiments with hierarchical reinforcement learning of multiple grasping policies,” in Proceedings of the International Symposium on Experimental Robotics (ISER), 2016.
    [BibTeX] [Abstract] [Download PDF]

    Robotic grasping has attracted considerable interest, but it still remains a challenging task. The data-driven approach is a promising solution to the robotic grasping problem; this approach leverages a grasp dataset and generalizes grasps for various objects. However, these methods often depend on the quality of the given datasets, which are not trivial to obtain with sufficient quality. Although reinforcement learning approaches have been recently used to achieve autonomous collection of grasp datasets, the existing algorithms are often limited to specific grasp types. In this paper, we present a framework for hierarchical reinforcement learning of grasping policies. In our framework, the lower-level hierarchy learns multiple grasp types, and the upper-level hierarchy learns a policy to select from the learned grasp types according to a point cloud of a new object. Through experiments, we validate that our approach learns grasping by constructing the grasp dataset autonomously. The experimental results show that our approach learns multiple grasping policies and generalizes the learned grasps by using local point cloud information.

    @inproceedings{lirolem26735,
    booktitle = {Proceedings of the International Symposium on Experimental Robotics (ISER)},
    month = {April},
    author = {T. Osa and J. Peters and G. Neumann},
    year = {2016},
    title = {Experiments with hierarchical reinforcement learning of multiple grasping policies},
    abstract = {Robotic grasping has attracted considerable interest, but it
    still remains a challenging task. The data-driven approach is a promising
    solution to the robotic grasping problem; this approach leverages a
    grasp dataset and generalizes grasps for various objects. However, these
    methods often depend on the quality of the given datasets, which are not
    trivial to obtain with sufficient quality. Although reinforcement learning
    approaches have been recently used to achieve autonomous collection
    of grasp datasets, the existing algorithms are often limited to specific
    grasp types. In this paper, we present a framework for hierarchical reinforcement
    learning of grasping policies. In our framework, the lowerlevel
    hierarchy learns multiple grasp types, and the upper-level hierarchy
    learns a policy to select from the learned grasp types according to a point
    cloud of a new object. Through experiments, we validate that our approach
    learns grasping by constructing the grasp dataset autonomously.
    The experimental results show that our approach learns multiple grasping
    policies and generalizes the learned grasps by using local point cloud
    information.},
    url = {http://eprints.lincoln.ac.uk/26735/},
    }

  • C. Wirth, J. Furnkranz, and G. Neumann, “Model-free preference-based reinforcement learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2222-2228.
    [BibTeX] [Abstract] [Download PDF]

    Specifying a numeric reward function for reinforcement learning typically requires a lot of hand-tuning from a human expert. In contrast, preference-based reinforcement learning (PBRL) utilizes only pairwise comparisons between trajectories as a feedback signal, which are often more intuitive to specify. Currently available approaches to PBRL for control problems with continuous state/action spaces require a known or estimated model, which is often not available and hard to learn. In this paper, we integrate preference-based estimation of the reward function into a model-free reinforcement learning (RL) algorithm, resulting in a model-free PBRL algorithm. Our new algorithm is based on Relative Entropy Policy Search (REPS), enabling us to utilize stochastic policies and to directly control the greediness of the policy update. REPS decreases exploration of the policy slowly by limiting the relative entropy of the policy update, which ensures that the algorithm is provided with a versatile set of trajectories, and consequently with informative preferences. The preference-based estimation is computed using a sample-based Bayesian method, which can also estimate the uncertainty of the utility. Additionally, we also compare to a linear solvable approximation, based on inverse RL. We show that both approaches perform favourably to the current state-of-the-art. The overall result is an algorithm that can learn non-parametric continuous action policies from a small number of preferences.

    @inproceedings{lirolem25746,
    pages = {2222--2228},
    booktitle = {Thirtieth AAAI Conference on Artificial Intelligence},
    month = {February},
    journal = {30th AAAI Conference on Artificial Intelligence, AAAI 2016},
    title = {Model-free preference-based reinforcement learning},
    year = {2016},
    author = {C. Wirth and J. Furnkranz and G. Neumann},
    abstract = {Specifying a numeric reward function for reinforcement learning typically requires a lot of hand-tuning from a human expert. In contrast, preference-based reinforcement learning (PBRL) utilizes only pairwise comparisons between trajectories as a feedback signal, which are often more intuitive to specify. Currently available approaches to PBRL for control problems with continuous state/action spaces require a known or estimated model, which is often not available and hard to learn. In this paper, we integrate preference-based estimation of the reward function into a model-free reinforcement learning (RL) algorithm, resulting in a model-free PBRL algorithm. Our new algorithm is based on Relative Entropy Policy Search (REPS), enabling us to utilize stochastic policies and to directly control the greediness of the policy update. REPS decreases exploration of the policy slowly by limiting the relative entropy of the policy update, which ensures that the algorithm is provided with a versatile set of trajectories, and consequently with informative preferences. The preference-based estimation is computed using a sample-based Bayesian method, which can also estimate the uncertainty of the utility. Additionally, we also compare to a linear solvable approximation, based on inverse RL. We show that both approaches perform favourably to the current state-of-the-art. The overall result is an algorithm that can learn non-parametric continuous action policies from a small number of preferences.},
    url = {http://eprints.lincoln.ac.uk/25746/},
    }
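
    The REPS-style update underlying the approach above can be sketched independently of the preference model: given utility estimates for sampled trajectories, compute exponential weights exp(R/eta) with the temperature eta obtained from the REPS dual, so that the reweighted distribution respects a relative-entropy bound with respect to the sampling distribution. The utilities below are random placeholders; in the paper they would come from the Bayesian preference-based estimator.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def reps_weights(utilities, epsilon=0.5):
        """Exponential weights exp(R/eta) with eta chosen by minimizing the REPS dual."""
        R = utilities - utilities.max()                    # shift for numerical stability
        def dual(eta):
            return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))
        eta = minimize_scalar(dual, bounds=(1e-3, 1e3), method="bounded").x
        w = np.exp(R / eta)
        return w / w.sum()

    rng = np.random.default_rng(1)
    utilities = rng.normal(size=100)                       # hypothetical trajectory utilities
    w = reps_weights(utilities)
    kl = np.sum(w * np.log(w * len(w)))                    # KL to the uniform sampling distribution
    print("effective sample size:", 1.0 / np.sum(w ** 2), "KL:", kl)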

2015

  • A. Abdolmaleki, N. Lau, L. P. Reis, J. Peters, and G. Neumann, “Contextual policy search for generalizing a parameterized biped walking controller,” in IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), 2015, pp. 17-22.
    [BibTeX] [Abstract] [Download PDF]

    We investigate learning of flexible robot locomotion controllers, i.e., the controllers should be applicable for multiple contexts, for example different walking speeds, various slopes of the terrain or other physical properties of the robot. In our experiments, contexts are the desired linear walking speed and the direction of the gait. Current approaches for learning control parameters of biped locomotion controllers are typically only applicable for a single context. They can be used for a particular context, for example to learn a gait with highest speed, lowest energy consumption or a combination of both. The question of our research is, how can we obtain a flexible walking controller that controls the robot (near) optimally for many different contexts? We achieve the desired flexibility of the controller by applying the recently developed contextual relative entropy policy search (REPS) method. With such a contextual policy search algorithm, we can generalize the robot walking controller for different contexts, where a context is described by a real valued vector. In this paper we also extend the contextual REPS algorithm to learn a non-linear policy instead of a linear one over the contexts. In order to validate our method, we perform a simulation experiment using a simulated NAO humanoid robot. The robot now learns a policy to choose the controller parameters for a continuous set of walking speeds and directions.

    @inproceedings{lirolem25698,
    author = {A. Abdolmaleki and N. Lau and L. P. Reis and J. Peters and G. Neumann},
    publisher = {IEEE},
    title = {Contextual policy search for generalizing a parameterized biped walking controller},
    year = {2015},
    month = {April},
    booktitle = {IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC)},
    pages = {17--22},
    url = {http://eprints.lincoln.ac.uk/25698/},
    abstract = {We investigate learning of flexible Robot locomotion controller, i.e., the controllers should be applicable for multiple contexts, for example different walking speeds, various slopes of the terrain or other physical properties of the robot. In our experiments, contexts are desired walking linear speed and the direction of the gait. Current approaches for learning control parameters of biped locomotion controllers are typically only applicable for a single context. They can be used for a particular context, for example to learn a gait with highest speed, lowest energy consumption or a combination of both. The question of our research is, how can we obtain a flexible walking controller that controls the robot (near) optimally for many different contexts? We achieve the desired flexibility of the controller by applying the recently developed contextual relative entropy policy search(REPS) method. With such a contextual policy search algorithm, we can generalize the robot walking controller for different contexts, where a context is described by a real valued vector. In this paper we also extend the contextual REPS algorithm to learn a non-linear policy instead of a linear one over the contexts. In order to validate our method, we perform a simulation experiment using a simulated NAO humanoid robot. The robot now learns a policy to choose the controller parameters for a continuous set of walking speeds and directions.}
    }

  • A. Abdolmaleki, N. Lau, L. P. Reis, and G. Neumann, “Regularized covariance estimation for weighted maximum likelihood policy search methods,” in Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on, 2015, pp. 154-159.
    [BibTeX] [Abstract] [Download PDF]

    Many episode-based (or direct) policy search algorithms maintain a multivariate Gaussian distribution as search distribution over the parameter space of some objective function. One class of algorithms, such as episodic REPS, PoWER or PI2, uses a weighted maximum likelihood estimate (WMLE) to update the mean and covariance matrix of this distribution in each iteration. However, due to the high dimensionality of covariance matrices and the limited number of samples, the WMLE is an unreliable estimator. The use of the WMLE leads to over-fitted covariance estimates, and, hence, the variance/entropy of the search distribution decreases too quickly, which may cause premature convergence. In order to alleviate this problem, the estimated covariance matrix can be regularized in different ways, for example by using a convex combination of the diagonal covariance estimate and the sample covariance estimate. In this paper, we propose a new covariance matrix regularization technique for policy search methods that uses the convex combination of the sample covariance matrix and the old covariance matrix used in the last iteration. The combination weighting is determined by specifying the desired entropy of the new search distribution. With this mechanism, the entropy of the search distribution can be gradually decreased without damage from the maximum likelihood estimate.

    @inproceedings{lirolem25748,
    year = {2015},
    title = {Regularized covariance estimation for weighted maximum likelihood policy search methods},
    author = {A. Abdolmaleki and N. Lau and L. P. Reis and G. Neumann},
    month = {November},
    pages = {154--159},
    journal = {IEEE-RAS International Conference on Humanoid Robots},
    booktitle = {Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on},
    volume = {2015-D},
    abstract = {Many episode-based (or direct) policy search algorithms, maintain a multivariate Gaussian distribution as search distribution over the parameter space of some objective function. One class of algorithms, such as episodic REPS, PoWER or PI2 uses, a weighted maximum likelihood estimate (WMLE) to update the mean and covariance matrix of this distribution in each iteration. However, due to high dimensionality of covariance matrices and limited number of samples, the WMLE is an unreliable estimator. The use of WMLE leads to over-fitted covariance estimates, and, hence the variance/entropy of the search distribution decreases too quickly, which may cause premature convergence. In order to alleviate this problem, the estimated covariance matrix can be regularized in different ways, for example by using a convex combination of the diagonal covariance estimate and the sample covariance estimate. In this paper, we propose a new covariance matrix regularization technique for policy search methods that uses the convex combination of the sample covariance matrix and the old covariance matrix used in last iteration. The combination weighting is determined by specifying the desired entropy of the new search distribution. With this mechanism, the entropy of the search distribution can be gradually decreased without damage from the maximum likelihood estimate.},
    url = {http://eprints.lincoln.ac.uk/25748/}
    }
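
    A minimal sketch of the regularization described above: interpolate between the weighted sample covariance and the previous covariance, and pick the interpolation weight by bisection so that the resulting Gaussian hits a specified entropy. The bisection assumes the old covariance has the higher entropy (so entropy grows monotonically towards it); the matrices and the entropy target below are made up.

    import numpy as np

    def gaussian_entropy(cov):
        """Differential entropy of a multivariate Gaussian with covariance cov."""
        d = cov.shape[0]
        return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

    def entropy_regularized_cov(sample_cov, old_cov, target_entropy, iters=60):
        """Convex combination of sample and old covariance matching a desired entropy."""
        lo, hi = 0.0, 1.0                      # weight on the old (broader) covariance
        for _ in range(iters):
            lam = 0.5 * (lo + hi)
            cov = (1 - lam) * sample_cov + lam * old_cov
            if gaussian_entropy(cov) < target_entropy:
                lo = lam
            else:
                hi = lam
        return (1 - lam) * sample_cov + lam * old_cov

    old_cov = np.eye(3)                                  # previous search distribution
    sample_cov = 0.05 * np.eye(3)                        # shrunken weighted ML estimate
    target = gaussian_entropy(old_cov) - 0.5             # decrease entropy by a fixed amount
    cov = entropy_regularized_cov(sample_cov, old_cov, target)
    print(gaussian_entropy(cov), target)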

  • M. Ewerton, G. Neumann, R. Lioutikov, H. B. Amor, J. Peters, and G. Maeda, “Learning multiple collaborative tasks with a mixture of interaction primitives,” in International Conference on Robotics and Automation (ICRA), 2015, pp. 1535-1542.
    [BibTeX] [Abstract] [Download PDF]

    Robots that interact with humans must learn to not only adapt to different human partners but also to new interactions. Such a form of learning can be achieved by demonstrations and imitation. A recently introduced method to learn interactions from demonstrations is the framework of Interaction Primitives. While this framework is limited to represent and generalize a single interaction pattern, in practice, interactions between a human and a robot can consist of many different patterns. To overcome this limitation this paper proposes a Mixture of Interaction Primitives to learn multiple interaction patterns from unlabeled demonstrations. Specifically the proposed method uses Gaussian Mixture Models of Interaction Primitives to model nonlinear correlations between the movements of the different agents. We validate our algorithm with two experiments involving interactive tasks between a human and a lightweight robotic arm. In the first, we compare our proposed method with conventional Interaction Primitives in a toy problem scenario where the robot and the human are not linearly correlated. In the second, we present a proof-of-concept experiment where the robot assists a human in assembling a box.

    @inproceedings{lirolem25762,
    title = {Learning multiple collaborative tasks with a mixture of interaction primitives},
    year = {2015},
    author = {Marco Ewerton and Gerhard Neumann and Rudolf Lioutikov and Heni Ben Amor and Jan Peters and Guilherme Maeda},
    number = {June},
    pages = {1535--1542},
    month = {May},
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    publisher = {IEEE},
    volume = {2015-J},
    booktitle = {International Conference on Robotics and Automation (ICRA)},
    note = {cited By 2},
    url = {http://eprints.lincoln.ac.uk/25762/},
    abstract = {Robots that interact with humans must learn to
    not only adapt to different human partners but also to new
    interactions. Such a form of learning can be achieved by
    demonstrations and imitation. A recently introduced method
    to learn interactions from demonstrations is the framework
    of Interaction Primitives. While this framework is limited
    to represent and generalize a single interaction pattern, in
    practice, interactions between a human and a robot can consist
    of many different patterns. To overcome this limitation this
    paper proposes a Mixture of Interaction Primitives to learn
    multiple interaction patterns from unlabeled demonstrations.
    Specifically the proposed method uses Gaussian Mixture Models
    of Interaction Primitives to model nonlinear correlations
    between the movements of the different agents. We validate
    our algorithm with two experiments involving interactive tasks
    between a human and a lightweight robotic arm. In the first,
    we compare our proposed method with conventional Interaction
    Primitives in a toy problem scenario where the robot and the
    human are not linearly correlated. In the second, we present a
    proof-of-concept experiment where the robot assists a human
    in assembling a box.},
    }
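
    The core inference step behind Interaction Primitives, conditioning a joint Gaussian over stacked human and robot trajectory parameters on the observed human part, can be sketched with the standard Gaussian conditioning formula. A Mixture of Interaction Primitives applies this per mixture component, weighted by each component's responsibility for the observation; the sketch below covers only the single-Gaussian case with arbitrary numbers.

    import numpy as np

    def condition_gaussian(mu, cov, obs, n_h):
        """Condition N(mu, cov) on the first n_h (human) dimensions being equal to obs."""
        mu_h, mu_r = mu[:n_h], mu[n_h:]
        S_hh, S_hr = cov[:n_h, :n_h], cov[:n_h, n_h:]
        S_rh, S_rr = cov[n_h:, :n_h], cov[n_h:, n_h:]
        gain = S_rh @ np.linalg.inv(S_hh)
        mu_post = mu_r + gain @ (obs - mu_h)
        cov_post = S_rr - gain @ S_hr
        return mu_post, cov_post

    # Hypothetical joint distribution over 2 human + 2 robot dimensions.
    rng = np.random.default_rng(2)
    A = rng.normal(size=(4, 4))
    cov = A @ A.T + 0.1 * np.eye(4)
    mu = np.zeros(4)
    mu_r, cov_r = condition_gaussian(mu, cov, obs=np.array([0.3, -0.1]), n_h=2)
    print(mu_r, cov_r)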

  • H. van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning robot in-hand manipulation with tactile features,” in International Conference on Humanoid Robots (HUMANOIDS), 2015, pp. 121-127.
    [BibTeX] [Abstract] [Download PDF]

    Dexterous manipulation enables repositioning of objects and tools within a robot's hand. When applying dexterous manipulation to unknown objects, exact object models are not available. Instead of relying on models, compliance and tactile feedback can be exploited to adapt to unknown objects. However, compliant hands and tactile sensors add complexity and are themselves difficult to model. Hence, we propose acquiring in-hand manipulation skills through reinforcement learning, which does not require analytic dynamics or kinematics models. In this paper, we show that this approach successfully acquires a tactile manipulation skill using a passively compliant hand. Additionally, we show that the learned tactile skill generalizes to novel objects.

    @inproceedings{lirolem25750,
    journal = {IEEE-RAS International Conference on Humanoid Robots},
    pages = {121--127},
    month = {November},
    author = {H. Van Hoof and T. Hermans and G. Neumann and J. Peters},
    year = {2015},
    title = {Learning robot in-hand manipulation with tactile features},
    volume = {2015-D},
    booktitle = {International Conference on Humanoid Robots (HUMANOIDS)},
    abstract = {Dexterous manipulation enables repositioning of
    objects and tools within a robot's hand. When applying dexterous
    manipulation to unknown objects, exact object models
    are not available. Instead of relying on models, compliance and
    tactile feedback can be exploited to adapt to unknown objects.
    However, compliant hands and tactile sensors add complexity
    and are themselves difficult to model. Hence, we propose acquiring
    in-hand manipulation skills through reinforcement learning,
    which does not require analytic dynamics or kinematics models.
    In this paper, we show that this approach successfully acquires
    a tactile manipulation skill using a passively compliant hand.
    Additionally, we show that the learned tactile skill generalizes
    to novel objects.},
    url = {http://eprints.lincoln.ac.uk/25750/},
    }

  • H. van Hoof, J. Peters, and G. Neumann, “Learning of non-parametric control policies with high-dimensional state features,” Journal of Machine Learning Research: Workshop and Conference Proceedings, vol. 38, pp. 995-1003, 2015.
    [BibTeX] [Abstract] [Download PDF]

    Learning complex control policies from high-dimensional sensory input is a challenge for reinforcement learning algorithms. Kernel methods that approximate value functions or transition models can address this problem. Yet, many current approaches rely on unstable greedy maximization. In this paper, we develop a policy search algorithm that integrates robust policy updates and kernel embeddings. Our method can learn non-parametric control policies for infinite horizon continuous MDPs with high-dimensional sensory representations. We show that our method outperforms related approaches, and that our algorithm can learn an underpowered swing-up task directly from high-dimensional image data.

    @article{lirolem25757,
    note = {Proceedings of the 18th International Conference
    on Artificial Intelligence and Statistics (AISTATS), 9-12 May
    2015, San Diego, CA,},
    booktitle = {18th International Conference on Artificial Intelligence and Statistics (AISTATS)},
    volume = {38},
    publisher = {MIT Press},
    journal = {Journal of Machine Learning Research: Workshop and Conference Proceedings},
    month = {May},
    pages = {995--1003},
    author = {Herke Van Hoof and Jan Peters and Gerhard Neumann},
    title = {Learning of non-parametric control policies with high-dimensional state features},
    year = {2015},
    url = {http://eprints.lincoln.ac.uk/25757/},
    abstract = {Learning complex control policies from high-dimensional sensory input is a challenge for
    reinforcement learning algorithms. Kernel methods that approximate value functions
    or transition models can address this problem. Yet, many current approaches rely on
    unstable greedy maximization. In this paper, we develop a policy search algorithm that
    integrates robust policy updates and kernel embeddings. Our method can learn non-parametric
    control policies for infinite horizon continuous MDPs with high-dimensional
    sensory representations. We show that our method outperforms related approaches, and
    that our algorithm can learn an underpowered swing-up task directly from high-dimensional
    image data.}
    }
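
    The entry above combines robust (information-theoretic) policy updates with kernel embeddings to obtain non-parametric policies. As a rough illustration of the non-parametric part only, the sketch below fits a kernel-weighted policy mean from sampled state-action pairs, where the per-sample weights stand in for exponentiated advantages; the RBF kernel, its bandwidth, and the weighting scheme are assumptions for this example, not the paper's exact algorithm.

    import numpy as np

    def rbf_kernel(X, Y, bandwidth=1.0):
        """Squared-exponential kernel between two sets of states."""
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / bandwidth ** 2)

    def nonparametric_policy_mean(query_states, sample_states, sample_actions,
                                  sample_weights, bandwidth=1.0):
        """Kernel regression of actions on states, weighted by per-sample weights
        (e.g. exponentiated advantages from an information-theoretic update)."""
        K = rbf_kernel(query_states, sample_states, bandwidth)    # (Q, N)
        W = K * sample_weights[None, :]                           # re-weight columns
        return W @ sample_actions / W.sum(axis=1, keepdims=True)  # (Q, action_dim)

    # toy usage: pendulum-like 2D state, 1D action
    rng = np.random.default_rng(0)
    S = rng.normal(size=(200, 2))
    A = np.sin(S[:, :1]) + 0.1 * rng.normal(size=(200, 1))
    w = np.exp(rng.normal(size=200))          # stand-in for advantage-based weights
    print(nonparametric_policy_mean(S[:3], S, A, w).shape)  # (3, 1)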

  • O. Kroemer, C. Daniel, G. Neumann, H. van Hoof, and J. Peters, “Towards learning hierarchical skills for multi-phase manipulation tasks,” in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 1503-1510.
    [BibTeX] [Abstract] [Download PDF]

    Most manipulation tasks can be decomposed into a sequence of phases, where the robot’s actions have different effects in each phase. The robot can perform actions to transition between phases and, thus, alter the effects of its actions, e.g. grasp an object in order to then lift it. The robot can thus reach a phase that affords the desired manipulation. In this paper, we present an approach for exploiting the phase structure of tasks in order to learn manipulation skills more efficiently. Starting with human demonstrations, the robot learns a probabilistic model of the phases and the phase transitions. The robot then employs model-based reinforcement learning to create a library of motor primitives for transitioning between phases. The learned motor primitives generalize to new situations and tasks. Given this library, the robot uses a value function approach to learn a high-level policy for sequencing the motor primitives. The proposed method was successfully evaluated on a real robot performing a bimanual grasping task.

    @inproceedings{lirolem25696,
    author = {O. Kroemer and C. Daniel and G. Neumann and H. Van Hoof and J. Peters},
    number = {June},
    year = {2015},
    title = {Towards learning hierarchical skills for multi-phase manipulation tasks},
    pages = {1503--1510},
    month = {May},
    publisher = {IEEE},
    volume = {2015-J},
    booktitle = {IEEE International Conference on Robotics and Automation (ICRA), 2015},
    url = {http://eprints.lincoln.ac.uk/25696/},
    abstract = {Most manipulation tasks can be decomposed into a sequence of phases, where the robot's actions have different effects in each phase. The robot can perform actions to transition between phases and, thus, alter the effects of its actions, e.g. grasp an object in order to then lift it. The robot can thus reach a phase that affords the desired manipulation. In this paper, we present an approach for exploiting the phase structure of tasks in order to learn manipulation skills more efficiently. Starting with human demonstrations, the robot learns a probabilistic model of the phases and the phase transitions. The robot then employs model-based reinforcement learning to create a library of motor primitives for transitioning between phases. The learned motor primitives generalize to new situations and tasks. Given this library, the robot uses a value function approach to learn a high-level policy for sequencing the motor primitives. The proposed method was successfully evaluated on a real robot performing a bimanual grasping task.},
    }

  • O. Kroemer, C. Daniel, G. Neumann, H. van Hoof, and J. Peters, “Towards learning hierarchical skills for multi-phase manipulation tasks,” in International Conference on Robotics and Automation (ICRA), 2015, pp. 1503-1510.
    [BibTeX] [Abstract] [Download PDF]

    Most manipulation tasks can be decomposed into a sequence of phases, where the robot's actions have different effects in each phase. The robot can perform actions to transition between phases and, thus, alter the effects of its actions, e.g. grasp an object in order to then lift it. The robot can thus reach a phase that affords the desired manipulation. In this paper, we present an approach for exploiting the phase structure of tasks in order to learn manipulation skills more efficiently. Starting with human demonstrations, the robot learns a probabilistic model of the phases and the phase transitions. The robot then employs model-based reinforcement learning to create a library of motor primitives for transitioning between phases. The learned motor primitives generalize to new situations and tasks. Given this library, the robot uses a value function approach to learn a high-level policy for sequencing the motor primitives. The proposed method was successfully evaluated on a real robot performing a bimanual grasping task.

    @inproceedings{lirolem25759,
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    month = {June},
    pages = {1503--1510},
    number = {June},
    author = {Oliver Kroemer and Christian Daniel and Gerhard Neumann and Herke Van Hoof and Jan Peters},
    year = {2015},
    title = {Towards learning hierarchical skills for multi-phase manipulation tasks},
    booktitle = {International Conference on Robotics and Automation (ICRA)},
    volume = {2015-J},
    publisher = {IEEE},
    abstract = {Most manipulation tasks can be decomposed into
    a sequence of phases, where the robot's actions have different
    effects in each phase. The robot can perform actions to
    transition between phases and, thus, alter the effects of its
    actions, e.g. grasp an object in order to then lift it. The robot
    can thus reach a phase that affords the desired manipulation.
    In this paper, we present an approach for exploiting the
    phase structure of tasks in order to learn manipulation skills
    more efficiently. Starting with human demonstrations, the robot
    learns a probabilistic model of the phases and the phase
    transitions. The robot then employs model-based reinforcement
    learning to create a library of motor primitives for transitioning
    between phases. The learned motor primitives generalize to new
    situations and tasks. Given this library, the robot uses a value
    function approach to learn a high-level policy for sequencing
    the motor primitives. The proposed method was successfully
    evaluated on a real robot performing a bimanual grasping task.},
    url = {http://eprints.lincoln.ac.uk/25759/}
    }

  • R. Lioutikov, G. Neumann, G. Maeda, and J. Peters, “Probabilistic segmentation applied to an assembly task,” in 15th IEEE-RAS International Conference on Humanoid Robots, 2015, pp. 533-540.
    [BibTeX] [Abstract] [Download PDF]

    Movement primitives are a well established approach for encoding and executing robot movements. While the primitives themselves have been extensively researched, the concept of movement primitive libraries has not received as much attention. Libraries of movement primitives represent the skill set of an agent and can be queried and sequenced in order to solve specific tasks. The goal of this work is to segment unlabeled demonstrations into an optimal set of skills. Our novel approach segments the demonstrations while learning a probabilistic representation of movement primitives. The method differs from current approaches by taking advantage of the often neglected, mutual dependencies between the segments contained in the demonstrations and the primitives to be encoded, thereby improving the combined quality of both segmentation and skill learning. Furthermore, our method allows incorporating domain specific insights using heuristics, which are subsequently evaluated and assessed through probabilistic inference methods. We demonstrate our method on a real robot application, where the robot segments demonstrations of a chair assembly task into a skill library. The library is subsequently used to assemble the chair in an order not present in the demonstrations.

    @inproceedings{lirolem25751,
    booktitle = {15th IEEE-RAS International Conference on Humanoid Robots},
    volume = {2015-D},
    year = {2015},
    title = {Probabilistic segmentation applied to an assembly task},
    author = {R. Lioutikov and G. Neumann and G. Maeda and J. Peters},
    month = {November},
    pages = {533--540},
    journal = {IEEE-RAS International Conference on Humanoid Robots},
    url = {http://eprints.lincoln.ac.uk/25751/},
    abstract = {Movement primitives are a well established approach
    for encoding and executing robot movements. While
    the primitives themselves have been extensively researched, the
    concept of movement primitive libraries has not received as
    much attention. Libraries of movement primitives represent
    the skill set of an agent and can be queried and sequenced in
    order to solve specific tasks. The goal of this work is to segment
    unlabeled demonstrations into an optimal set of skills. Our
    novel approach segments the demonstrations while learning
    a probabilistic representation of movement primitives. The
    method differs from current approaches by taking advantage of
    the often neglected, mutual dependencies between the segments
    contained in the demonstrations and the primitives to be encoded,
    thereby improving the combined quality of both segmentation
    and skill learning. Furthermore, our method allows
    incorporating domain specific insights using heuristics, which
    are subsequently evaluated and assessed through probabilistic
    inference methods. We demonstrate our method on a real robot
    application, where the robot segments demonstrations of a chair
    assembly task into a skill library. The library is subsequently
    used to assemble the chair in an order not present in the
    demonstrations.}
    }

  • A. Paraschos, E. Rueckert, J. Peters, and G. Neumann, “Model-free Probabilistic Movement Primitives for physical interaction,” in IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2015, pp. 2860-2866.
    [BibTeX] [Abstract] [Download PDF]

    Physical interaction in robotics is a complex problem that requires not only accurate reproduction of the kinematic trajectories but also of the forces and torques exhibited during the movement. We base our approach on Movement Primitives (MP), as MPs provide a framework for modelling complex movements and introduce useful operations on the movements, such as generalization to novel situations, time scaling, and others. Usually, MPs are trained with imitation learning, where an expert demonstrates the trajectories. However, MPs used in physical interaction either require additional learning approaches, e.g., reinforcement learning, or are based on handcrafted solutions. Our goal is to learn and generate movements for physical interaction that are learned with imitation learning, from a small set of demonstrated trajectories. The Probabilistic Movement Primitives (ProMPs) framework is a recent MP approach that introduces beneficial properties, such as combination and blending of MPs, and represents the correlations present in the movement. The ProMP framework provides a variable stiffness controller that reproduces the movement, but it requires a dynamics model of the system. Learning such a model is not a trivial task, and, therefore, we introduce the model-free ProMPs, which learn jointly the movement and the necessary actions from a few demonstrations. We derive a variable stiffness controller analytically. We further extend the ProMPs to include force and torque signals, necessary for physical interaction. We evaluate our approach in simulated and real robot tasks.

    @inproceedings{lirolem25752,
    journal = {IEEE International Conference on Intelligent Robots and Systems},
    month = {September},
    pages = {2860--2866},
    author = {A. Paraschos and E. Rueckert and J. Peters and G. Neumann},
    title = {Model-free Probabilistic Movement Primitives for physical interaction},
    year = {2015},
    booktitle = {IEEE/RSJ Conference on Intelligent Robots and Systems (IROS)},
    volume = {2015-D},
    abstract = {Physical interaction in robotics is a complex problem
    that requires not only accurate reproduction of the kinematic
    trajectories but also of the forces and torques exhibited
    during the movement. We base our approach on Movement
    Primitives (MP), as MPs provide a framework for modelling
    complex movements and introduce useful operations on the
    movements, such as generalization to novel situations, time
    scaling, and others. Usually, MPs are trained with imitation
    learning, where an expert demonstrates the trajectories. However,
    MPs used in physical interaction either require additional
    learning approaches, e.g., reinforcement learning, or are based
    on handcrafted solutions. Our goal is to learn and generate
    movements for physical interaction that are learned with imitation
    learning, from a small set of demonstrated trajectories.
    The Probabilistic Movement Primitives (ProMPs) framework
    is a recent MP approach that introduces beneficial properties,
    such as combination and blending of MPs, and represents the
    correlations present in the movement. The ProMP framework provides
    a variable stiffness controller that reproduces the movement,
    but it requires a dynamics model of the system. Learning such
    a model is not a trivial task, and, therefore, we introduce the
    model-free ProMPs, which learn jointly the movement and
    the necessary actions from a few demonstrations. We derive
    a variable stiffness controller analytically. We further extend
    the ProMPs to include force and torque signals, necessary for
    physical interaction. We evaluate our approach in simulated
    and real robot tasks.},
    url = {http://eprints.lincoln.ac.uk/25752/}
    }

  • A. Paraschos, G. Neumann, and J. Peters, “A probabilistic approach to robot trajectory generation,” in International Conference on Humanoid Robots (HUMANOIDS), 2015, pp. 477-483.
    [BibTeX] [Abstract] [Download PDF]

    Motor Primitives (MPs) are a promising approach for the data-driven acquisition as well as for the modular and re-usable generation of movements. However, a modular control architecture with MPs is only effective if the MPs support co-activation as well as continuously blending the activation from one MP to the next. In addition, we need efficient mechanisms to adapt a MP to the current situation. Common approaches to movement primitives lack such capabilities or their implementation is based on heuristics. We present a probabilistic movement primitive approach that overcomes the limitations of existing approaches. We encode a primitive as a probability distribution over trajectories. The representation as distribution has several beneficial properties. It allows encoding a time-varying variance profile. Most importantly, it allows performing new operations – a product of distributions for the co-activation of MPs, and conditioning for generalizing the MP to different desired targets. We derive a feedback controller that reproduces a given trajectory distribution in closed form. We compare our approach to the existing state of the art and present real robot results for learning from demonstration.

    @inproceedings{lirolem25755,
    journal = {IEEE-RAS International Conference on Humanoid Robots},
    pages = {477--483},
    month = {February},
    author = {A. Paraschos and Gerhard Neumann and J. Peters},
    number = {Februa},
    year = {2015},
    title = {A probabilistic approach to robot trajectory generation},
    volume = {2015-F},
    booktitle = {International Conference on Humanoid Robots (HUMANOIDS)},
    publisher = {IEEE},
    url = {http://eprints.lincoln.ac.uk/25755/},
    abstract = {Motor Primitives (MPs) are a promising approach
    for the data-driven acquisition as well as for the modular and
    re-usable generation of movements. However, a modular control
    architecture with MPs is only effective if the MPs support
    co-activation as well as continuously blending the activation
    from one MP to the next. In addition, we need efficient
    mechanisms to adapt a MP to the current situation. Common
    approaches to movement primitives lack such capabilities or
    their implementation is based on heuristics. We present a
    probabilistic movement primitive approach that overcomes the
    limitations of existing approaches. We encode a primitive as a
    probability distribution over trajectories. The representation as
    distribution has several beneficial properties. It allows encoding
    a time-varying variance profile. Most importantly, it allows
    performing new operations {--} a product of distributions for
    the co-activation of MPs, and conditioning for generalizing the MP
    to different desired targets. We derive a feedback controller
    that reproduces a given trajectory distribution in closed form.
    We compare our approach to the existing state of the art and
    present real robot results for learning from demonstration.},
    }
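
    The key operation behind the probabilistic primitives described above is Gaussian conditioning of the weight distribution, e.g. to make the trajectory pass through a desired via-point. The sketch below shows that conditioning step under assumed normalized Gaussian basis functions and an assumed observation-noise level; it is a minimal illustration, not the paper's full controller derivation.

    import numpy as np

    def gaussian_basis(t, n_basis=10, width=0.05):
        """Normalised Gaussian basis activations for a phase value t in [0, 1]."""
        centers = np.linspace(0, 1, n_basis)
        phi = np.exp(-0.5 * (t - centers) ** 2 / width)
        return phi / phi.sum()

    def condition_on_viapoint(mu_w, Sigma_w, t_star, y_star, obs_noise=1e-4):
        """Condition w ~ N(mu_w, Sigma_w) on the trajectory passing y* at t*."""
        phi = gaussian_basis(t_star)[None, :]       # (1, n_basis)
        S = phi @ Sigma_w @ phi.T + obs_noise       # innovation variance
        K = Sigma_w @ phi.T / S                     # Kalman-style gain, (n_basis, 1)
        mu_new = mu_w + (K * (y_star - phi @ mu_w)).ravel()
        Sigma_new = Sigma_w - K @ phi @ Sigma_w
        return mu_new, Sigma_new

    # usage: condition a broad prior to pass through y = 0.5 at t = 0.7
    mu_w, Sigma_w = np.zeros(10), np.eye(10)
    mu_c, Sigma_c = condition_on_viapoint(mu_w, Sigma_w, 0.7, 0.5)
    print(gaussian_basis(0.7) @ mu_c)   # close to 0.5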

  • E. Rueckert, J. Mundo, A. Paraschos, J. Peters, and G. Neumann, “Extracting low-dimensional control variables for movement primitives,” in IEEE International Conference on Robotics and Automation 2015, 2015, pp. 1511-1518.
    [BibTeX] [Abstract] [Download PDF]

    Movement primitives (MPs) provide a powerful framework for data driven movement generation that has been successfully applied for learning from demonstrations and robot reinforcement learning. In robotics we often want to solve a multitude of different, but related tasks. As the parameters of the primitives are typically high dimensional, a common practice for the generalization of movement primitives to new tasks is to adapt only a small set of control variables, also called meta parameters, of the primitive. Yet, for most MP representations, the encoding of these control variables is pre-coded in the representation and cannot be adapted to the considered tasks. In this paper, we want to learn the encoding of task-specific control variables also from data instead of relying on fixed meta-parameter representations. We use hierarchical Bayesian models (HBMs) to estimate a low dimensional latent variable model for probabilistic movement primitives (ProMPs), which is a recent movement primitive representation. We show on two real robot datasets that ProMPs based on HBMs outperform standard ProMPs in terms of generalization and learning from a small amount of data and also allow for an intuitive analysis of the movement. We also extend our HBM by a mixture model, such that we can model different movement types in the same dataset.

    @inproceedings{lirolem25760,
    volume = {2015-J},
    booktitle = {IEEE International Conference on Robotics and Automation 2015},
    pages = {1511--1518},
    month = {May},
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    title = {Extracting low-dimensional control variables for movement primitives},
    year = {2015},
    author = {E. Rueckert and J. Mundo and A. Paraschos and J. Peters and Gerhard Neumann},
    number = {June},
    url = {http://eprints.lincoln.ac.uk/25760/},
    abstract = {Movement primitives (MPs) provide a powerful framework for data driven movement generation that has been successfully applied for learning from demonstrations and robot reinforcement learning. In robotics we often want to solve a multitude of different, but related tasks. As the parameters of the primitives are typically high dimensional, a common practice for the generalization of movement primitives to new tasks is to adapt only a small set of control variables, also called meta parameters, of the primitive. Yet, for most MP representations, the encoding of these control variables is pre-coded in the representation and cannot be adapted to the considered tasks. In this paper, we want to learn the encoding of task-specific control variables also from data instead of relying on fixed meta-parameter representations. We use hierarchical Bayesian models (HBMs) to estimate a low dimensional latent variable model for probabilistic movement primitives (ProMPs), which is a recent movement primitive representation. We show on two real robot datasets that ProMPs based on HBMs outperform standard ProMPs in terms of generalization and learning from a small amount of data and also allow for an intuitive analysis of the movement. We also extend our HBM by a mixture model, such that we can model different movement types in the same dataset.}
    }

2014

  • H. B. Amor, G. Neumann, S. Kamthe, O. Kroemer, and J. Peters, “Interaction primitives for human-robot cooperation tasks,” in 2014 IEEE International Conference on Robotics and Automation (ICRA 2014), 2014, pp. 2831-2837.
    [BibTeX] [Abstract] [Download PDF]

    To engage in cooperative activities with human partners, robots have to possess basic interactive abilities and skills. However, programming such interactive skills is a challenging task, as each interaction partner can have different timing or an alternative way of executing movements. In this paper, we propose to learn interaction skills by observing how two humans engage in a similar task. To this end, we introduce a new representation called Interaction Primitives. Interaction primitives build on the framework of dynamic motor primitives (DMPs) by maintaining a distribution over the parameters of the DMP. With this distribution, we can learn the inherent correlations of cooperative activities which allow us to infer the behavior of the partner and to participate in the cooperation. We will provide algorithms for synchronizing and adapting the behavior of humans and robots during joint physical activities.

    @inproceedings{lirolem25773,
    author = {H. Ben Amor and Gerhard Neumann and S. Kamthe and O. Kroemer and J. Peters},
    year = {2014},
    title = {Interaction primitives for human-robot cooperation tasks},
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    pages = {2831--2837},
    month = {June},
    booktitle = {2014 IEEE International Conference on Robotics and Automation (ICRA 2014)},
    abstract = {To engage in cooperative activities with human
    partners, robots have to possess basic interactive abilities
    and skills. However, programming such interactive skills is a
    challenging task, as each interaction partner can have different
    timing or an alternative way of executing movements. In this
    paper, we propose to learn interaction skills by observing how
    two humans engage in a similar task. To this end, we introduce
    a new representation called Interaction Primitives. Interaction
    primitives build on the framework of dynamic motor primitives
    (DMPs) by maintaining a distribution over the parameters of
    the DMP. With this distribution, we can learn the inherent
    correlations of cooperative activities which allow us to infer the
    behavior of the partner and to participate in the cooperation.
    We will provide algorithms for synchronizing and adapting the
    behavior of humans and robots during joint physical activities.},
    url = {http://eprints.lincoln.ac.uk/25773/},
    }
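
    The interaction-primitive idea above amounts to maintaining a joint distribution over the primitive parameters of both partners and conditioning on what is observed from the human. A minimal sketch of that conditioning step is given below, assuming (purely for illustration) a joint Gaussian over stacked human and robot parameter vectors and a noisy estimate of the human parameters; the dimensions are toy values.

    import numpy as np

    def infer_robot_parameters(mu, Sigma, n_human, w_human_obs, obs_cov):
        """Condition a joint Gaussian N(mu, Sigma) over [w_human; w_robot] on a
        noisy observation of the human parameters; return p(w_robot | obs)."""
        mu_h, mu_r = mu[:n_human], mu[n_human:]
        S_hh = Sigma[:n_human, :n_human] + obs_cov
        S_rh = Sigma[n_human:, :n_human]
        gain = S_rh @ np.linalg.solve(S_hh, np.eye(n_human))
        mu_r_post = mu_r + gain @ (w_human_obs - mu_h)
        Sigma_r_post = Sigma[n_human:, n_human:] - gain @ S_rh.T
        return mu_r_post, Sigma_r_post

    # toy usage with 4 human and 4 robot parameters
    rng = np.random.default_rng(1)
    A = rng.normal(size=(8, 8))
    Sigma = A @ A.T + 1e-3 * np.eye(8)          # some correlated joint covariance
    mu = np.zeros(8)
    mu_r, Sig_r = infer_robot_parameters(mu, Sigma, 4,
                                         rng.normal(size=4), 1e-2 * np.eye(4))
    print(mu_r.shape, Sig_r.shape)              # (4,), (4, 4)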

  • A. Colome, G. Neumann, J. Peters, and C. Torras, “Dimensionality reduction for probabilistic movement primitives,” in Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on, 2014, pp. 794-800.
    [BibTeX] [Abstract] [Download PDF]

    Humans as well as humanoid robots can use a large number of degrees of freedom to solve very complex motor tasks. The high-dimensionality of these motor tasks adds difficulties to the control problem and machine learning algorithms. However, it is well known that the intrinsic dimensionality of many human movements is small in comparison to the number of employed DoFs, and hence, the movements can be represented by a small number of synergies encoding the couplings between DoFs. In this paper, we want to apply Dimensionality Reduction (DR) to a recent movement representation used in robotics, called Probabilistic Movement Primitives (ProMP). While ProMPs have been shown to have many benefits, they suffer from the high dimensionality of a robotic system, as the number of parameters of a ProMP scales quadratically with the dimensionality. We use probabilistic dimensionality reduction techniques based on expectation maximization to extract the unknown synergies from a given set of demonstrations. The ProMP representation is now estimated in the low-dimensional space of the synergies. We show that our dimensionality reduction is more efficient both for encoding a trajectory from data and for applying Reinforcement Learning with Relative Entropy Policy Search (REPS).

    @inproceedings{lirolem25756,
    volume = {2015-F},
    booktitle = {Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on},
    author = {A. Colome and G. Neumann and J. Peters and C. Torras},
    year = {2014},
    title = {Dimensionality reduction for probabilistic movement primitives},
    journal = {IEEE-RAS International Conference on Humanoid Robots},
    pages = {794--800},
    month = {November},
    url = {http://eprints.lincoln.ac.uk/25756/},
    abstract = {Humans as well as humanoid robots can use a large number of degrees of freedom to solve very complex motor tasks. The high-dimensionality of these motor tasks adds difficulties to the control problem and machine learning algorithms. However, it is well known that the intrinsic dimensionality of many human movements is small in comparison to the number of employed DoFs, and hence, the movements can be represented by a small number of synergies encoding the couplings between DoFs. In this paper, we want to apply Dimensionality Reduction (DR) to a recent movement representation used in robotics, called Probabilistic Movement Primitives (ProMP). While ProMPs have been shown to have many benefits, they suffer from the high dimensionality of a robotic system, as the number of parameters of a ProMP scales quadratically with the dimensionality. We use probabilistic dimensionality reduction techniques based on expectation maximization to extract the unknown synergies from a given set of demonstrations. The ProMP representation is now estimated in the low-dimensional space of the synergies. We show that our dimensionality reduction is more efficient both for encoding a trajectory from data and for applying Reinforcement Learning with Relative Entropy Policy Search (REPS).},
    }
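
    The dimensionality-reduction step described above can be pictured as fitting an EM-based linear-Gaussian latent model to the per-demonstration weight vectors and then representing the primitive in the resulting low-dimensional synergy space. The sketch below uses scikit-learn's FactorAnalysis as a stand-in for the coupled EM procedure of the paper; the data and dimensions are illustrative only.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n_demos, n_dofs_times_basis, n_synergies = 100, 40, 5

    # stand-in demonstration weights that actually live on a low-dim manifold
    latent_true = rng.normal(size=(n_demos, n_synergies))
    mixing = rng.normal(size=(n_synergies, n_dofs_times_basis))
    W = latent_true @ mixing + 0.05 * rng.normal(size=(n_demos, n_dofs_times_basis))

    # EM-based linear-Gaussian dimensionality reduction of the weight vectors
    fa = FactorAnalysis(n_components=n_synergies, random_state=0).fit(W)
    Z = fa.transform(W)                        # latent coordinates of each demo
    mu_z, Sigma_z = Z.mean(axis=0), np.cov(Z, rowvar=False)
    print(mu_z.shape, Sigma_z.shape)           # (5,), (5, 5): primitive in synergy space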

  • C. Dann, G. Neumann, and J. Peters, “Policy evaluation with temporal differences: a survey and comparison,” Journal of Machine Learning Research, vol. 15, pp. 809-883, 2014.
    [BibTeX] [Abstract] [Download PDF]

    Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.

    @article{lirolem25768,
    journal = {Journal of Machine Learning Research},
    pages = {809--883},
    month = {March},
    author = {C. Dann and G. Neumann and J. Peters},
    year = {2014},
    title = {Policy evaluation with temporal differences: a survey and comparison},
    volume = {15},
    publisher = {Massachusetts Institute of Technology Press (MIT Press) / Microtome Publishing},
    abstract = {Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches.
    This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.},
    url = {http://eprints.lincoln.ac.uk/25768/},
    }
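
    As a concrete example of one family of methods compared in the survey above, the sketch below implements batch LSTD(0), which solves A theta = b with A = sum_t phi_t (phi_t - gamma phi_{t+1})^T and b = sum_t phi_t r_t over a set of transitions; the feature map and the toy chain MDP are stand-ins, not data from the paper.

    import numpy as np

    def lstd(transitions, feature_fn, gamma=0.95, reg=1e-6):
        """transitions: list of (s, r, s_next); returns value-function weights."""
        n = len(feature_fn(transitions[0][0]))
        A, b = reg * np.eye(n), np.zeros(n)
        for s, r, s_next in transitions:
            phi, phi_next = feature_fn(s), feature_fn(s_next)
            A += np.outer(phi, phi - gamma * phi_next)
            b += phi * r
        return np.linalg.solve(A, b)

    # toy chain MDP with one-hot features and a fixed "move right" policy
    n_states = 5
    feat = lambda s: np.eye(n_states)[s]
    rng = np.random.default_rng(0)
    data = [(s, float(s == n_states - 1), min(s + 1, n_states - 1))
            for s in rng.integers(0, n_states, size=500)]
    theta = lstd(data, feat)
    print(theta)   # estimated values V(s) = theta[s] under this fixed policy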

  • V. Gomez, H. J. Kappen, J. Peters, and G. Neumann, “Policy search for path integral control,” in Machine Learning and Knowledge Discovery in Databases – European Conference, ECML/PKDD 2014, 2014, pp. 482-497.
    [BibTeX] [Abstract] [Download PDF]

    Path integral (PI) control defines a general class of control problems for which the optimal control computation is equivalent to an inference problem that can be solved by evaluation of a path integral over state trajectories. However, this potential is mostly unused in real-world problems because of two main limitations: first, current approaches can typically only be applied to learn open-loop controllers and second, current sampling procedures are inefficient and not scalable to high dimensional systems. We introduce the efficient Path Integral Relative-Entropy Policy Search (PI-REPS) algorithm for learning feedback policies with PI control. Our algorithm is inspired by information theoretic policy updates that are often used in policy search. We use these updates to approximate the state trajectory distribution that is known to be optimal from the PI control theory. Our approach allows for a principled treatment of different sampling distributions and can be used to estimate many types of parametric or non-parametric feedback controllers. We show that PI-REPS significantly outperforms current methods and is able to solve tasks that are out of reach for current methods.

    @inproceedings{lirolem25770,
    volume = {8724 L},
    booktitle = {Machine Learning and Knowledge Discovery in Databases - European Conference, ECML/PKDD 2014},
    publisher = {Springer},
    journal = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
    pages = {482--497},
    author = {Vicenc Gomez and Hilbert J. Kappen and Jan Peters and Gerhard Neumann},
    number = {PART 1},
    year = {2014},
    title = {Policy search for path integral control},
    url = {http://eprints.lincoln.ac.uk/25770/},
    abstract = {Path integral (PI) control defines a general class of control problems for which the optimal control computation is equivalent to an inference problem that can be solved by evaluation of a path integral over state trajectories. However, this potential is mostly unused in real-world problems because of two main limitations: first, current approaches can typically only be applied to learn open-loop controllers and second, current sampling procedures are inefficient and not scalable to high dimensional systems. We introduce the efficient Path Integral Relative-Entropy Policy Search (PI-REPS) algorithm for learning feedback policies with PI control. Our algorithm is inspired by information theoretic policy updates that are often used in policy search. We use these updates to approximate the state trajectory distribution that is known to be optimal from the PI control theory. Our approach allows for a principled treatment of different sampling distributions and can be used to estimate many types of parametric or non-parametric feedback controllers. We show that PI-REPS significantly outperforms current methods and is able to solve tasks that are out of reach for current methods.},
    }
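
    The weighted update at the heart of path-integral-style policy search can be sketched as follows: sample policy parameters, weight each sample by the exponentiated (negative, scaled) path cost, and refit a Gaussian search distribution by weighted maximum likelihood. The fixed temperature below is a simplification of the dual optimisation used in PI-REPS, and the quadratic cost is a toy stand-in, not an experiment from the paper.

    import numpy as np

    def weighted_policy_update(theta_samples, path_costs, temperature=1.0):
        """theta_samples: (N, d) sampled parameters; path_costs: (N,) path costs."""
        c = path_costs - path_costs.min()                 # numerical stabilisation
        w = np.exp(-c / temperature)
        w = w / w.sum()
        mu = w @ theta_samples                            # weighted mean
        diff = theta_samples - mu
        Sigma = (w[:, None] * diff).T @ diff + 1e-6 * np.eye(theta_samples.shape[1])
        return mu, Sigma

    # toy usage: quadratic path cost with optimum at theta = [1, -2]
    rng = np.random.default_rng(0)
    mu, Sigma = np.zeros(2), np.eye(2)
    for _ in range(20):
        thetas = rng.multivariate_normal(mu, Sigma, size=100)
        costs = ((thetas - np.array([1.0, -2.0])) ** 2).sum(axis=1)
        mu, Sigma = weighted_policy_update(thetas, costs, temperature=0.5)
    print(mu)   # approaches [1, -2]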

  • O. Kroemer, H. van Hoof, G. Neumann, and J. Peters, “Learning to predict phases of manipulation tasks as hidden states,” in 2014 IEEE International Conference on Robotics and Automation, 2014, pp. 4009-4014.
    [BibTeX] [Abstract] [Download PDF]

    Phase transitions in manipulation tasks often occur when contacts between objects are made or broken. A switch of the phase can result in the robot's actions suddenly influencing different aspects of its environment. Therefore, the boundaries between phases often correspond to constraints or subgoals of the manipulation task. In this paper, we investigate how the phases of manipulation tasks can be learned from data. The task is modeled as an autoregressive hidden Markov model, wherein the hidden phase transitions depend on the observed states. The model is learned from data using the expectation-maximization algorithm. We demonstrate the proposed method on both a pushing task and a pepper mill turning task. The proposed approach was compared to a standard autoregressive hidden Markov model. The experiments show that the learned models can accurately predict the transitions in phases during the manipulation tasks.

    @inproceedings{lirolem25769,
    title = {Learning to predict phases of manipulation tasks as hidden states},
    year = {2014},
    author = {O. Kroemer and H. Van Hoof and G. Neumann and J. Peters},
    booktitle = {2014 IEEE International Conference on Robotics and Automation},
    month = {September},
    pages = {4009--4014},
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    abstract = {Phase transitions in manipulation tasks often occur
    when contacts between objects are made or broken. A
    switch of the phase can result in the robot's actions suddenly
    influencing different aspects of its environment. Therefore, the
    boundaries between phases often correspond to constraints or
    subgoals of the manipulation task.
    In this paper, we investigate how the phases of manipulation
    tasks can be learned from data. The task is modeled as an
    autoregressive hidden Markov model, wherein the hidden phase
    transitions depend on the observed states. The model is learned
    from data using the expectation-maximization algorithm. We
    demonstrate the proposed method on both a pushing task
    and a pepper mill turning task. The proposed approach was
    compared to a standard autoregressive hidden Markov model.
    The experiments show that the learned models can accurately
    predict the transitions in phases during the manipulation tasks.},
    url = {http://eprints.lincoln.ac.uk/25769/},
    }
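
    The phase model described above can be illustrated with a small forward filter for an autoregressive HMM: each phase predicts the next state through its own linear-Gaussian dynamics, and the phase-transition probabilities may depend on the observed state. The softmax transition parameterisation, the toy dynamics, and the dimensions below are assumptions for the example, not the exact model of the paper.

    import numpy as np
    from scipy.stats import multivariate_normal

    def forward_filter(X, A, b, Sigma, trans_weights, init=None):
        """X: (T, d) observed states; A: (K, d, d); b: (K, d); Sigma: (K, d, d);
        trans_weights: (K, K, d+1) logits of p(z_t=k | z_{t-1}=j, x_{t-1}).
        Returns filtered phase probabilities alpha of shape (T, K)."""
        T, d = X.shape
        K = len(b)
        alpha = np.zeros((T, K))
        alpha[0] = np.full(K, 1.0 / K) if init is None else init
        for t in range(1, T):
            feats = np.append(X[t - 1], 1.0)                   # state features + bias
            logits = trans_weights @ feats                     # (K, K), rows: from-phase
            trans = np.exp(logits - logits.max(axis=1, keepdims=True))
            trans /= trans.sum(axis=1, keepdims=True)
            pred = alpha[t - 1] @ trans                        # predicted phase belief
            lik = np.array([multivariate_normal.pdf(X[t], A[k] @ X[t - 1] + b[k], Sigma[k])
                            for k in range(K)])                # autoregressive emissions
            alpha[t] = pred * lik
            alpha[t] /= alpha[t].sum()
        return alpha

    # toy usage: 2 phases, 1D state (e.g. free motion vs. in-contact dynamics)
    T, K, d = 50, 2, 1
    A = np.array([[[1.0]], [[0.5]]]); b = np.array([[0.1], [0.0]])
    Sigma = np.array([[[0.01]], [[0.01]]])
    W = np.zeros((K, K, d + 1))                  # state-independent transitions here
    X = np.cumsum(0.1 + 0.05 * np.random.default_rng(0).normal(size=(T, d)), axis=0)
    print(forward_filter(X, A, b, Sigma, W)[-1])  # belief over phases at the final step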

  • R. Lioutikov, A. Paraschos, J. Peters, and G. Neumann, “Generalizing movements with information-theoretic stochastic optimal control,” Journal of Aerospace Information Systems, vol. 11, iss. 9, pp. 579-595, 2014.
    [BibTeX] [Abstract] [Download PDF]

    Stochastic optimal control is typically used to plan a movement for a specific situation. Although most stochastic optimal control methods fail to generalize this movement plan to a new situation without replanning, a stochastic optimal control method is presented that allows reuse of the obtained policy in a new situation, as the policy is more robust to slight deviations from the initial movement plan. To improve the robustness of the policy, we employ information-theoretic policy updates that explicitly operate on trajectory distributions instead of single trajectories. To ensure a stable and smooth policy update, the 'distance' is limited between the trajectory distributions of the old and the new control policies. The introduced bound offers a closed-form solution for the resulting policy and extends results from recent developments in stochastic optimal control. In contrast to many standard stochastic optimal control algorithms, the current approach can directly infer the system dynamics from data points, and hence can also be used for model-based reinforcement learning. This paper represents an extension of the paper by Lioutikov et al. ("Sample-Based Information-Theoretic Stochastic Optimal Control," Proceedings of 2014 IEEE International Conference on Robotics and Automation (ICRA), IEEE, Piscataway, NJ, 2014, pp. 3896-3902). In addition to revisiting the content, an extensive theoretical comparison is presented of the approach with related work, additional aspects of the implementation are discussed, and further evaluations are introduced.

    @article{lirolem25767,
    publisher = {American Institute of Aeronautics and Astronautics},
    volume = {11},
    year = {2014},
    title = {Generalizing movements with information-theoretic stochastic optimal control},
    author = {R. Lioutikov and A. Paraschos and J. Peters and G. Neumann},
    number = {9},
    pages = {579--595},
    month = {September},
    journal = {Journal of Aerospace Information Systems},
    abstract = {Stochastic optimal control is typically used to plan a movement for a specific situation. Although most stochastic optimal control methods fail to generalize this movement plan to a new situation without replanning, a stochastic optimal control method is presented that allows reuse of the obtained policy in a new situation, as the policy is more robust to slight deviations from the initial movement plan. To improve the robustness of the policy, we employ information-theoretic policy updates that explicitly operate on trajectory distributions instead of single trajectories. To ensure a stable and smooth policy update, the 'distance' is limited between the trajectory distributions of the old and the new control policies. The introduced bound offers a closed-form solution for the resulting policy and extends results from recent developments in stochastic optimal control. In contrast to many standard stochastic optimal control algorithms, the current approach can directly infer the system dynamics from data points, and hence can also be used for model-based reinforcement learning. This paper represents an extension of the paper by Lioutikov et al. ("Sample-Based Information-Theoretic Stochastic Optimal Control," Proceedings of 2014 IEEE International Conference on Robotics and Automation (ICRA), IEEE, Piscataway, NJ, 2014, pp. 3896-3902). In addition to revisiting the content, an extensive theoretical comparison is presented of the approach with related work, additional aspects of the implementation are discussed, and further evaluations are introduced.},
    url = {http://eprints.lincoln.ac.uk/25767/}
    }

  • R. Lioutikov, A. Paraschos, J. Peters, and G. Neumann, “Sample-based information-theoretic stochastic optimal control,” in Proceedings of 2014 IEEE International Conference on Robotics and Automation, 2014, pp. 3896-3902.
    [BibTeX] [Abstract] [Download PDF]

    Many Stochastic Optimal Control (SOC) approaches rely on samples to either obtain an estimate of the value function or a linearisation of the underlying system model. However, these approaches typically neglect the fact that the accuracy of the policy update depends on the closeness of the resulting trajectory distribution to these samples. The greedy operator does not consider such a closeness constraint to the samples. Hence, the greedy operator can lead to oscillations or even instabilities in the policy updates. Such undesired behaviour is likely to result in an inferior performance of the estimated policy. We reuse inspiration from the reinforcement learning community and relax the greedy operator used in SOC with an information theoretic bound that limits the 'distance' of two subsequent trajectory distributions in a policy update. The introduced bound ensures a smooth and stable policy update. Our method is also well suited for model-based reinforcement learning, where we estimate the system dynamics model from data. As this model is likely to be inaccurate, it might be dangerous to exploit the model greedily. Instead, our bound ensures that we generate new data in the vicinity of the current data, such that we can improve our estimate of the system dynamics model. We show that our approach outperforms several state of the art approaches on challenging simulated robot control tasks.

    @inproceedings{lirolem25771,
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    pages = {3896--3902},
    month = {September},
    booktitle = {Proceedings of 2014 IEEE International Conference on Robotics and Automation},
    author = {R. Lioutikov and A. Paraschos and J. Peters and G. Neumann},
    year = {2014},
    title = {Sample-based information-theoretic stochastic optimal control},
    url = {http://eprints.lincoln.ac.uk/25771/},
    abstract = {Many Stochastic Optimal Control (SOC) approaches
    rely on samples to either obtain an estimate of the
    value function or a linearisation of the underlying system model.
    However, these approaches typically neglect the fact that the
    accuracy of the policy update depends on the closeness of the
    resulting trajectory distribution to these samples. The greedy
    operator does not consider such a closeness constraint to the
    samples. Hence, the greedy operator can lead to oscillations
    or even instabilities in the policy updates. Such undesired
    behaviour is likely to result in an inferior performance of the
    estimated policy. We reuse inspiration from the reinforcement
    learning community and relax the greedy operator used in SOC
    with an information theoretic bound that limits the 'distance' of
    two subsequent trajectory distributions in a policy update. The
    introduced bound ensures a smooth and stable policy update.
    Our method is also well suited for model-based reinforcement
    learning, where we estimate the system dynamics model from
    data. As this model is likely to be inaccurate, it might be
    dangerous to exploit the model greedily. Instead, our bound
    ensures that we generate new data in the vicinity of the current
    data, such that we can improve our estimate of the system
    dynamics model. We show that our approach outperforms
    several state of the art approaches on challenging simulated
    robot control tasks.},
    }
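
    The information-theoretic ingredient shared by this line of work is to replace the greedy operator with an update whose 'distance' to the current sample distribution is bounded. A generic episodic illustration is sketched below: sample returns are turned into weights proportional to exp(R_i / eta), with the temperature eta obtained from the standard episodic REPS dual for a KL bound epsilon. This shows only the shared reweighting idea, not the trajectory-distribution formulation of the paper.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def reps_weights(R, epsilon=0.5):
        """Return sample weights whose implied KL to the uniform sample
        distribution respects the bound epsilon (episodic REPS dual)."""
        R = R - R.max()                                   # stabilise the exponentials

        def dual(log_eta):
            eta = np.exp(log_eta)
            return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

        res = minimize_scalar(dual, bounds=(-5.0, 5.0), method="bounded")
        eta = np.exp(res.x)
        w = np.exp(R / eta)
        return w / w.sum(), eta

    # usage: higher-return samples get larger, but not degenerate, weights
    rng = np.random.default_rng(0)
    returns = rng.normal(size=200)
    w, eta = reps_weights(returns, epsilon=0.5)
    print(eta, w.max(), 1.0 / len(w))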

  • K. S. Luck, G. Neumann, E. Berger, J. Peters, and H. B. Amor, “Latent space policy search for robotics,” in IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2014, pp. 1434-1440.
    [BibTeX] [Abstract] [Download PDF]

    Learning motor skills for robots is a hard task. In particular, a high number of degrees-of-freedom in the robot can pose serious challenges to existing reinforcement learning methods, since it leads to a high-dimensional search space. However, complex robots are often intrinsically redundant systems and, therefore, can be controlled using a latent manifold of much smaller dimensionality. In this paper, we present a novel policy search method that performs efficient reinforcement learning by uncovering the low-dimensional latent space of actuator redundancies. In contrast to previous attempts at combining reinforcement learning and dimensionality reduction, our approach does not perform dimensionality reduction as a preprocessing step but naturally combines it with policy search. Our evaluations show that the new approach outperforms existing algorithms for learning motor skills with high-dimensional robots.

    @inproceedings{lirolem25772,
    pages = {1434--1440},
    booktitle = {IEEE/RSJ Conference on Intelligent Robots and Systems (IROS)},
    month = {September},
    journal = {IEEE International Conference on Intelligent Robots and Systems},
    title = {Latent space policy search for robotics},
    year = {2014},
    author = {K. S. Luck and G. Neumann and E. Berger and J. Peters and H. B. Amor},
    abstract = {Learning motor skills for robots is a hard
    task. In particular, a high number of degrees-of-freedom
    in the robot can pose serious challenges to existing reinforcement
    learning methods, since it leads to a high-dimensional
    search space. However, complex robots are
    often intrinsically redundant systems and, therefore, can
    be controlled using a latent manifold of much smaller
    dimensionality. In this paper, we present a novel policy
    search method that performs efficient reinforcement learning
    by uncovering the low-dimensional latent space of
    actuator redundancies. In contrast to previous attempts
    at combining reinforcement learning and dimensionality
    reduction, our approach does not perform dimensionality
    reduction as a preprocessing step but naturally combines
    it with policy search. Our evaluations show that the new
    approach outperforms existing algorithms for learning
    motor skills with high-dimensional robots.},
    url = {http://eprints.lincoln.ac.uk/25772/}
    }

  • G. Maeda, M. Ewerton, R. Lioutikov, H. B. Amor, J. Peters, and G. Neumann, “Learning interaction for collaborative tasks with probabilistic movement primitives,” in 14th IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2014, pp. 527-534.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a probabilistic framework based on movement primitives for robots that work in collaboration with a human coworker. Since the human coworker can execute a variety of unforeseen tasks, a requirement of our system is that the robot assistant must be able to adapt and learn new skills on-demand, without the need of an expert programmer. Thus, this paper leverages on the framework of imitation learning and its application to human-robot interaction using the concept of Interaction Primitives (IPs). We introduce the use of Probabilistic Movement Primitives (ProMPs) to devise an interaction method that both recognizes the action of a human and generates the appropriate movement primitive of the robot assistant. We evaluate our method on experiments using a lightweight arm interacting with a human partner and also using motion capture trajectories of two humans assembling a box. The advantages of ProMPs in relation to the original formulation for interaction are exposed and compared.

    @inproceedings{lirolem25764,
    author = {G. Maeda and M. Ewerton and R. Lioutikov and H. Ben Amor and J. Peters and G. Neumann},
    title = {Learning interaction for collaborative tasks with probabilistic movement primitives},
    year = {2014},
    journal = {IEEE-RAS International Conference on Humanoid Robots},
    month = {November},
    pages = {527--534},
    booktitle = {14th IEEE-RAS International Conference on Humanoid Robots (Humanoids)},
    volume = {2015-F},
    url = {http://eprints.lincoln.ac.uk/25764/},
    abstract = {This paper proposes a probabilistic framework
    based on movement primitives for robots that work in collaboration
    with a human coworker. Since the human coworker
    can execute a variety of unforeseen tasks, a requirement of our
    system is that the robot assistant must be able to adapt and
    learn new skills on-demand, without the need of an expert
    programmer. Thus, this paper leverages on the framework
    of imitation learning and its application to human-robot interaction
    using the concept of Interaction Primitives (IPs).
    We introduce the use of Probabilistic Movement Primitives
    (ProMPs) to devise an interaction method that both recognizes
    the action of a human and generates the appropriate movement
    primitive of the robot assistant. We evaluate our method
    on experiments using a lightweight arm interacting with a
    human partner and also using motion capture trajectories of
    two humans assembling a box. The advantages of ProMPs in
    relation to the original formulation for interaction are exposed
    and compared.},
    }

  • G. Neumann, C. Daniel, A. Paraschos, A. Kupcsik, and J. Peters, “Learning modular policies for robotics,” Frontiers in Computational Neuroscience, vol. 8, iss. JUN, 2014.
    [BibTeX] [Abstract] [Download PDF]

    A promising idea for scaling robot learning to more complex tasks is to use elemental behaviors as building blocks to compose more complex behavior. Ideally, such building blocks are used in combination with a learning algorithm that is able to learn to select, adapt, sequence and co-activate the building blocks. While there has been a lot of work on approaches that support one of these requirements, no learning algorithm exists that unifies all these properties in one framework. In this paper we present our work on a unified approach for learning such a modular control architecture. We introduce new policy search algorithms that are based on information-theoretic principles and are able to learn to select, adapt and sequence the building blocks. Furthermore, we developed a new representation for the individual building block that supports co-activation and principled ways for adapting the movement. Finally, we summarize our experiments for learning modular control architectures in simulation and with real robots.

    @article{lirolem25765,
    journal = {Frontiers in Computational Neuroscience},
    month = {June},
    number = {JUN},
    author = {G. Neumann and C. Daniel and A. Paraschos and A. Kupcsik and J. Peters},
    title = {Learning modular policies for robotics},
    year = {2014},
    volume = {8},
    publisher = {Frontiers Media},
    abstract = {A promising idea for scaling robot learning to more complex tasks is to use elemental behaviors as building blocks to compose more complex behavior. Ideally, such building blocks are used in combination with a learning algorithm that is able to learn to select, adapt, sequence and co-activate the building blocks. While there has been a lot of work on approaches that support one of these requirements, no learning algorithm exists that unifies all these properties in one framework. In this paper we present our work on a unified approach for learning such a modular control architecture. We introduce new policy search algorithms that are based on information-theoretic principles and are able to learn to select, adapt and sequence the building blocks. Furthermore, we developed a new representation for the individual building block that supports co-activation and principled ways for adapting the movement. Finally, we summarize our experiments for learning modular control architectures in simulation and with real robots.},
    url = {http://eprints.lincoln.ac.uk/25765/},
    }

  • E. Rueckert, M. Mindt, J. Peters, and G. Neumann, “Robust policy updates for stochastic optimal control,” in Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on, 2014, pp. 388-393.
    [BibTeX] [Abstract] [Download PDF]

    For controlling high-dimensional robots, most stochastic optimal control algorithms use approximations of the system dynamics and of the cost function (e.g., using linearizations and Taylor expansions). These approximations are typically only locally correct, which might cause instabilities in the greedy policy updates, lead to oscillations, or make the algorithms diverge. To overcome these drawbacks, we add a regularization term to the cost function that punishes large policy update steps in the trajectory optimization procedure. We applied this concept to the Approximate Inference Control method (AICO), where the resulting algorithm guarantees convergence for uninformative initial solutions without complex hand-tuning of learning rates. We evaluated our new algorithm on two simulated robotic platforms. A robot arm with five joints was used for reaching multiple targets while keeping the roll angle constant. On the humanoid robot Nao, we show how complex skills like reaching and balancing can be inferred from desired center of gravity or end effector coordinates.

    @inproceedings{lirolem25754,
    volume = {2015-F},
    booktitle = {Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on},
    year = {2014},
    title = {Robust policy updates for stochastic optimal control},
    author = {E. Rueckert and M. Mindt and J. Peters and G. Neumann},
    pages = {388--393},
    month = {November},
    journal = {IEEE-RAS International Conference on Humanoid Robots},
    url = {http://eprints.lincoln.ac.uk/25754/},
    abstract = {For controlling high-dimensional robots, most stochastic optimal control algorithms use approximations of the system dynamics and of the cost function (e.g., using linearizations and Taylor expansions). These approximations are typically only locally correct, which might cause instabilities in the greedy policy updates, lead to oscillations, or make the algorithms diverge. To overcome these drawbacks, we add a regularization term to the cost function that punishes large policy update steps in the trajectory optimization procedure. We applied this concept to the Approximate Inference Control method (AICO), where the resulting algorithm guarantees convergence for uninformative initial solutions without complex hand-tuning of learning rates. We evaluated our new algorithm on two simulated robotic platforms. A robot arm with five joints was used for reaching multiple targets while keeping the roll angle constant. On the humanoid robot Nao, we show how complex skills like reaching and balancing can be inferred from desired center of gravity or end effector coordinates.}
    }
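
    The regularisation idea above can be pictured on a local quadratic model of the cost: penalising the size of the update step turns a potentially unstable greedy step into a damped one. The quadratic model, the deliberately wrong curvature estimate, and the penalty weight in the sketch below are illustrative stand-ins, not the AICO message-passing equations.

    import numpy as np

    def damped_step(u, grad, H_model, alpha):
        """Minimise the local model grad^T du + 0.5 du^T H_model du + alpha*||du||^2."""
        return u - np.linalg.solve(H_model + 2.0 * alpha * np.eye(len(u)), grad)

    H_true = np.diag([100.0, 1.0])      # true curvature of a toy quadratic cost
    H_model = 0.1 * H_true              # badly underestimated local curvature
    u_star = np.array([1.0, -1.0])      # minimiser of the true cost

    for alpha in (0.0, 100.0):          # greedy update vs. regularised update
        u = np.zeros(2)
        for _ in range(50):
            grad = H_true @ (u - u_star)
            u = damped_step(u, grad, H_model, alpha)
        print(alpha, u)                  # alpha=0 oscillates and blows up; alpha=100 stays stable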

2013

  • C. Daniel, G. Neumann, O. Kroemer, and J. Peters, “Learning sequential motor tasks,” in IEEE International Conference on Robotics and Automation, 2013, pp. 2626-2632.
    [BibTeX] [Abstract] [Download PDF]

    Many real robot applications require the sequential use of multiple distinct motor primitives. This requirement implies the need to learn the individual primitives as well as a strategy to select the primitives sequentially. Such hierarchical learning problems are commonly either treated as one complex monolithic problem which is hard to learn, or as separate tasks learned in isolation. However, there exists a strong link between the robot's strategy and its motor primitives. Consequently, a consistent framework is needed that can learn jointly on the level of the individual primitives and the robot's strategy. We present a hierarchical learning method which improves individual motor primitives and, simultaneously, learns how to combine these motor primitives sequentially to solve complex motor tasks. We evaluate our method on the game of robot hockey, which is both difficult to learn in terms of the required motor primitives as well as its strategic elements.

    @inproceedings{lirolem25781,
    journal = {Proceedings - IEEE International Conference on Robotics and Automation},
    pages = {2626--2632},
    month = {May},
    author = {C. Daniel and G. Neumann and O. Kroemer and J. Peters},
    title = {Learning sequential motor tasks},
    year = {2013},
    note = {cited By 3},
    booktitle = {IEEE International Conference on Robotics and Automation},
    abstract = {Many real robot applications require the sequential use of multiple distinct motor primitives. This requirement implies the need to learn the individual primitives as well as a strategy to select the primitives sequentially. Such hierarchical learning problems are commonly either treated as one complex monolithic problem which is hard to learn, or as separate tasks learned in isolation. However, there exists a strong link between the robots strategy and its motor primitives. Consequently, a consistent framework is needed that can learn jointly on the level of the individual primitives and the robots strategy. We present a hierarchical learning method which improves individual motor primitives and, simultaneously, learns how to combine these motor primitives sequentially to solve complex motor tasks. We evaluate our method on the game of robot hockey, which is both difficult to learn in terms of the required motor primitives as well as its strategic elements.},
    url = {http://eprints.lincoln.ac.uk/25781/},
    }

  • M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, iss. 1-2, pp. 388-403, 2013.
    [BibTeX] [Abstract] [Download PDF]

    Policy search is a subfield in reinforcement learning which focuses on finding good parameters for a given policy parametrization. It is well suited for robotics as it can cope with high-dimensional state and action spaces, one of the main challenges in robot learning. We review recent successes of both model-free and model-based policy search in robot learning. Model-free policy search is a general approach to learn policies based on sampled trajectories. We classify model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy and present a unified view on existing algorithms. Learning a policy is often easier than learning an accurate forward model, and, hence, model-free methods are more frequently used in practice. However, for each sampled trajectory, it is necessary to interact with the robot, which can be time consuming and challenging in practice. Model-based policy search addresses this problem by first learning a simulator of the robot's dynamics from data. Subsequently, the simulator generates trajectories that are used for policy learning. For both model-free and model-based policy search methods, we review their respective properties and their applicability to robotic systems.

    @article{lirolem28029,
    publisher = {Now Publishers},
    volume = {2},
    year = {2013},
    title = {A survey on policy search for robotics},
    author = {M. P. Deisenroth and G. Neumann and J. Peters},
    number = {1-2},
    pages = {388--403},
    month = {August},
    journal = {Foundations and Trends in Robotics},
    url = {http://eprints.lincoln.ac.uk/28029/},
    abstract = {Policy search is a subfield in reinforcement learning which focuses on finding good parameters for a given policy parametrization. It is well suited for robotics as it can cope with high-dimensional state and action spaces, one of the main challenges in robot learning. We review recent successes of both model-free and model-based policy search in robot learning. Model-free policy search is a general approach to learn policies based on sampled trajectories. We classify model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy and present a unified view on existing algorithms. Learning a policy is often easier than learning an accurate forward model, and, hence, model-free methods are more frequently used in practice. However, for each sampled trajectory, it is necessary to interact with the robot, which can be time consuming and challenging in practice. Model-based policy search addresses this problem by first learning a simulator of the robot's dynamics from data. Subsequently, the simulator generates trajectories that are used for policy learning. For both model-free and model-based policy search methods, we review their respective properties and their applicability to robotic systems.}
    }
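
    As a rough illustration of the model-free scheme surveyed above (exploration, policy evaluation, policy update), the following minimal episodic loop samples policy parameters from a Gaussian search distribution, evaluates them, and performs an exponentially reward-weighted update. The toy return function, the temperature and the sample sizes are assumptions for illustration only, not a specific algorithm from the survey.

    import numpy as np

    rng = np.random.default_rng(0)

    def rollout_return(theta):
        # Stand-in for running the policy with parameters theta on the robot or simulator.
        return -np.sum((theta - np.array([0.5, -0.3])) ** 2)

    mu, sigma = np.zeros(2), np.ones(2)                      # search distribution N(mu, diag(sigma^2))
    for it in range(50):
        thetas = mu + sigma * rng.standard_normal((20, 2))   # exploration
        R = np.array([rollout_return(t) for t in thetas])    # policy evaluation
        w = np.exp((R - R.max()) / 0.5)                      # exponential weighting (temperature 0.5)
        w /= w.sum()
        mu = w @ thetas                                      # weighted maximum-likelihood policy update
        sigma = np.sqrt(w @ (thetas - mu) ** 2) + 1e-6
    print(mu)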

  • A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann, “Data-efficient generalization of robot skills with contextual policy search,” Proceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI 2013, pp. 1401-1407, 2013.
    [BibTeX] [Abstract] [Download PDF]

    In robotics, controllers make the robot solve a task within a specific context. The context can describe the objectives of the robot or physical properties of the environment and is always specified before task execution. To generalize the controller to multiple contexts, we follow a hierarchical approach for policy learning: A lower-level policy controls the robot for a given context and an upper-level policy generalizes among contexts. Current approaches for learning such upper-level policies are based on model-free policy search, which require an excessive number of interactions of the robot with its environment. More data-efficient policy search approaches are model based but, thus far, without the capability of learning hierarchical policies. We propose a new model-based policy search approach that can also learn contextual upper-level policies. Our approach is based on learning probabilistic forward models for long-term predictions. Using these predictions, we use information-theoretic insights to improve the upper-level policy. Our method achieves a substantial improvement in learning speed compared to existing methods on simulated and real robotic tasks.

    @article{lirolem25777,
    title = {Data-efficient generalization of robot skills with contextual policy search},
    year = {2013},
    author = {A. G. Kupcsik and M. P. Deisenroth and J. Peters and Gerhard Neumann},
    pages = {1401--1407},
    month = {July},
    note = {27th AAAI Conference on Artificial Intelligence, AAAI 2013; Bellevue, WA; United States; 14 - 18 July 2013},
    journal = {Proceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI 2013},
    abstract = {In robotics, controllers make the robot solve a task within a specific context. The context can describe the objectives of
    the robot or physical properties of the environment and is always specified before task execution. To generalize the controller to multiple contexts, we follow a hierarchical approach for policy learning: A lower-level policy controls the robot for a given context and an upper-level policy generalizes among contexts. Current approaches for learning such upper-level policies are based on model-free policy search, which require an excessive number of interactions of the robot with its environment.
    More data-efficient policy search approaches are model based but, thus far, without the capability of learning
    hierarchical policies. We propose a new model-based policy search approach that can also learn contextual upper-level
    policies. Our approach is based on learning probabilistic forward models for long-term predictions. Using these predictions, we use information-theoretic insights to improve the upper-level policy. Our method achieves a substantial improvement in learning speed compared to existing methods on simulated and real robotic tasks.},
    url = {http://eprints.lincoln.ac.uk/25777/}
    }
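
    The hierarchical structure described above can be pictured as an upper-level linear-Gaussian policy that maps a context s to lower-level controller parameters theta. The toy return, the exponential weighting (temperature 1.0) and all dimensions below are assumptions; the paper itself learns probabilistic forward models (a model-based approach), which this model-free sketch does not attempt to reproduce.

    import numpy as np

    rng = np.random.default_rng(1)
    dim_s, dim_theta, n = 1, 2, 30

    A = np.zeros((dim_theta, dim_s + 1))        # upper-level policy: theta ~ N(A @ [1, s], Sigma)
    Sigma = np.eye(dim_theta)

    def episode_return(s, theta):
        # Stand-in for executing the lower-level controller with parameters theta in context s.
        target = np.array([2.0 * s[0], -s[0]])
        return -np.sum((theta - target) ** 2)

    for it in range(100):
        S = rng.uniform(-1.0, 1.0, size=(n, dim_s))                       # observed contexts
        Phi = np.hstack([np.ones((n, 1)), S])                             # context features [1, s]
        Theta = Phi @ A.T + rng.multivariate_normal(np.zeros(dim_theta), Sigma, size=n)
        R = np.array([episode_return(s, th) for s, th in zip(S, Theta)])
        w = np.exp((R - R.max()) / 1.0)                                   # exponential return weighting
        w /= w.sum()
        WPhi = Phi * w[:, None]
        A = np.linalg.solve(Phi.T @ WPhi + 1e-6 * np.eye(dim_s + 1), WPhi.T @ Theta).T
        diff = Theta - Phi @ A.T
        Sigma = (diff * w[:, None]).T @ diff + 1e-6 * np.eye(dim_theta)

    print(A)   # approaches [[0, 2], [0, -1]], i.e. the mapping s -> (2 s, -s)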

  • A. Paraschos, G. Neumann, and J. Peters, “A probabilistic approach to robot trajectory generation,” in 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2013, pp. 477-483.
    [BibTeX] [Abstract] [Download PDF]

    Motor Primitives (MPs) are a promising approach for the data-driven acquisition as well as for the modular and re-usable generation of movements. However, a modular control architecture with MPs is only effective if the MPs support co-activation as well as continuously blending the activation from one MP to the next. In addition, we need efficient mechanisms to adapt a MP to the current situation. Common approaches to movement primitives lack such capabilities or their implementation is based on heuristics. We present a probabilistic movement primitive approach that overcomes the limitations of existing approaches. We encode a primitive as a probability distribution over trajectories. The representation as distribution has several beneficial properties. It allows encoding a time-varying variance profile. Most importantly, it allows performing new operations – a product of distributions for the co-activation of MPs and conditioning for generalizing the MP to different desired targets. We derive a feedback controller that reproduces a given trajectory distribution in closed form. We compare our approach to the existing state-of-the-art and present real robot results for learning from demonstration.

    @inproceedings{lirolem25693,
    month = {October},
    pages = {477--483},
    title = {A probabilistic approach to robot trajectory generation},
    year = {2013},
    number = {Februa},
    author = {A. Paraschos and G. Neumann and J. Peters},
    booktitle = {13th IEEE-RAS International Conference on Humanoid Robots (Humanoids)},
    volume = {2015-F},
    publisher = {IEEE},
    url = {http://eprints.lincoln.ac.uk/25693/},
    abstract = {Motor Primitives (MPs) are a promising approach for the data-driven acquisition as well as for the modular and re-usable generation of movements. However, a modular control architecture with MPs is only effective if the MPs support co-activation as well as continuously blending the activation from one MP to the next. In addition, we need efficient mechanisms to adapt a MP to the current situation. Common approaches to movement primitives lack such capabilities or their implementation is based on heuristics. We present a probabilistic movement primitive approach that overcomes the limitations of existing approaches. We encode a primitive as a probability distribution over trajectories. The representation as distribution has several beneficial properties. It allows encoding a time-varying variance profile. Most importantly, it allows performing new operations - a product of distributions for the co-activation of MPs conditioning for generalizing the MP to different desired targets. We derive a feedback controller that reproduces a given trajectory distribution in closed form. We compare our approach to the existing state-of-the art and present real robot results for learning from demonstration.}
    }
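
    The central probabilistic operation mentioned above, conditioning a trajectory distribution on a desired target, is plain Gaussian conditioning when trajectories are represented as y_t = Phi_t w with a Gaussian over the weights w. A minimal sketch with synthetic demonstrations and invented feature settings:

    import numpy as np

    rng = np.random.default_rng(0)
    T, K = 100, 10
    t = np.linspace(0, 1, T)
    centers = np.linspace(0, 1, K)
    Phi = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.07) ** 2)   # (T, K) RBF features

    # Gaussian over weights fitted from (synthetic) demonstrations y = Phi w.
    demos = np.array([np.sin(np.pi * t) + 0.05 * rng.standard_normal(T) for _ in range(20)])
    W = np.linalg.lstsq(Phi, demos.T, rcond=None)[0].T                   # one weight vector per demo
    mu_w, Sigma_w = W.mean(axis=0), np.cov(W.T) + 1e-6 * np.eye(K)

    # Condition on passing through y_star = 0.2 at t = 0.5 (observation noise sig_y).
    phi = Phi[T // 2]
    y_star, sig_y = 0.2, 1e-4
    gain = Sigma_w @ phi / (phi @ Sigma_w @ phi + sig_y)                 # Kalman-style gain
    mu_w_cond = mu_w + gain * (y_star - phi @ mu_w)
    Sigma_w_cond = Sigma_w - np.outer(gain, phi @ Sigma_w)

    print((Phi @ mu_w_cond)[T // 2])   # mean trajectory now passes (approximately) through 0.2

    Because the distribution over w also encodes correlations across time steps, conditioning at a single time step adapts the whole trajectory, which is what enables generalization to new targets.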

  • A. Paraschos, C. Daniel, J. Peters, and G. Neumann, “Probabilistic movement primitives,” in Advances in Neural Information Processing Systems, (NIPS), 2013.
    [BibTeX] [Abstract] [Download PDF]

    Movement Primitives (MP) are a well-established approach for representing modular and re-usable robot movement generators. Many state-of-the-art robot learning successes are based on MPs, due to their compact representation of the inherently continuous and high dimensional robot movements. A major goal in robot learning is to combine multiple MPs as building blocks in a modular control architecture to solve complex tasks. To this effect, a MP representation has to allow for blending between motions, adapting to altered task variables, and co-activating multiple MPs in parallel. We present a probabilistic formulation of the MP concept that maintains a distribution over trajectories. Our probabilistic approach allows for the derivation of new operations which are essential for implementing all aforementioned properties in one framework. In order to use such a trajectory distribution for robot movement control, we analytically derive a stochastic feedback controller which reproduces the given trajectory distribution. We evaluate and compare our approach to existing methods on several simulated as well as real robot scenarios.

    @inproceedings{lirolem25785,
    journal = {Advances in Neural Information Processing Systems},
    month = {December},
    booktitle = {Advances in Neural Information Processing Systems, (NIPS)},
    author = {A. Paraschos and C. Daniel and J. Peters and G. Neumann},
    year = {2013},
    title = {Probabilistic movement primitives},
    abstract = {Movement Primitives (MP) are a well-established approach for representing modular
    and re-usable robot movement generators. Many state-of-the-art robot learning
    successes are based MPs, due to their compact representation of the inherently
    continuous and high dimensional robot movements. A major goal in robot learning
    is to combine multiple MPs as building blocks in a modular control architecture
    to solve complex tasks. To this effect, a MP representation has to allow for
    blending between motions, adapting to altered task variables, and co-activating
    multiple MPs in parallel. We present a probabilistic formulation of the MP concept
    that maintains a distribution over trajectories. Our probabilistic approach
    allows for the derivation of new operations which are essential for implementing
    all aforementioned properties in one framework. In order to use such a trajectory
    distribution for robot movement control, we analytically derive a stochastic feedback
    controller which reproduces the given trajectory distribution. We evaluate
    and compare our approach to existing methods on several simulated as well as
    real robot scenarios.},
    url = {http://eprints.lincoln.ac.uk/25785/},
    }
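
    Co-activating two primitives, as described above, corresponds to combining their trajectory distributions; for Gaussians this product is again Gaussian with a precision-weighted mean. The activation weights, means and covariances in this sketch are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(2)
    K = 10

    # Two primitives as Gaussians over a shared weight space (parameters invented).
    mu1, S1 = rng.standard_normal(K), 0.5 * np.eye(K)
    mu2, S2 = rng.standard_normal(K), 2.0 * np.eye(K)

    # Co-activation with activations a1, a2: product of tempered Gaussians,
    # i.e. a precision-weighted combination of the two primitives.
    a1, a2 = 0.7, 0.3
    P = a1 * np.linalg.inv(S1) + a2 * np.linalg.inv(S2)
    Sigma_co = np.linalg.inv(P)
    mu_co = Sigma_co @ (a1 * np.linalg.solve(S1, mu1) + a2 * np.linalg.solve(S2, mu2))

    # The result is pulled toward the primitive with higher activation and lower variance.
    print(mu_co[:3])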

  • E. A. Rueckert, G. Neumann, M. Toussaint, and W. Maass, “Learned graphical models for probabilistic planning provide a new class of movement primitives,” Frontiers in Computational Neuroscience, vol. 6, 2013.
    [BibTeX] [Abstract] [Download PDF]

    Biological movement generation combines three interesting aspects: its modular organization in movement primitives (MPs), its characteristics of stochastic optimality under perturbations, and its efficiency in terms of learning. A common approach to motor skill learning is to endow the primitives with dynamical systems. Here, the parameters of the primitive indirectly define the shape of a reference trajectory. We propose an alternative MP representation based on probabilistic inference in learned graphical models with new and interesting properties that complies with salient features of biological movement control. Instead of endowing the primitives with dynamical systems, we propose to endow MPs with an intrinsic probabilistic planning system, integrating the power of stochastic optimal control (SOC) methods within a MP. The parameterization of the primitive is a graphical model that represents the dynamics and intrinsic cost function such that inference in this graphical model yields the control policy. We parameterize the intrinsic cost function using task-relevant features, such as the importance of passing through certain via-points. The system dynamics as well as intrinsic cost function parameters are learned in a reinforcement learning (RL) setting. We evaluate our approach on a complex 4-link balancing task. Our experiments show that our movement representation facilitates learning significantly and leads to better generalization to new task settings without re-learning.

    @article{lirolem25789,
    month = {January},
    volume = {6},
    journal = {Frontiers in Computational Neuroscience},
    publisher = {Frontiers Media},
    title = {Learned graphical models for probabilistic planning provide a new class of movement primitives},
    year = {2013},
    author = {Elmar A. Rueckert and Gerhard Neumann and Marc Toussaint and Wolfgang Maass},
    abstract = {Biological movement generation combines three interesting aspects: its modular organization in movement primitives (MPs), its characteristics of stochastic optimality under perturbations, and its efficiency in terms of learning. A common approach to motor skill learning is to endow the primitives with dynamical systems. Here, the parameters of the primitive indirectly define the shape of a reference trajectory. We propose an alternative MP representation based on probabilistic inference in learned graphical models with new and interesting properties that complies with salient features of biological movement control. Instead of endowing the primitives with dynamical systems, we propose to endow MPs with an intrinsic probabilistic planning system, integrating the power of stochastic optimal control (SOC) methods within a MP. The parameterization of the primitive is a graphical model that represents the dynamics and intrinsic cost function such that inference in this graphical model yields the control policy. We parameterize the intrinsic cost function using task-relevant features, such as the importance of passing through certain via-points. The system dynamics as well as intrinsic cost function parameters are learned in a reinforcement learning (RL) setting. We evaluate our approach on a complex 4-link balancing task. Our experiments show that our movement representation facilitates learning significantly and leads to better generalization to new task settings without re-learning.},
    url = {http://eprints.lincoln.ac.uk/25789/},
    }

2012

  • H. B. Amor, O. Kroemer, U. Hillenbrand, G. Neumann, and J. Peters, “Generalization of human grasping for multi-fingered robot hands,” in International Conference on Robot Systems (IROS), 2012, pp. 2043-2050.
    [BibTeX] [Abstract] [Download PDF]

    Multi-fingered robot grasping is a challenging problem that is difficult to tackle using hand-coded programs. In this paper we present an imitation learning approach for learning and generalizing grasping skills based on human demonstrations. To this end, we split the task of synthesizing a grasping motion into three parts: (1) learning efficient grasp representations from human demonstrations, (2) warping contact points onto new objects, and (3) optimizing and executing the reach-and-grasp movements. We learn low-dimensional latent grasp spaces for different grasp types, which form the basis for a novel extension to dynamic motor primitives. These latent-space dynamic motor primitives are used to synthesize entire reach-and-grasp movements. We evaluated our method on a real humanoid robot. The results of the experiment demonstrate the robustness and versatility of our approach.

    @inproceedings{lirolem25788,
    title = {Generalization of human grasping for multi-fingered robot hands},
    year = {2012},
    author = {Heni Ben Amor and Oliver Kroemer and Ulrich Hillenbrand and Gerhard Neumann and Jan Peters},
    booktitle = {International Conference on Robot Systems (IROS)},
    month = {December},
    pages = {2043--2050},
    journal = {IEEE International Conference on Intelligent Robots and Systems},
    abstract = {Multi-fingered robot grasping is a challenging
    problem that is difficult to tackle using hand-coded programs.
    In this paper we present an imitation learning approach for
    learning and generalizing grasping skills based on human
    demonstrations. To this end, we split the task of synthesizing
    a grasping motion into three parts: (1) learning efficient grasp
    representations from human demonstrations, (2) warping contact
    points onto new objects, and (3) optimizing and executing
    the reach-and-grasp movements. We learn low-dimensional
    latent grasp spaces for different grasp types, which form the
    basis for a novel extension to dynamic motor primitives. These
    latent-space dynamic motor primitives are used to synthesize
    entire reach-and-grasp movements. We evaluated our method
    on a real humanoid robot. The results of the experiment
    demonstrate the robustness and versatility of our approach.},
    url = {http://eprints.lincoln.ac.uk/25788/},
    }
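
    Step (1) above, learning a low-dimensional latent grasp space from demonstrated hand postures, can be pictured with a simple PCA stand-in (the paper's learned latent spaces and the latent-space dynamic motor primitives are more involved); the synthetic 20-DoF postures and the 2-D latent dimension below are assumptions:

    import numpy as np

    rng = np.random.default_rng(3)

    # Synthetic demonstrations: 50 postures of a 20-DoF hand that vary along
    # only two latent directions (a stand-in for recorded human grasps).
    Z_true = rng.standard_normal((50, 2))
    mixing = rng.standard_normal((2, 20))
    postures = Z_true @ mixing + 0.01 * rng.standard_normal((50, 20))

    # Learn a 2-D latent grasp space (PCA via SVD of the centred data).
    mean = postures.mean(axis=0)
    _, _, Vt = np.linalg.svd(postures - mean, full_matrices=False)
    basis = Vt[:2]                                   # (2, 20) latent directions

    # Synthesize a new grasp posture from latent coordinates z.
    z = np.array([1.0, -0.5])
    new_posture = mean + z @ basis                   # (20,) joint angles
    print(new_posture[:5])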

  • C. Daniel, G. Neumann, and J. Peters, “Learning concurrent motor skills in versatile solution spaces,” in International Conference on Intelligent Robot Systems (IROS), 2012, pp. 3591-3597.
    [BibTeX] [Abstract] [Download PDF]

    Future robots need to autonomously acquire motor skills in order to reduce their reliance on human programming. Many motor skill learning methods concentrate on learning a single solution for a given task. However, discarding information about additional solutions during learning unnecessarily limits autonomy. Such favoring of single solutions often requires re-learning of motor skills when the task, the environment or the robot's body changes in a way that renders the learned solution infeasible. Future robots need to be able to adapt to such changes and, ideally, have a large repertoire of movements to cope with such problems. In contrast to current methods, our approach simultaneously learns multiple distinct solutions for the same task, such that a partial degeneration of this solution space does not prevent the successful completion of the task. In this paper, we present a complete framework that is capable of learning different solution strategies for a real robot Tetherball task.

    @inproceedings{lirolem25787,
    title = {Learning concurrent motor skills in versatile solution spaces},
    year = {2012},
    author = {C. Daniel and G. Neumann and J. Peters},
    booktitle = {International Conference on Intelligent Robot Systems (IROS)},
    month = {October},
    pages = {3591--3597},
    journal = {IEEE International Conference on Intelligent Robots and Systems},
    abstract = {Future robots need to autonomously acquire motor
    skills in order to reduce their reliance on human programming.
    Many motor skill learning methods concentrate
    on learning a single solution for a given task. However, discarding
    information about additional solutions during learning
    unnecessarily limits autonomy. Such favoring of single solutions
    often requires re-learning of motor skills when the task, the
    environment or the robot's body changes in a way that renders
    the learned solution infeasible. Future robots need to be able to
    adapt to such changes and, ideally, have a large repertoire of
    movements to cope with such problems. In contrast to current
    methods, our approach simultaneously learns multiple distinct
    solutions for the same task, such that a partial degeneration of
    this solution space does not prevent the successful completion
    of the task. In this paper, we present a complete framework
    that is capable of learning different solution strategies for a
    real robot Tetherball task.},
    url = {http://eprints.lincoln.ac.uk/25787/},
    }

  • C. Daniel, G. Neumann, and J. Peters, “Hierarchical relative entropy policy search,” in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, 2012, pp. 273-281.
    [BibTeX] [Abstract] [Download PDF]

    Many real-world problems are inherently hierarchically structured. The use of this structure in an agent's policy may well be the key to improved scalability and higher performance. However, such hierarchical structures cannot be exploited by current policy search algorithms. We will concentrate on a basic, but highly relevant hierarchy – the 'mixed option' policy. Here, a gating network first decides which of the options to execute and, subsequently, the option-policy determines the action. In this paper, we reformulate learning a hierarchical policy as a latent variable estimation problem and subsequently extend the Relative Entropy Policy Search (REPS) to the latent variable case. We show that our Hierarchical REPS can learn versatile solutions while also showing an increased performance in terms of learning speed and quality of the found policy in comparison to the nonhierarchical approach.

    @inproceedings{lirolem25791,
    pages = {273--281},
    month = {April},
    title = {Hierarchical relative entropy policy search},
    year = {2012},
    author = {Christian Daniel and Gerhard Neumann and Jan Peters},
    volume = {22},
    booktitle = {Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012},
    publisher = {MIT Press},
    abstract = {Many real-world problems are inherently hierarchically
    structured. The use of this structure
    in an agent's policy may well be the
    key to improved scalability and higher performance.
    However, such hierarchical structures
    cannot be exploited by current policy
    search algorithms. We will concentrate on
    a basic, but highly relevant hierarchy {--} the
    'mixed option' policy. Here, a gating network
    first decides which of the options to execute
    and, subsequently, the option-policy determines
    the action.
    In this paper, we reformulate learning a hierarchical
    policy as a latent variable estimation
    problem and subsequently extend the
    Relative Entropy Policy Search (REPS) to
    the latent variable case. We show that our
    Hierarchical REPS can learn versatile solutions
    while also showing an increased performance
    in terms of learning speed and quality
    of the found policy in comparison to the nonhierarchical
    approach.},
    url = {http://eprints.lincoln.ac.uk/25791/}
    }
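
    The 'mixed option' policy discussed above factorizes into a gating network that selects an option and an option policy that selects the action. The sketch below only shows this sampling structure with a softmax gating and linear-Gaussian option policies; the parameters are random stand-ins, and the latent-variable REPS update itself is not reproduced here.

    import numpy as np

    rng = np.random.default_rng(4)
    n_options, dim_s, dim_a = 3, 2, 1

    W_gate = rng.standard_normal((n_options, dim_s))         # gating network parameters
    W_opt = rng.standard_normal((n_options, dim_a, dim_s))   # per-option linear policies
    sigma_a = 0.1

    def act(s):
        # Gating: softmax over option scores, then sample an option o.
        logits = W_gate @ s
        p = np.exp(logits - logits.max())
        p /= p.sum()
        o = rng.choice(n_options, p=p)
        # Option policy: Gaussian action around the chosen option's (linear) mean.
        a = W_opt[o] @ s + sigma_a * rng.standard_normal(dim_a)
        return o, a

    print(act(np.array([0.3, -1.2])))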

2011

  • H. Hauser, G. Neumann, A. J. Ijspeert, and W. Maass, “Biologically inspired kinematic synergies enable linear balance control of a humanoid robot,” Biological Cybernetics, vol. 104, iss. 4-5, pp. 235-249, 2011.
    [BibTeX] [Abstract] [Download PDF]

    Despite many efforts, balance control of humanoid robots in the presence of unforeseen external or internal forces has remained an unsolved problem. The difficulty of this problem is a consequence of the high dimensionality of the action space of a humanoid robot, due to its large number of degrees of freedom (joints), and of non-linearities in its kinematic chains. Biped biological organisms face similar difficulties, but have nevertheless solved this problem. Experimental data reveal that many biological organisms reduce the high dimensionality of their action space by generating movements through linear superposition of a rather small number of stereotypical combinations of simultaneous movements of many joints, to which we refer as kinematic synergies in this paper. We show that by constructing two suitable non-linear kinematic synergies for the lower part of the body of a humanoid robot, balance control can in fact be reduced to a linear control problem, at least in the case of relatively slow movements. We demonstrate for a variety of tasks that the humanoid robot HOAP-2 acquires through this approach the capability to balance dynamically against unforeseen disturbances that may arise from external forces or from manipulating unknown loads.

    @article{lirolem25794,
    volume = {104},
    publisher = {Springer},
    month = {May},
    pages = {235--249},
    journal = {Biological Cybernetics},
    year = {2011},
    title = {Biologically inspired kinematic synergies enable linear balance control of a humanoid robot},
    number = {4-5},
    author = {Helmut Hauser and Gerhard Neumann and Auke J. Ijspeert and Wolfgang Maass},
    url = {http://eprints.lincoln.ac.uk/25794/},
    abstract = {Despite many efforts, balance control of humanoid robots in the presence of unforeseen external or internal forces has remained an unsolved problem. The difficulty of this problem is a consequence of the high dimensionality of the action space of a humanoid robot, due to its large number of degrees of freedom (joints), and of non-linearities in its kinematic chains. Biped biological organisms face similar difficulties, but have nevertheless solved this problem. Experimental data reveal that many biological organisms reduce the high dimensionality of their action space by generating movements through linear superposition of a rather small number of stereotypical combinations of simultaneous movements of many joints, to which we refer as kinematic synergies in this paper. We show that by constructing two suitable non-linear kinematic synergies for the lower part of the body of a humanoid robot, balance control can in fact be reduced to a linear control problem, at least in the case of relatively slow movements. We demonstrate for a variety of tasks that the humanoid robot HOAP-2 acquires through this approach the capability to balance dynamically against unforeseen disturbances that may arise from external forces or from manipulating unknown loads.}
    }
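
    The core idea above, generating whole-body postures as a linear superposition of a few kinematic synergies so that control only acts on the low-dimensional synergy amplitudes, can be sketched as follows; the synergy vectors, gains and error values are invented placeholders rather than the synergies constructed in the paper:

    import numpy as np

    rng = np.random.default_rng(5)
    n_joints, n_syn = 20, 2

    q0 = np.zeros(n_joints)                        # nominal posture
    S = rng.standard_normal((n_joints, n_syn))     # synergy vectors (random stand-ins)

    # Balance error expressed in the low-dimensional synergy coordinates (toy values),
    # e.g. sagittal and lateral center-of-mass offsets and their derivatives.
    e = np.array([0.05, -0.02])
    de = np.array([-0.01, 0.00])

    # Linear (PD) control acts only on the n_syn synergy amplitudes, not on all joints.
    Kp, Kd = 2.0, 0.5
    u = -(Kp * e + Kd * de)

    q = q0 + S @ u                                 # joint commands by linear superposition
    print(q.shape)                                 # (20,)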

  • G. Neumann, “Variational inference for policy search in changing situations,” in 28th International Conference on Machine Learning (ICML-11), 2011, pp. 817-824.
    [BibTeX] [Abstract] [Download PDF]

    Many policy search algorithms minimize the Kullback-Leibler (KL) divergence to a certain target distribution in order to fit their policy. The commonly used KL-divergence forces the resulting policy to be 'reward-attracted'. The policy tries to reproduce all positively rewarded experience while negative experience is neglected. However, the KL-divergence is not symmetric and we can also minimize the reversed KL-divergence, which is typically used in variational inference. The policy now becomes 'cost-averse'. It tries to avoid reproducing any negatively-rewarded experience while maximizing exploration. Due to this 'cost-averseness' of the policy, Variational Inference for Policy Search (VIP) has several interesting properties. It requires no kernel bandwidth nor exploration rate; such settings are determined automatically by the inference. The algorithm meets the performance of state-of-the-art methods while being applicable to simultaneously learning in multiple situations. We concentrate on using VIP for policy search in robotics. We apply our algorithm to learn dynamic counterbalancing of different kinds of pushes with human-like 2-link and 4-link robots.

    @inproceedings{lirolem25793,
    journal = {Proceedings of the 28th International Conference on Machine Learning, ICML 2011},
    pages = {817--824},
    month = {June},
    booktitle = {28th International Conference on Machine Learning (ICML-11)},
    author = {Gerhard Neumann},
    year = {2011},
    title = {Variational inference for policy search in changing situations},
    abstract = {Many policy search algorithms minimize the Kullback-Leibler (KL) divergence to a certain target distribution in order to fit their policy. The commonly used KL-divergence forces the resulting policy to be 'reward-attracted'. The policy tries to reproduce all positively rewarded experience while negative experience is neglected. However, the KL-divergence is not symmetric and we can also minimize the reversed KL-divergence, which is typically used in variational inference. The policy now becomes 'cost-averse'. It tries to avoid reproducing any negatively-rewarded experience while maximizing exploration. Due to this 'cost-averseness' of the policy, Variational Inference for Policy Search (VIP) has several interesting properties. It requires no kernel bandwidth nor exploration rate; such settings are determined automatically by the inference. The algorithm meets the performance of state-of-the-art methods while being applicable to simultaneously learning in multiple situations. We concentrate on using VIP for policy search in robotics. We apply our algorithm to learn dynamic counterbalancing of different kinds of pushes with human-like 2-link and 4-link robots.},
    url = {http://eprints.lincoln.ac.uk/25793/},
    }
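
    The 'reward-attracted' versus 'cost-averse' distinction above corresponds to the direction of the KL divergence that is minimized when fitting a Gaussian to a reward-induced target distribution. The small grid-based demo below contrasts the two projections on a bimodal 1-D target; it is only meant to illustrate the asymmetry, not the VIP algorithm itself, and the reward function and grids are assumptions.

    import numpy as np

    x = np.linspace(-6.0, 6.0, 1201)
    dx = x[1] - x[0]

    # Bimodal reward-induced target p(x) proportional to exp(reward(x)), modes at +/- 2.
    reward = -2.0 * np.minimum((x - 2.0) ** 2, (x + 2.0) ** 2)
    p = np.exp(reward - reward.max())
    p /= p.sum() * dx

    def gauss(mu, sig):
        return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (np.sqrt(2.0 * np.pi) * sig)

    # Moment projection, argmin KL(p || q): matches mean and variance of p
    # ("reward-attracted", spreads over both modes).
    mu_m = np.sum(x * p) * dx
    sig_m = np.sqrt(np.sum((x - mu_m) ** 2 * p) * dx)

    # Information projection, argmin KL(q || p), by grid search
    # ("cost-averse", locks onto a single mode and avoids low-reward regions).
    best = (np.inf, None, None)
    for mu in np.linspace(-4.0, 4.0, 81):
        for sig in np.linspace(0.1, 4.0, 40):
            q = gauss(mu, sig)
            kl = np.sum(q * (np.log(q + 1e-300) - np.log(p))) * dx
            if kl < best[0]:
                best = (kl, mu, sig)

    print("moment projection:      mu=%.2f sigma=%.2f" % (mu_m, sig_m))
    print("information projection: mu=%.2f sigma=%.2f" % (best[1], best[2]))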

2009

  • G. Neumann, W. Maass, and J. Peters, “Learning complex motions by sequencing simpler motion templates,” in 26th Annual International Conference on Machine Learning (ICML 2009), 2009, pp. 753-760.
    [BibTeX] [Abstract] [Download PDF]

    Abstraction of complex, longer motor tasks into simpler elemental movements enables humans and animals to exhibit motor skills which have not yet been matched by robots. Humans intuitively decompose complex motions into smaller, simpler segments. For example when describing simple movements like drawing a triangle with a pen, we can easily name the basic steps of this movement. Surprisingly, such abstractions have rarely been used in artificial motor skill learning algorithms. These algorithms typically choose a new action (such as a torque or a force) at a very fast time-scale. As a result, both policy and temporal credit assignment problem become unnecessarily complex – often beyond the reach of current machine learning methods. We introduce a new framework for temporal abstractions in reinforcement learning (RL), i.e. RL with motion templates. We present a new algorithm for this framework which can learn high-quality policies by making only few abstract decisions.

    @inproceedings{lirolem25795,
    title = {Learning complex motions by sequencing simpler motion templates},
    year = {2009},
    author = {Gerhard Neumann and W. Maass and J. Peters},
    pages = {753--760},
    booktitle = {26th Annual International Conference on Machine Learning (ICML 2009)},
    month = {June},
    journal = {Proceedings of the 26th International Conference On Machine Learning, ICML 2009},
    url = {http://eprints.lincoln.ac.uk/25795/},
    abstract = {Abstraction of complex, longer motor tasks into simpler elemental movements enables humans and animals to exhibit motor skills which have not yet been matched by robots. Humans intuitively decompose complex motions into smaller, simpler segments. For example when describing simple movements like drawing a triangle with a pen, we can easily name the basic steps of this movement.
    Surprisingly, such abstractions have rarely been used in artificial motor skill learning algorithms. These algorithms typically choose a new action (such as a torque or a force) at a very fast time-scale. As a result, both policy and temporal credit assignment problem become unnecessarily complex - often beyond the reach of current machine learning methods.
    We introduce a new framework for temporal abstractions in reinforcement learning (RL), i.e. RL with motion templates. We present a new algorithm for this framework which can learn high-quality policies by making only few abstract decisions.}
    }

  • G. Neumann and J. Peters, “Fitted Q-iteration by advantage weighted regression,” in Advances in Neural Information Processing Systems 22 (NIPS 2008), 2009, pp. 1177-1184.
    [BibTeX] [Abstract] [Download PDF]

    Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantage weighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high dimensional action spaces.

    @inproceedings{lirolem25796,
    title = {Fitted Q-iteration by advantage weighted regression},
    year = {2009},
    author = {Gerhard Neumann and Jan Peters},
    booktitle = {Advances in Neural Information Processing Systems 22 (NIPS 2008)},
    month = {June},
    pages = {1177--1184},
    journal = {Advances in Neural Information Processing Systems 21 - Proceedings of the 2008 Conference},
    url = {http://eprints.lincoln.ac.uk/25796/},
    abstract = {Recently, fitted Q-iteration (FQI) based methods have become more popular due
    to their increased sample efficiency, a more stable learning process and the higher
    quality of the resulting policy. However, these methods remain hard to use for continuous
    action spaces which frequently occur in real-world tasks, e.g., in robotics
    and other technical applications. The greedy action selection commonly used for
    the policy improvement step is particularly problematic as it is expensive for continuous
    actions, can cause an unstable learning process, introduces an optimization
    bias and results in highly non-smooth policies unsuitable for real-world systems.
    In this paper, we show that by using a soft-greedy action selection the policy
    improvement step used in FQI can be simplified to an inexpensive advantage weighted
    regression. With this result, we are able to derive a new, computationally
    efficient FQI algorithm which can even deal with high dimensional action spaces.}
    }
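
    The soft-greedy policy improvement described above reduces to a weighted regression in which each state-action sample is weighted by an exponentiated advantage. The sketch below uses a hand-made advantage function and a linear policy mean purely for illustration; in the paper the advantages come from the fitted Q-function.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 500

    S = rng.uniform(-1.0, 1.0, size=(n, 1))                   # states
    A = rng.uniform(-2.0, 2.0, size=(n, 1))                   # explored actions

    # Toy advantage function with optimal action a*(s) = 1.5 s (an assumption; in FQI
    # these values would be computed from the fitted Q-function instead).
    adv = -((A - 1.5 * S) ** 2).ravel()

    # Soft-greedy weights and weighted linear regression for the policy mean.
    eta = 0.1
    w = np.exp((adv - adv.max()) / eta)
    Phi = np.hstack([np.ones((n, 1)), S])                      # features [1, s]
    WPhi = Phi * w[:, None]
    beta = np.linalg.solve(Phi.T @ WPhi + 1e-8 * np.eye(2), WPhi.T @ A)
    print(beta.ravel())                                        # approximately [0, 1.5]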