Data-driven trajectory optimization

  • IROS 2017: Hybrid control trajectory optimization under uncertainty

    Trajectory optimization is a fundamental problem in robotics. While optimization of continuous control trajectories is well developed, many applications require both discrete and continuous controls, i.e., hybrid controls. Finding an optimal sequence of hybrid controls is challenging due to the exponential explosion of discrete control combinations. Our method, based on Differential Dynamic Programming (DDP), circumvents this problem by incorporating discrete actions inside DDP: we first optimize continuous mixtures of discrete actions and subsequently force the mixtures into fully discrete actions. Moreover, we show how our approach can be extended to partially observable Markov decision processes (POMDPs) for trajectory planning under uncertainty. We validate the approach in a car driving problem where the robot has to switch discrete gears and in a box-pushing application where the robot can switch the side of the box to push. The pose and the friction parameters of the pushed box are initially unknown and only indirectly observable. (A toy sketch of the mixture relaxation is given after the BibTeX entry below.)

    • J. Pajarinen, V. Kyrki, M. Koval, S. Srinivasa, J. Peters, and G. Neumann, “Hybrid control trajectory optimization under uncertainty,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
      [BibTeX] [Abstract] [Download PDF]

      @inproceedings{lirolem28257,
      year = {2017},
      author = {J. Pajarinen and V. Kyrki and M. Koval and S. Srinivasa and J. Peters and G. Neumann},
      month = {September},
      booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
      title = {Hybrid control trajectory optimization under uncertainty},
      abstract = {Trajectory optimization is a fundamental problem in robotics. While optimization of continuous control trajectories is well developed, many applications require both discrete and continuous, i.e. hybrid controls. Finding an optimal sequence of hybrid controls is challenging due to the exponential explosion of discrete control combinations. Our method, based on Differential Dynamic Programming (DDP), circumvents this problem by incorporating discrete actions inside DDP: we first optimize continuous mixtures of discrete actions, and, subsequently force the mixtures into fully discrete actions. Moreover, we show how our approach can be extended to partially observable Markov decision processes (POMDPs) for trajectory planning under uncertainty. We validate the approach in a car driving problem where the robot has to switch discrete gears and in a box pushing application where the robot can switch the side of the box to push. The pose and the friction parameters of the pushed box are initially unknown and only indirectly observable.},
      url = {http://eprints.lincoln.ac.uk/28257/}
      }
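
      As a rough illustration of the mixture idea described above (and not the authors' DDP implementation), the toy sketch below jointly optimizes a continuous throttle and a softmax mixture over discrete gears on a made-up 1-D car model, then anneals the mixture temperature so that the gear choice becomes effectively one-hot. Plain finite-difference gradient descent on a shooting objective stands in for DDP, and the dynamics, cost weights, learning rates, and annealing schedule are all invented for illustration.

      import numpy as np

      T, K = 20, 3                               # horizon, number of discrete gears
      gear_ratio = np.array([1.0, 2.0, 3.0])     # invented force multiplier per gear
      dt, target = 0.1, 5.0                      # step size, desired final position

      def softmax(z, temp):
          z = (z - z.max()) / temp
          e = np.exp(z)
          return e / e.sum()

      def rollout_cost(throttle, logits, temp):
          """Simulate the 1-D car; the gear enters only through a softmax mixture."""
          pos, vel, cost = 0.0, 0.0, 0.0
          for t in range(T):
              w = softmax(logits[t], temp)               # continuous mixture over gears
              vel += dt * throttle[t] * (w @ gear_ratio)
              pos += dt * vel
              cost += 0.01 * throttle[t] ** 2
          return cost + 10.0 * (pos - target) ** 2 + vel ** 2

      def grad_fd(f, x, eps=1e-4):
          """Central finite differences; stands in for DDP's analytic derivatives."""
          g = np.zeros_like(x)
          xf, gf = x.ravel(), g.ravel()
          for i in range(xf.size):
              old = xf[i]
              xf[i] = old + eps
              hi = f(x)
              xf[i] = old - eps
              lo = f(x)
              xf[i] = old
              gf[i] = (hi - lo) / (2 * eps)
          return g

      throttle, logits = np.zeros(T), np.zeros((T, K))
      for it in range(200):
          temp = max(0.99 ** it, 0.05)                   # anneal mixtures towards one-hot
          throttle -= 0.01 * grad_fd(lambda th: rollout_cost(th, logits, temp), throttle)
          logits -= 0.02 * grad_fd(lambda lg: rollout_cost(throttle, lg, temp), logits)

      gears = logits.argmax(axis=1)                      # force fully discrete gear choices
      print("gear schedule:", gears)
      print("final cost   :", rollout_cost(throttle, logits, 1e-3))

      In the paper's box-pushing setting the discrete choice is the side of the box to push rather than the gear, but the relaxation idea is the same; only the rollout model changes.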

  • ICAPS 2017: State-regularized policy search for linearized dynamical systems

    Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of feedback controllers by taking advantage of local approximations of model dynamics and cost functions. Stability of the policy update is a major issue for these methods, rendering them hard to apply to highly nonlinear systems. Recent approaches combine classical Stochastic Optimal Control methods with information-theoretic bounds to control the step-size of the policy update and could even be used to train nonlinear deep control policies. These methods bound the relative entropy between the new and the old policy to ensure a stable policy update. However, despite the bound in policy space, the state distributions of two consecutive policies can still differ significantly, rendering the used local approximate models invalid. To alleviate this issue, we propose enforcing a relative entropy constraint not only on the policy update, but also on the update of the state distribution, around which the dynamics and cost are being approximated. We present a derivation of the closed-form policy update and show that our approach outperforms related methods on two nonlinear and highly dynamic simulated systems. (The doubly constrained update is written out after the BibTeX entry below.)

    • H. Abdulsamad, O. Arenz, J. Peters, and G. Neumann, “State-regularized policy search for linearized dynamical systems,” in Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), 2017.
      [BibTeX] [Abstract] [Download PDF]

      @inproceedings{lirolem27055,
      author = {Hany Abdulsamad and Oleg Arenz and Jan Peters and Gerhard Neumann},
      month = {June},
      booktitle = {Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS)},
      year = {2017},
      title = {State-regularized policy search for linearized dynamical systems},
      abstract = {Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of feedback controllers by taking advantage of local approximations of model dynamics and cost functions. Stability of the policy update is a major issue for these methods, rendering them hard to apply for highly nonlinear systems. Recent approaches combine classical Stochastic Optimal Control methods with information-theoretic bounds to control the step-size of the policy update and could even be used to train nonlinear deep control policies. These methods bound the relative entropy between the new and the old policy to ensure a stable policy update. However, despite the bound in policy space, the state distributions of two consecutive policies can still differ significantly, rendering the used local approximate models invalid. To alleviate this issue we propose enforcing a relative entropy constraint not only on the policy update, but also on the update of the state distribution, around which the dynamics and cost are being approximated. We present a derivation of the closed-form policy update and show that our approach outperforms related methods on two nonlinear and highly dynamic simulated systems.},
      url = {http://eprints.lincoln.ac.uk/27055/}
      }
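
      In formulas, the doubly constrained update described above can be written roughly as follows (the notation and the two bounds \epsilon_\pi, \epsilon_\mu are ours, not necessarily the paper's exact formulation), where \mu_\pi denotes the state distribution induced by policy \pi and c(s, a) the cost:

      \begin{align*}
      \max_{\pi}\quad & \mathbb{E}_{\mu_{\pi}(s),\,\pi(a\mid s)}\big[-c(s,a)\big] \\
      \text{s.t.}\quad & \mathbb{E}_{\mu_{\pi_{\mathrm{old}}}(s)}\Big[\mathrm{KL}\big(\pi(a\mid s)\,\big\|\,\pi_{\mathrm{old}}(a\mid s)\big)\Big] \le \epsilon_\pi, \\
      & \mathrm{KL}\big(\mu_{\pi}(s)\,\big\|\,\mu_{\pi_{\mathrm{old}}}(s)\big) \le \epsilon_\mu .
      \end{align*}

      Bounding only the first term, as prior step-size-controlled methods do, still allows \mu_{\pi} to drift far from \mu_{\pi_{\mathrm{old}}}, which is exactly the failure mode that invalidates the local approximate models and that the second constraint is meant to prevent.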

  • ICML 2016: Model-free trajectory optimization for reinforcement learning

    Many of the recent Trajectory Optimization algorithms alternate between local approximation of the dynamics and a conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local, quadratic, time-dependent Q-function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, demonstrating improved performance in comparison to related Trajectory Optimization algorithms that linearize the dynamics. (A minimal sketch of such a closed-form update is given after the BibTeX entry below.)

    • R. Akrour, A. Abdolmaleki, H. Abdulsamad, and G. Neumann, “Model-free trajectory optimization for reinforcement learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 4342-4352.
      [BibTeX] [Abstract] [Download PDF]

      @inproceedings{lirolem25747,
      booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
      month = {June},
      volume = {6},
      title = {Model-free trajectory optimization for reinforcement learning},
      year = {2016},
      author = {R. Akrour and A. Abdolmaleki and H. Abdulsamad and G. Neumann},
      journal = {33rd International Conference on Machine Learning, ICML 2016},
      pages = {4342--4352},
      url = {http://eprints.lincoln.ac.uk/25747/},
      abstract = {Many of the recent Trajectory Optimization algorithms alternate between local approximation of the dynamics and conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local quadratic time-dependent Q-Function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics demonstrating improved performance in comparison to related Trajectory Optimization algorithms linearizing the dynamics.}
      }
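
      A minimal 1-D sketch of this kind of update is given below: a local, quadratic, time-dependent Q-function is fitted to Monte-Carlo returns, and a linear-Gaussian policy is then reweighted by exp(Q/eta) in closed form by completing the square. The toy linear system, the quadratic features, and the fixed temperature eta are our own simplifications; in particular, the paper obtains the temperature from the dual of the KL constraint rather than fixing it by hand.

      import numpy as np

      rng = np.random.default_rng(0)
      T, N, eta = 10, 200, 5.0                 # horizon, rollouts per iteration, fixed KL temperature
      k = np.zeros(T)                          # time-dependent linear-Gaussian policy:
      b = np.zeros(T)                          #   pi_t(a|s) = N(k[t]*s + b[t], sigma[t]^2)
      sigma = np.ones(T)

      def rollout():
          """One episode under the current policy on a toy linear system."""
          s, steps = 0.5 + 0.1 * rng.standard_normal(), []
          for t in range(T):
              a = k[t] * s + b[t] + sigma[t] * rng.standard_normal()
              c = s ** 2 + 0.1 * a ** 2                    # toy quadratic cost
              steps.append((s, a, c))
              s = 0.9 * s + 0.5 * a + 0.01 * rng.standard_normal()
          return steps

      for it in range(15):
          data = [rollout() for _ in range(N)]
          for t in reversed(range(T)):
              S = np.array([d[t][0] for d in data])
              A = np.array([d[t][1] for d in data])
              # Monte-Carlo Q targets: (negative) cost-to-go observed from step t onwards.
              Q = np.array([-sum(c for _, _, c in d[t:]) for d in data])
              # Fit a local quadratic model Q(s, a) ~ phi(s, a) . theta by least squares.
              Phi = np.column_stack([A ** 2, A * S, A, S ** 2, S, np.ones_like(S)])
              theta, *_ = np.linalg.lstsq(Phi, Q, rcond=None)
              q_aa, q_as, q_a = theta[0], theta[1], theta[2]
              # Closed-form reweighting pi_new(a|s) ~ pi_old(a|s) * exp(Q(s, a)/eta):
              # completing the square keeps the policy Gaussian in a.
              p_old = 1.0 / sigma[t] ** 2
              p_new = max(p_old - 2.0 * q_aa / eta, 1e-2)  # guard against noisy fits
              k[t] = (p_old * k[t] + q_as / eta) / p_new
              b[t] = (p_old * b[t] + q_a / eta) / p_new
              sigma[t] = np.sqrt(1.0 / p_new)
          avg_cost = np.mean([sum(c for _, _, c in d) for d in data])
          print(f"iter {it:2d}  average cost {avg_cost:.3f}")

      Because the Q-function is fitted directly from sampled returns, no linearization of the dynamics is needed, which is the point of the model-free formulation described in the abstract.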