ICAPS 2017: State-regularized policy search for linearized dynamical systems

Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of feedback controllers by taking advantage of local approximations of model dynamics and cost functions. Stability of the policy update is a major issue for these methods, rendering them hard to apply to highly nonlinear systems. Recent approaches combine classical Stochastic Optimal Control methods with information-theoretic bounds to control the step size of the policy update and can even be used to train nonlinear deep control policies. These methods bound the relative entropy between the new and the old policy to ensure a stable policy update. However, despite the bound in policy space, the state distributions of two consecutive policies can still differ significantly, rendering the local approximate models invalid. To alleviate this issue, we propose enforcing a relative entropy constraint not only on the policy update, but also on the update of the state distribution around which the dynamics and cost are approximated. We present a derivation of the closed-form policy update and show that our approach outperforms related methods on two nonlinear and highly dynamic simulated systems.
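
The key idea described above is to bound both the policy update and the induced state distribution with relative entropy (KL) constraints. The following LaTeX sketch illustrates one such formulation; the exact objective, time indexing, and bound values epsilon_pi and epsilon_mu are illustrative assumptions, not taken from the paper:

    \max_{\pi_t}\; \mathbb{E}\Big[\textstyle\sum_{t} r_t(x_t, u_t)\Big]
    \quad \text{s.t.}\quad
    \mathbb{E}_{\mu_t(x)}\big[\mathrm{KL}\big(\pi_t(u \mid x)\,\|\,q_t(u \mid x)\big)\big] \le \epsilon_\pi
    \;\;\text{and}\;\;
    \mathrm{KL}\big(\mu_t(x)\,\|\,q_t(x)\big) \le \epsilon_\mu \quad \forall t,

where q_t denotes the previous time-dependent policy, q_t(x) its induced state distribution, and mu_t(x) the state distribution induced by the updated policy. The second constraint is what keeps the new state distribution close to the region where the local dynamics and cost approximations were fitted.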

  • H. Abdulsamad, O. Arenz, J. Peters, and G. Neumann, “State-regularized policy search for linearized dynamical systems,” in Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), 2017.

    @inproceedings{lirolem27055,
    month = {June},
    year = {2017},
    title = {State-regularized policy search for linearized dynamical systems},
    author = {Hany Abdulsamad and Oleg Arenz and Jan Peters and Gerhard Neumann},
    booktitle = {Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS)},
    url = {http://eprints.lincoln.ac.uk/27055/},
    abstract = {Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of feedback controllers by taking advantage of local approximations of model dynamics and cost functions. Stability of the policy update is a major issue for these methods, rendering them hard to apply to highly nonlinear systems. Recent approaches combine classical Stochastic Optimal Control methods with information-theoretic bounds to control the step size of the policy update and can even be used to train nonlinear deep control policies. These methods bound the relative entropy between the new and the old policy to ensure a stable policy update. However, despite the bound in policy space, the state distributions of two consecutive policies can still differ significantly, rendering the local approximate models invalid. To alleviate this issue, we propose enforcing a relative entropy constraint not only on the policy update, but also on the update of the state distribution around which the dynamics and cost are approximated. We present a derivation of the closed-form policy update and show that our approach outperforms related methods on two nonlinear and highly dynamic simulated systems.}
    }