Learning from Human Feedback

Survey papers:

  • T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An Algorithmic Perspective on Imitation Learning,” Foundations and Trends® in Robotics, vol. 7, iss. 1-2, pp. 1-179, 2018. doi:10.1561/2300000053
    [BibTeX] [Download PDF]
    @article{ROB-053,
    url = {http://dx.doi.org/10.1561/2300000053},
    year = {2018},
    volume = {7},
    journal = {Foundations and Trends® in Robotics},
    title = {An Algorithmic Perspective on Imitation Learning},
    doi = {10.1561/2300000053},
    issn = {1935-8253},
    number = {1-2},
    pages = {1-179},
    author = {Takayuki Osa and Joni Pajarinen and Gerhard Neumann and J. Andrew Bagnell and Pieter Abbeel and Jan Peters}
    }

  • C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A Survey of Preference-Based Reinforcement Learning Methods,” Journal of Machine Learning Research, vol. 18, iss. 136, pp. 1-46, 2017.
    [BibTeX] [Download PDF]
    @Article{JMLR:v18:16-634,
    Title = {A Survey of Preference-Based Reinforcement Learning Methods},
    Author = {Christian Wirth and Riad Akrour and Gerhard Neumann and Johannes Fürnkranz},
    Journal = {Journal of Machine Learning Research},
    Year = {2017},
    Number = {136},
    Pages = {1-46},
    Volume = {18},
    Url = {http://jmlr.org/papers/v18/16-634.html}
    }


  • RA-L 2017: Guiding trajectory optimization by demonstrated distributions

    Trajectory optimization is an essential tool for motion planning under the multiple constraints of robotic manipulators. Optimization-based methods can explicitly optimize a trajectory by leveraging prior knowledge of the system and have been used in various applications such as collision avoidance. However, these methods often require a hand-coded cost function in order to achieve the desired behavior. Specifying such a cost function for a complex desired behavior, e.g., disentangling a rope, is a nontrivial task that is often even infeasible. Learning from demonstration (LfD) methods offer an alternative way to program robot motion. LfD methods are less dependent on analytical models and instead learn the behavior of experts implicitly from the demonstrated trajectories. However, the problem of adapting the demonstrations to new situations, e.g., avoiding newly introduced obstacles, has not been fully investigated in the literature. In this paper, we present a motion planning framework that combines the advantages of optimization-based and demonstration-based methods. We learn a distribution of trajectories demonstrated by human experts and use it to guide the trajectory optimization process. The resulting trajectory maintains the demonstrated behaviors, which are essential to performing the task successfully, while adapting the trajectory to avoid obstacles. In simulated experiments and with a real robotic system, we verify that our approach optimizes the trajectory to avoid obstacles and encodes the demonstrated behavior in the resulting trajectory. A minimal sketch of the guided-optimization idea follows the BibTeX entry below.

    • T. Osa, A. G. M. Esfahani, R. Stolkin, R. Lioutikov, J. Peters, and G. Neumann, “Guiding trajectory optimization by demonstrated distributions,” IEEE Robotics and Automation Letters (RA-L), vol. 2, iss. 2, pp. 819-826, 2017.
      [BibTeX] [Download PDF]

      @article{lirolem26731,
      volume = {2},
      publisher = {IEEE},
      journal = {IEEE Robotics and Automation Letters (RA-L)},
      month = {January},
      pages = {819--826},
      number = {2},
      author = {Takayuki Osa and Amir M. Ghalamzan Esfahani and Rustam Stolkin and Rudolf Lioutikov and Jan Peters and Gerhard Neumann},
      title = {Guiding trajectory optimization by demonstrated distributions},
      year = {2017},
      url = {http://eprints.lincoln.ac.uk/26731/},
      }
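
    The central mechanism is simple to sketch: fit a distribution (here, a single Gaussian over flattened waypoints) to a handful of demonstrated trajectories, then optimize a new trajectory under a cost that trades off the negative log-likelihood under that distribution against an obstacle penalty. The snippet below is a minimal 2-D toy under those assumptions; the synthetic demonstrations, the single-Gaussian model, and the quadratic obstacle penalty are illustrative choices, not the framework from the paper.

      # Toy sketch: guide trajectory optimization with a distribution fitted
      # to demonstrations (illustrative only, not the paper's full framework).
      import numpy as np
      from scipy.optimize import minimize

      T, D = 20, 2                      # waypoints per trajectory, workspace dim
      rng = np.random.default_rng(0)

      # Hypothetical demonstrations: noisy straight lines from (0, 0) to (1, 1).
      line = np.linspace([0.0, 0.0], [1.0, 1.0], T)
      demos = np.stack([line + 0.02 * rng.standard_normal((T, D)) for _ in range(10)])

      # Fit a Gaussian over flattened trajectories (mean + regularized covariance).
      flat = demos.reshape(len(demos), -1)
      mu = flat.mean(axis=0)
      cov = np.cov(flat, rowvar=False) + 1e-4 * np.eye(T * D)
      cov_inv = np.linalg.inv(cov)

      obstacle, radius = np.array([0.5, 0.55]), 0.15   # newly introduced obstacle

      def cost(x):
          traj = x.reshape(T, D)
          diff = x - mu
          # Stay close to the demonstrated distribution (negative log-likelihood).
          nll = 0.5 * diff @ cov_inv @ diff
          # Penalize waypoints that enter the obstacle's clearance radius.
          dists = np.linalg.norm(traj - obstacle, axis=1)
          penalty = np.sum(np.maximum(0.0, radius - dists) ** 2)
          return nll + 1e4 * penalty

      res = minimize(cost, mu, method="L-BFGS-B")
      optimized = res.x.reshape(T, D)
      print("closest approach to obstacle:",
            np.linalg.norm(optimized - obstacle, axis=1).min())

    Starting the optimizer at the demonstration mean keeps the solution inside the demonstrated "corridor"; the obstacle term then bends the trajectory only where necessary.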


  • AAAI 2016: Model-free Preference-based Reinforcement Learning

    Specifying a numeric reward function for reinforcement learning typically requires a lot of hand-tuning from a human expert. In contrast, preference-based reinforcement learning (PBRL) utilizes only pairwise comparisons between trajectories as a feedback signal, which are often more intuitive to specify. Currently available approaches to PBRL for control problems with continuous state/action spaces require a known or estimated model, which is often not available and hard to learn. In this paper, we integrate preference-based estimation of the reward function into a model-free reinforcement learning (RL) algorithm, resulting in a model-free PBRL algorithm. Our new algorithm is based on Relative Entropy Policy Search (REPS), enabling us to utilize stochastic policies and to directly control the greediness of the policy update. REPS decreases exploration of the policy slowly by limiting the relative entropy of the policy update, which ensures that the algorithm is provided with a versatile set of trajectories, and consequently with informative preferences. The preference-based estimation is computed using a sample-based Bayesian method, which can also estimate the uncertainty of the utility. We also compare to a linearly solvable approximation based on inverse RL. We show that both approaches compare favourably to the current state of the art. The overall result is an algorithm that can learn non-parametric continuous action policies from a small number of preferences. A toy sketch of the preference-based utility estimate and the REPS-style weighting follows the BibTeX entry below.

    • C. Wirth, J. Fürnkranz, and G. Neumann, “Model-free preference-based reinforcement learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2222-2228.
      [BibTeX] [Download PDF]

      @inproceedings{lirolem25746,
      pages = {2222--2228},
      booktitle = {Thirtieth AAAI Conference on Artificial Intelligence},
      month = {February},
      title = {Model-free preference-based reinforcement learning},
      year = {2016},
      author = {Christian Wirth and Johannes Fürnkranz and Gerhard Neumann},
      url = {http://eprints.lincoln.ac.uk/25746/},
      }
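
    The two ingredients can be sketched compactly: a Bradley-Terry style likelihood turns pairwise preferences over trajectories into an estimated utility, and REPS-style exponential weighting of sampled trajectories keeps the policy update soft. The snippet below is a toy sketch under stated assumptions: a linear utility in hypothetical trajectory features, a simple MAP estimate in place of the paper's sample-based Bayesian estimate, and a fixed temperature eta instead of one obtained from the relative-entropy bound.

      # Toy sketch: utility from pairwise preferences (Bradley-Terry MAP estimate)
      # plus REPS-style exponential weighting of sampled trajectories.
      import numpy as np
      from scipy.optimize import minimize

      rng = np.random.default_rng(1)
      N, K = 30, 4                               # sampled trajectories, feature dim
      phi = rng.standard_normal((N, K))          # hypothetical trajectory features
      w_true = np.array([1.0, -0.5, 0.0, 2.0])   # hidden utility, used only to simulate

      # Simulate preferences: (i, j) means trajectory i was preferred over j.
      all_pairs = [(i, j) for i in range(N) for j in range(N)
                   if i != j and phi[i] @ w_true > phi[j] @ w_true]
      idx = rng.choice(len(all_pairs), size=60, replace=False)
      prefs = [all_pairs[k] for k in idx]

      def neg_log_posterior(w):
          # Bradley-Terry: P(i preferred over j) = sigmoid(U(i) - U(j)), U = w . phi.
          diffs = np.array([phi[i] @ w - phi[j] @ w for i, j in prefs])
          return np.sum(np.logaddexp(0.0, -diffs)) + 0.5 * w @ w  # Gaussian prior

      w_hat = minimize(neg_log_posterior, np.zeros(K)).x

      # REPS-style update: weight trajectories by exp(U / eta). A small eta is
      # greedy, a large eta keeps the new policy close to the old one; in full
      # REPS, eta comes from a dual problem enforcing the relative-entropy bound.
      eta = 1.0
      utilities = phi @ w_hat
      weights = np.exp((utilities - utilities.max()) / eta)
      weights /= weights.sum()
      print("highest-weighted trajectory:", int(np.argmax(weights)))

    The exponential weights would then be used to refit the policy (e.g., by weighted maximum likelihood over the sampled trajectories), which is where the bounded relative entropy keeps exploration from collapsing too quickly.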