This collection uses a graph to describe the relationships among Deep Reinforcement Learning baselines, tabulates the characteristics of these algorithms, and lists the papers and blog posts for each baseline.

Table of Contents
- Reinforcement Learning Basics
- Algorithm Comparison
- DRL Baselines
- Accessories
Reinforcement Learning Basics
📄 Learning Resources
- Reinforcement Learning: An Introduction, Sutton & Barto
- UCL Course on RL, David Silver
- CS 294 Deep Reinforcement Learning, Fall 2017, Sergey Levine
- [Amazon] Deep Reinforcement Learning Hands-On: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more, Maxim Lapan
✏️ Author’s Note
I strongly recommend VSCode with Python Interactive for those who are not yet familiar with OpenAI gym. It preserves auto-completion while editing *.py files, lets you execute cells as in a Jupyter Notebook, renders outputs correctly, and works with virtualenv or Anaconda virtual environments.
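For example, Python Interactive treats `# %%` comments in a plain .py file as cell boundaries, so you can poke at a gym environment interactively. A minimal sketch, assuming the classic 4-tuple gym step API:

```python
# %%  everything until the next "# %%" marker runs as one cell
import gym

env = gym.make("CartPole-v0")
obs = env.reset()

# %%  run one random episode; re-run this cell as many times as you like
done, total_reward = False, 0.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
print(total_reward)
```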
Algorithm Comparison
Baseline Algorithms
Author & Year | Algorithm | Type | On/Off-policy | Trust Region | Experience Replay | Importance Sampling |
---|---|---|---|---|---|---|
Mnih et al., 2015 | DQN | Q-Learning | Off-policy | None | Uniform | No |
van Hasselt et al., 2016 | DDQN | Q-Learning | Off-policy | None | Uniform | No |
Wang et al., 2016 | Dueling-DQN | Q-Learning | Off-policy | None | Uniform | No |
Hausknecht et al., 2015 | DRQN | Q-Learning | Off-policy | None | Uniform | No |
Bellemare et al., 2017 | C51 | Q-Learning | Off-policy | None | Uniform | No |
Hessel et al., 2018 | Rainbow-DQN | Q-Learning | Off-policy | None | Prioritized | No |
Schulman et al., 2015 | TRPO | Policy Gradient | On-policy | Policy | None | No |
Schulman et al., 2017 | PPO | Policy Gradient | On-policy | Policy | None | No |
Mnih et al., 2016 | A3C | Actor-Critic | On-policy | None | None | No |
Babaeizadeh et al., 2017 | GA3C | Actor-Critic | On-policy | None | None | No |
OpenAI, 2017 | A2C | Actor-Critic | On-policy | None | None | No |
Clemente et al., 2017 | PAAC | Actor-Critic | On-policy | None | None | No |
Wang et al., 2017 | ACER | Actor-Critic | Off-policy | Policy | Uniform | Yes |
Wu et al., 2017 | ACKTR | Actor-Critic | On-policy | Both | None | No |
Lillicrap et al., 2016 | DDPG | Actor-Critic | Off-policy | None | Uniform | No |
Fujimoto et al., 2018 | TD3 | Actor-Critic | Off-policy | None | Uniform | No |
Haarnoja et al., 2019 | SAC | Actor-Critic | Off-policy | None | Uniform | No |
Ciosek et al., 2019 | OAC | Actor-Critic | Off-policy | None | Uniform | No |
Accessories
Author & Year | Algorithm | Enhancement | Requirements |
---|---|---|---|
Schaul et al., 2016 | PER | Sample Efficiency | Replay Buffer |
Andrychowicz et al., 2017 | HER | Sample Efficiency | Similar Goals |
Fortunato et al., 2018 | NoisyNet | Exploration | Neural Networks |
Pathak et al., 2017 | Curiosity / ICM | Exploration | Predictable States |
Burda et al., 2019 | RND | Exploration | Countable States |
DRL Baselines
Some common and important baselines used for benchmarking.
Value Based Methods
DQN (Deep Q-Network)
- Mnih et al., 2013, Playing Atari with Deep Reinforcement Learning, NIPS Deep Learning Workshop 2013
- Mnih et al., 2015, Human-level control through deep reinforcement learning, Nature 2015
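The core of DQN is a one-step TD target computed with a periodically synced target network. A minimal PyTorch sketch (function and variable names are mine, not from the papers):

```python
import torch

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # y = r + gamma * max_a' Q_target(s', a'), with bootstrapping cut at terminal states;
    # the online network is then regressed onto y (Huber loss in the Nature paper)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q
```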
DDQN (Double Deep Q-Network)
- van Hasselt et al., 2016, Deep Reinforcement Learning with Double Q-Learning, AAAI 2016
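DDQN changes only the target: the online network picks the argmax action and the target network evaluates it, which reduces overestimation. A sketch mirroring the DQN snippet above:

```python
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # action selection by the online net, evaluation by the target net
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```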
Dueling-DQN
- Ziyu Wang et al., 2016, Dueling Network Architectures for Deep Reinforcement Learning, ICML 2016
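The dueling architecture splits the network into a state-value stream and an advantage stream and recombines them as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a). A minimal sketch (layer sizes are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, obs):
        h = self.trunk(obs)
        v, a = self.value(h), self.advantage(h)
        # subtracting the mean advantage keeps V and A identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```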
DRQN (Deep Recurrent Q-Network)
- Hausknecht et al., 2015, Deep Recurrent Q-Learning for Partially Observable MDPs, AAAI 2015 Fall Symposium
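DRQN makes the Q-network recurrent so value estimates can depend on observation history under partial observability. A minimal sketch with a vector observation instead of Atari frames:

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Q-network with an LSTM so value estimates can use observation history."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim); the hidden state carries memory across steps
        out, hidden_state = self.lstm(obs_seq, hidden_state)
        return self.head(out), hidden_state  # Q-values per timestep
```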
C51 / Categorical DQN / Distributional DQN
- Bellemare et al., 2017, A Distributional Perspective on Reinforcement Learning, ICML 2017
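C51 predicts a categorical distribution over returns on a fixed support of 51 atoms; greedy action selection only needs the distribution's mean. A small sketch (the support bounds are the Atari defaults from the paper):

```python
import torch

v_min, v_max, n_atoms = -10.0, 10.0, 51          # Atari defaults from the paper
support = torch.linspace(v_min, v_max, n_atoms)  # fixed return atoms z_1 .. z_51

def expected_q(probs):
    """probs: (batch, n_actions, n_atoms) softmax output; Q(s, a) = sum_i p_i * z_i."""
    return (probs * support).sum(dim=-1)
```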
Rainbow DQN
Rainbow combines six extensions to DQN in a single agent: double Q-learning, prioritized replay, dueling networks, multi-step learning, distributional RL (C51), and noisy nets.
- Hessel et al., 2018, Rainbow: Combining Improvements in Deep Reinforcement Learning, AAAI 2018
Policy Gradient Methods
A good blog post by Lilian Weng, Policy Gradient Algorithms, helped me a lot when reading PG papers.
TRPO (Trust Region Policy Optimization)
- Schulman et al., 2015, Trust Region Policy Optimization, ICML 2015
- OpenAI Spinning Up, Trust Region Policy Optimization
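For reference, TRPO maximizes the importance-sampled surrogate objective subject to a KL trust-region constraint:

```latex
\max_\theta \;\; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[D_\mathrm{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta
```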
PPO (Proximal Policy Optimization) and TRPO+
- Schulman et al., 2017, Proximal Policy Optimization Algorithms, arXiv preprint
- OpenAI blog, 2017, Proximal Policy Optimization
- OpenAI Spinning Up, Proximal Policy Optimization
- Engstrom et al., 2020, Implementation Matters in Deep RL: A Case Study on PPO and TRPO, ICLR 2020
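PPO replaces TRPO's hard KL constraint with a clipped surrogate objective that can be optimized with plain SGD. A minimal sketch of the clipped loss (names are mine):

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs - old_log_probs)  # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # pessimistic minimum removes the incentive to push the ratio outside the clip range
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```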
Actor-Critic Methods
A3C (Asynchronous Advantage Actor-Critic) and GA3C (Hybrid CPU / GPU A3C)
- Mnih et al., 2016, Asynchronous Methods for Deep Reinforcement Learning, ICML 2016
- Babaeizadeh et al., 2017, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017
A2C (Synchronous Advantage Actor-Critic)
- OpenAI blog, 2017, OpenAI Baselines: ACKTR & A2C
- Clemente et al., 2017, Efficient Parallel Methods for Deep Reinforcement Learning, arXiv preprint
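Both A3C and A2C optimize the same advantage actor-critic loss; they differ only in whether gradients are applied asynchronously or in synchronized batches. A hedged sketch of the per-batch loss (coefficients are common defaults, not prescribed by the papers):

```python
import torch

def a2c_loss(log_probs, values, returns, entropy, v_coef=0.5, ent_coef=0.01):
    advantages = returns - values.detach()            # A(s, a) = R - V(s)
    policy_loss = -(log_probs * advantages).mean()    # policy gradient term
    value_loss = (returns - values).pow(2).mean()     # critic regression term
    return policy_loss + v_coef * value_loss - ent_coef * entropy.mean()
```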
ACER (Actor-Critic with Experience Replay)
- Wang et al., 2017, Sample Efficient Actor-Critic with Experience Replay, ICLR 2017
ACKTR (Actor-Critic using Kronecker-Factored Trust Region)
- Yuhuai Wu et al., 2017, Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, NIPS 2017
DPG (Deterministic Policy Gradient) and DDPG (Deep Deterministic Policy Gradient)
- Silver et al., 2014, Deterministic Policy Gradient Algorithms, ICML 2014
- Lillicrap et al., 2016, Continuous control with deep reinforcement learning, ICLR 2016
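DDPG learns a deterministic actor by ascending the critic's Q-value and trains the critic on bootstrapped targets from slowly updated target networks. A sketch assuming `critic(states, actions)` returns shape (batch, 1):

```python
import torch

def ddpg_targets(target_actor, target_critic, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        next_q = target_critic(next_states, target_actor(next_states)).squeeze(-1)
    return rewards + gamma * (1.0 - dones) * next_q

def ddpg_actor_loss(actor, critic, states):
    # deterministic policy gradient: ascend Q(s, pi(s)) through the critic
    return -critic(states, actor(states)).mean()
```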
TD3 (Twin Delayed Deep Deterministic Policy Gradient)
- Fujimoto et al., 2018, Addressing Function Approximation Error in Actor-Critic Methods, ICML 2018
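TD3 adds three fixes to DDPG: clipped noise on the target action (target policy smoothing), the minimum over two target critics (clipped double Q-learning), and delayed actor updates. A sketch of the target computation, assuming actions bounded in [-act_limit, act_limit]:

```python
import torch

def td3_targets(target_actor, target_q1, target_q2, rewards, next_states, dones,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    with torch.no_grad():
        mu = target_actor(next_states)
        # target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(mu) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (mu + noise).clamp(-act_limit, act_limit)
        # clipped double Q-learning: bootstrap from the smaller of two critics
        next_q = torch.min(target_q1(next_states, next_actions),
                           target_q2(next_states, next_actions)).squeeze(-1)
    return rewards + gamma * (1.0 - dones) * next_q
```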
SAC (Soft Actor-Critic)
- Haarnoja et al., 2018, Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML 2018
- Haarnoja et al., 2019, Soft Actor-Critic Algorithms and Applications, arXiv preprint
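SAC augments the Bellman backup with an entropy bonus, so the bootstrapped value becomes next_q - alpha * log pi. A sketch assuming a `policy.sample(states)` method that returns actions and their log-probabilities (that interface is my assumption, not the paper's):

```python
import torch

def sac_targets(policy, target_q1, target_q2, rewards, next_states, dones,
                gamma=0.99, alpha=0.2):
    with torch.no_grad():
        next_actions, next_log_probs = policy.sample(next_states)  # assumed interface
        next_q = torch.min(target_q1(next_states, next_actions),
                           target_q2(next_states, next_actions)).squeeze(-1)
        soft_value = next_q - alpha * next_log_probs  # entropy bonus enters the backup
    return rewards + gamma * (1.0 - dones) * soft_value
```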
OAC (Optimistic Actor-Critic)
- Ciosek et al., 2019, Better Exploration with Optimistic Actor-Critic, NeurIPS 2019
Accessories
Methods that can improve existing algorithms.
PER (Prioritized Experience Replay)
- Schaul et al., 2016, Prioritized Experience Replay, ICLR 2016
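PER samples transitions with probability proportional to their TD error raised to alpha and corrects the resulting bias with importance-sampling weights annealed by beta. A flat NumPy sketch (real implementations use a sum-tree for O(log n) sampling):

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    # P(i) proportional to (|td_error_i| + eps)^alpha
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), batch_size, p=probs)
    # importance-sampling weights correct the non-uniform sampling bias
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()  # normalized for stability, as in the paper
```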
HER (Hindsight Experience Replay)
- Andrychowicz et al., 2017, Hindsight Experience Replay, NIPS 2017
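HER relabels stored transitions with goals that were actually achieved, so even failed episodes produce useful reward signal. A sketch of the simplest "final" strategy, where `compute_reward` stands in for the environment's goal-conditioned reward function:

```python
def her_relabel(episode, compute_reward):
    """episode: list of (obs, achieved_goal, action, next_obs) tuples;
    compute_reward(achieved, goal) is the environment's goal-conditioned reward."""
    final_goal = episode[-1][1]  # the goal the agent actually achieved at the end
    return [(obs, final_goal, action, compute_reward(achieved, final_goal), next_obs)
            for obs, achieved, action, next_obs in episode]
```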
NoisyNet
- Fortunato et al., 2018, Noisy Networks for Exploration, ICLR 2018
- OpenAI blog, 2017, Better Exploration with Parameter Noise, 2017
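NoisyNet replaces epsilon-greedy exploration with learned, parameterized noise on the network weights. A simplified sketch using independent Gaussian noise (the paper's default is factorized noise with a slightly different initialization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable Gaussian noise on weights and biases:
    y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""
    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.017):
        super().__init__()
        bound = 1.0 / in_features ** 0.5
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0))

    def forward(self, x):
        # fresh noise every forward pass; sigma shrinks wherever noise hurts the loss
        weight = self.mu_w + self.sigma_w * torch.randn_like(self.sigma_w)
        bias = self.mu_b + self.sigma_b * torch.randn_like(self.sigma_b)
        return F.linear(x, weight, bias)
```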
Curiosity / Intrinsic Curiosity Module (ICM)
- Pathak et al., 2017, Curiosity-driven Exploration by Self-supervised Prediction, ICML 2017
- Burda et al., 2019, Large-Scale Study of Curiosity-Driven Learning, ICLR 2019
- OpenAI blog, 2018, Reinforcement Learning with Prediction-Based Rewards
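ICM rewards the agent for visiting states its forward model cannot yet predict, measured in a learned feature space so that unpredictable-but-irrelevant pixels are ignored. A sketch where `encoder` and `forward_model` are assumed trainable modules and `eta` scales the bonus:

```python
import torch

def icm_reward(encoder, forward_model, obs, next_obs, action, eta=0.01):
    # intrinsic reward = prediction error of the forward model in feature space
    with torch.no_grad():
        phi, phi_next = encoder(obs), encoder(next_obs)
        phi_pred = forward_model(phi, action)  # predicted features of the next state
    return eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1)
```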
RND (Random Network Distillation)
- Burda et al., 2019, Exploration by Random Network Distillation, ICLR 2019
- OpenAI blog, 2018, Reinforcement Learning with Prediction-Based Rewards
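RND distills a fixed, randomly initialized target network into a trained predictor; the prediction error is high on novel states and shrinks on familiar ones, giving an exploration bonus without any forward dynamics model. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 8, 64  # illustrative sizes, not from the paper
target = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                       nn.Linear(feat_dim, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                          nn.Linear(feat_dim, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)  # the random target network is never trained

def rnd_reward(obs):
    # high where the predictor has not yet fit the target, i.e. on novel states;
    # the same quantity serves as the predictor's training loss
    return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
```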