This page describes the relationships among Deep Reinforcement Learning baselines with a graph, tabulates the characteristics of these algorithms, and lists the papers and blog posts for each baseline.

Reinforcement Learning Basics

✏️ Author’s Note

I strongly recommend VSCode with Python Interactive for those who are not familiar with OpenAI gym. It preserves auto-completion while editing *.py files, yet lets you execute cells as in a Jupyter Notebook. It also renders output correctly and works with virtualenv or anaconda virtual environments.
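
For reference, a minimal sketch of that cell-based workflow, assuming the classic `gym` API where `step` returns a 4-tuple (`CartPole-v1` is just an example environment):

```python
# %%  <- VSCode Python Interactive cell marker
import gym

# Hypothetical smoke test: run one episode of CartPole with random actions.
env = gym.make("CartPole-v1")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()          # random policy
    obs, reward, done, info = env.step(action)  # classic 4-tuple gym API
    episode_return += reward
print("episode return:", episode_return)
env.close()
```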

Algorithm Comparison

Baseline Algorithms

| Author & Year | Algorithm | Type | On/Off-policy | Trust Region | Experience Replay | Importance Sampling |
|---|---|---|---|---|---|---|
| Mnih et al., 2015 | DQN | Q-Learning | Off-policy | None | Uniform | No |
| Hasselt et al., 2016 | DDQN | Q-Learning | Off-policy | None | Uniform | No |
| Wang et al., 2016 | Dueling-DQN | Q-Learning | Off-policy | None | Uniform | No |
| Hausknecht et al., 2015 | DRQN | Q-Learning | Off-policy | None | Uniform | No |
| Bellemare et al., 2017 | C51 | Q-Learning | Off-policy | None | Uniform | No |
| Hessel et al., 2018 | Rainbow-DQN | Q-Learning | Off-policy | None | Prioritized | No |
| Schulman et al., 2015 | TRPO | Policy Gradient | On-policy | Policy | None | No |
| Schulman et al., 2017 | PPO | Policy Gradient | On-policy | Policy | None | No |
| Mnih et al., 2016 | A3C | Actor-Critic | On-policy | None | None | No |
| Babaeizadeh et al., 2017 | GA3C | Actor-Critic | On-policy | None | None | No |
| (None) | A2C | Actor-Critic | On-policy | None | None | No |
| Clemente et al., 2017 | PAAC | Actor-Critic | On-policy | None | None | No |
| Wang et al., 2017 | ACER | Actor-Critic | Off-policy | Policy | Uniform | Yes |
| Wu et al., 2017 | ACKTR | Actor-Critic | On-policy | Both | None | No |
| Lillicrap et al., 2016 | DDPG | Actor-Critic | Off-policy | None | Uniform | No |
| Fujimoto et al., 2018 | TD3 | Actor-Critic | Off-policy | None | Uniform | No |
| Haarnoja et al., 2019 | SAC | Actor-Critic | Off-policy | None | Uniform | No |
| Ciosek et al., 2019 | OAC | Actor-Critic | Off-policy | None | Uniform | No |
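
The Experience Replay column refers to how the off-policy entries above reuse past transitions. A minimal sketch of a uniform replay buffer (the `ReplayBuffer` name and interface are illustrative, not taken from any particular codebase):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store (s, a, r, s', done), sample minibatches uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # uniform, without replacement
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```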

Accessories

| Author & Year | Algorithm | Enhancement | Requirements |
|---|---|---|---|
| Schaul et al., 2016 | PER | Sample Efficiency | Replay Buffer |
| Andrychowicz et al., 2017 | HER | Sample Efficiency | Similar Goals |
| Fortunato et al., 2018 | NoisyNet | Exploration | Neural Networks |
| Pathak et al., 2017 | Curiosity / ICM | Exploration | Predictable States |
| Burda et al., 2019 | RND | Exploration | Countable States |

DRL Baselines

Some common and important baselines used for benchmarking.

Value Based Methods

DQN (Deep Q-Network)
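
A minimal numpy sketch of the one-step TD target that DQN regresses its Q-network towards (variable names are illustrative):

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """DQN target: y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminals.
    next_q_values: (batch, n_actions) output of the *target* network on s'."""
    return rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)
```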

DDQN (Double Deep Q-Network)
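
The change from DQN is only in the target: the online network selects the next action and the target network evaluates it, which reduces overestimation. A sketch under the same assumptions as above:

```python
import numpy as np

def ddqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN: a* = argmax_a' Q_online(s', a'), then y = r + gamma * Q_target(s', a*)."""
    best_actions = next_q_online.argmax(axis=1)                               # selection (online net)
    evaluated_q = next_q_target[np.arange(len(best_actions)), best_actions]   # evaluation (target net)
    return rewards + gamma * (1.0 - dones) * evaluated_q
```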

Dueling-DQN
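
Dueling-DQN changes the network head rather than the target: separate value and advantage streams are recombined into Q-values. A numpy sketch of the standard mean-subtracted aggregation:

```python
import numpy as np

def dueling_aggregate(value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    value: (batch, 1); advantages: (batch, n_actions).
    Subtracting the mean advantage keeps V and A identifiable."""
    return value + advantages - advantages.mean(axis=1, keepdims=True)
```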

DRQN (Deep Recurrent Q-Network)

C51 / Categorical DQN / Distributional DQN
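
C51 predicts a categorical distribution over returns on a fixed support of atoms instead of a scalar Q-value; greedy action selection still uses the distribution's expectation. A sketch of that readout (the distributional projection used for training is omitted):

```python
import numpy as np

def c51_q_values(probabilities, v_min=-10.0, v_max=10.0, n_atoms=51):
    """probabilities: (batch, n_actions, n_atoms) softmax outputs per action.
    Q(s, a) is the expected return under the predicted categorical distribution."""
    atoms = np.linspace(v_min, v_max, n_atoms)   # fixed support z_1 ... z_51
    return (probabilities * atoms).sum(axis=-1)
```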

Rainbow DQN

  • Hessel et al., 2018, Rainbow: Combining Improvements in Deep Reinforcement Learning, AAAI 2018

Policy Gradient Methods

A good blog post by Lilian Weng helped me a lot when reading PG papers.

TRPO (Trust Region Policy Optimization)

PPO (Proximal Policy Optimization) and TRPO+
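
PPO replaces TRPO's hard KL constraint with a clipped surrogate objective. A numpy sketch of that objective (to be maximized; names are illustrative):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """L = mean( min(r * A, clip(r, 1 - eps, 1 + eps) * A) ), with r = pi_new / pi_old."""
    ratio = np.exp(log_prob_new - log_prob_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```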

Actor-Critic Methods

A3C (Asynchronous Advantage Actor-Critic) and GA3C (Hybrid CPU / GPU A3C)

A2C (Synchronous Advantage Actor-Critic)
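
The advantage actor-critic variants above share the same per-batch loss shape: a policy-gradient term weighted by the advantage, a value regression term, and an entropy bonus. A schematic numpy sketch (in a real implementation the advantage is not differentiated through in the policy term):

```python
import numpy as np

def a2c_loss(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
    """Policy loss + value loss - entropy bonus, with advantages A = R - V(s)."""
    advantages = returns - values
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (advantages ** 2).mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```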

ACER (Actor-Critic with Experience Replay)

ACKTR (Actor-Critic using Kronecker-Factored Trust Region)

  • Wu et al., 2017, Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, NIPS 2017

DPG (Deterministic Policy Gradient) and DDPG (Deep Deterministic Policy Gradient)

TD3 (Twin Delayed Deep Deterministic Policy Gradient)
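
TD3 modifies DDPG with three tricks: clipped double Q-learning, delayed actor updates, and target policy smoothing. A sketch of the clipped double-Q critic target (the smoothing noise is assumed to have been added to the target action upstream):

```python
import numpy as np

def td3_targets(rewards, next_q1, next_q2, dones, gamma=0.99):
    """y = r + gamma * min(Q1_target(s', a'), Q2_target(s', a')); the min curbs overestimation."""
    return rewards + gamma * (1.0 - dones) * np.minimum(next_q1, next_q2)
```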

SAC (Soft Actor-Critic)

  • Haarnoja et al., 2018, Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML 2018
  • Haarnoja et al., 2019, Soft Actor-Critic Algorithms and Applications, arXiv preprint
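
SAC keeps twin critics but adds an entropy term to the target, trading off return against policy entropy via the temperature alpha. A sketch of the soft critic target (names are illustrative):

```python
import numpy as np

def sac_targets(rewards, next_q1, next_q2, next_log_prob, dones, gamma=0.99, alpha=0.2):
    """y = r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s')); the -alpha * log_prob term
    is the entropy bonus that keeps the policy stochastic and exploratory."""
    soft_next_q = np.minimum(next_q1, next_q2) - alpha * next_log_prob
    return rewards + gamma * (1.0 - dones) * soft_next_q
```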

OAC (Optimistic Actor-Critic)

Accessories

Methods that can improve existing algorithms.

PER (Prioritized Experience Replay)
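
PER replaces uniform sampling with priorities (typically TD errors) and corrects the induced bias with importance-sampling weights. A numpy sketch of one sampling step (the sum-tree used for efficiency in practice is omitted):

```python
import numpy as np

def per_sample(priorities, batch_size=32, alpha=0.6, beta=0.4):
    """P(i) proportional to priority_i^alpha; IS weights w_i = (N * P(i))^(-beta), normalized."""
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities ** alpha
    probs = probs / probs.sum()
    indices = np.random.choice(len(priorities), size=batch_size, p=probs)
    weights = (len(priorities) * probs[indices]) ** (-beta)
    return indices, weights / weights.max()
```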

HER (Hindsight Experience Replay)

NoisyNet

Curiosity / Intrinsic Curiosity Module (ICM)
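
ICM derives an intrinsic reward from the prediction error of a learned forward model in a learned feature space; transitions the model cannot predict yet yield larger exploration bonuses. A sketch of the reward computation alone (feature extraction and model training are omitted):

```python
import numpy as np

def icm_intrinsic_reward(predicted_next_features, next_features, scale=0.01):
    """Intrinsic reward = scaled squared error of the forward model in feature space."""
    return scale * 0.5 * np.sum((predicted_next_features - next_features) ** 2, axis=-1)
```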

RND (Random Network Distillation)