Three aspects of Deep RL: noise, overestimation and exploration

We touch on various aspects of noise in Deep Reinforcement Learning models. Part 1 discusses overestimation, the harmful property that results from noise. Part 2 deals with noise used for exploration, the useful kind of noise. In the appendix, we look at one more example of noise: adaptive noise.

Part 1. We will see how researchers have tried to overcome overestimation in their models. The first step is decoupling action selection from action evaluation; this was realized in Double DQN. The second step relates to the Actor-Critic architecture: here we decouple the value neural network (critic) from the policy neural network (actor). DDPG and TD3 use this architecture.

Part 2. Exploration is a major challenge of learning, and the main issue is exploration noise. We relate this to the models DQN, Double DQN, DDPG and TD3. Models that inject noise into their parameters or actions have greater capacity for exploration and are more successful in Deep RL.

Appendix. We consider Hill-Climbing, a simple gradient-free algorithm. This algorithm adds adaptive noise directly to the input variables, namely to the weight matrix that determines the neural network.

Part 1. In efforts to overcome overestimation

The DQN and Double DQN algorithms turned out to be very successful for discrete action spaces. However, it is known that these algorithms suffer from overestimation. This harmful property is much worse than underestimation, because overestimation errors accumulate while underestimation errors do not. Let us see how researchers have tried to overcome overestimation.

Overestimation in DQN.

The problem lies in the maximization operator used to calculate the target value G_t. Suppose the estimated value Q(S_{t+1}, a) is already overestimated. Then, from the DQN key equations (see below), the agent observes that the error also accumulates in Q(S_t, a).

DQN key equations (target value and Q-value update):

G_t = R_t + γ · max_a Q(S_{t+1}, a)

Q(S_t, A_t) ← Q(S_t, A_t) + α · [G_t − Q(S_t, A_t)]

Here, R_t is the reward at time t; G_t is the cumulative reward, also known as the TD target; Q(s, a) is the Q-value table of shape [state space × action space]; γ is the discount factor and α is the learning rate.

Thrun and Schwartz, in “Issues in Using Function Approximation for Reinforcement Learning” (1993), observed that using function approximators (i.e., neural networks) instead of simple lookup tables (the basic technique of Q-learning) introduces some noise into the output predictions. They gave an example in which this overestimation asymptotically leads to suboptimal policies.

Decoupling in Double DQN.

In 2015, van Hasselt et al., in “Deep Reinforcement Learning with Double Q-learning”, showed that estimation errors can drive the estimates up and away from the true optimal values. They proposed a solution that reduces the overestimation: Double DQN.

The important change made in Double DQN is the decoupling of action selection from action evaluation. Let us make this clear.

G_t formulas for DQN and Double DQN (see the code sketch after this list):
  • G_t formula for DQN: G_t = R_t + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t). The Q-value used for the action selection (the inner argmax term) and the Q-value used for the action evaluation (the outer term) are determined by the same neural network with weight vector θ_t.
  • G_t formula for Double DQN: G_t = R_t + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ'_t). The Q-value used for the action selection and the Q-value used for the action evaluation are determined by two different neural networks with weight vectors θ_t and θ'_t. These networks are called the current and target networks.
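
A minimal sketch of the two target computations in PyTorch-style code (the network names q_net and q_target_net are illustrative, and terminal-state handling is omitted for brevity):

import torch

def dqn_target(reward, next_state, q_net, gamma=0.99):
    # DQN: the same network both selects and evaluates the next action.
    # reward: tensor of shape [batch]
    q_next = q_net(next_state)                           # shape [batch, n_actions]
    return reward + gamma * q_next.max(dim=1).values

def double_dqn_target(reward, next_state, q_net, q_target_net, gamma=0.99):
    # Double DQN: the current network selects the action,
    # the target network evaluates it.
    best_action = q_net(next_state).argmax(dim=1, keepdim=True)
    q_eval = q_target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * q_eval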

However, due to the slowly changing policy, the value estimates of the current and target neural networks are still too similar, and this still causes a consistent overestimation.

Actor-Critic architecture in DDPG.

DDPG was one of the first algorithms that tried to use the Q-learning technique of DQN models for continuous action spaces. DDPG stands for Deep Deterministic Policy Gradient. In this case, we cannot apply the maximization operator over Q-values for all actions; however, we can still use a function approximator, a neural network representing the Q-values. We presume that there exists a certain function Q(s, a) that is differentiable with respect to the action argument a. However, finding argmax_a Q(S_t, a) over all actions a for a given state S_t means that we must solve an optimization task at every time step. This is very expensive. To overcome this obstacle, a group of researchers from DeepMind, in the work “Continuous control with deep reinforcement learning”, used the Actor-Critic architecture. They used two neural networks: one, as before in DQN, is a Q-network representing the Q-values; the other is the actor function 𝜋(s), which provides a*, the action that maximizes the value function Q(s, a), as follows:

Actor function: 𝜋(s) = argmax_a Q(s, a)
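
A minimal PyTorch-style sketch of how the actor is trained toward this argmax, assuming hypothetical actor and critic networks (an illustration, not the original DeepMind implementation):

import torch

def ddpg_actor_update(actor, critic, actor_optimizer, states):
    # The actor is trained so that pi(s) approximates argmax_a Q(s, a):
    # we ascend the critic's value at the actor's own action,
    # i.e. minimize -Q(s, pi(s)).
    actor_loss = -critic(states, actor(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()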

Part 2. Exploration as a major challenge of learning

Why explore?

In addition to overestimation, there is another problem in Deep RL that is no less difficult: exploration. We cannot unconditionally believe in the maximum values of the Q-table or in the value a* = 𝜋(s). Why not? Firstly, at the beginning of training, the corresponding neural network is still “young and stupid”, and its maximum values are far from reality. Secondly, perhaps it is not the maximum values that will lead us to the optimal strategy after hard training.

In life, we often have to solve the following problem: follow the beaten path, with little risk and little reward; or take a new, unknown path with great risk, where, with some probability, a big win is possible. Maybe it will be just super; you never know.

Exploration vs. exploitation

Exploitation means that the agent uses the accumulated knowledge to select the next action. In our case, this means that, for the given state, the agent finds the next action that maximizes the Q-value. Exploration means that the next action is selected randomly.

There is no rule that determines which strategy is better: exploration or exploitation. The real goal is to find the right balance between these two strategies. As we will see, the balance strategy changes during the learning process.

Exploration in DQN and Double DQN

One way to ensure adequate exploration in DQN and Double DQN is to use the annealing ε-greedy mechanism. For the first episodes, exploitation is selected with a small probability, for example 0.02 (i.e., the action is chosen almost at random), and exploration is selected with probability 0.98. Starting from a certain episode number M_ε, exploration is performed with a minimal probability ε_m, for example ε_m = 0.01, and exploitation is chosen with probability 0.99. The probability of exploration ε can be realized by the following formula:

Annealing ε-greedy mechanism, probability of exploration: ε_i = max(ε_m, 1 − i/M_ε)

where i is the episode number. Let M_ε = 100 and ε_m = 0.01. Then the probability of exploration ε looks as follows:

Gradual decrease in probability from 1 to ε_m = 0.01
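
A small Python sketch of this mechanism, assuming the linear schedule above (the function names are illustrative):

import random
import numpy as np

def epsilon_by_episode(i, eps_min=0.01, m_eps=100):
    # linear annealing of the exploration probability from 1.0 down to eps_min
    return max(eps_min, 1.0 - i / m_eps)

def select_action(q_values, episode):
    # q_values: array of Q(s, a) for all actions in the current state
    eps = epsilon_by_episode(episode)
    if random.random() < eps:
        return random.randrange(len(q_values))   # exploration: random action
    return int(np.argmax(q_values))              # exploitation: greedy action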

Exploration in DDPG

In RL models with continuous action spaces, undirected exploration is applied instead of the ε-greedy mechanism. This method is used in DDPG, PPO and other continuous control algorithms. The authors of DDPG (Lillicrap et al., 2015) constructed an undirected exploration policy 𝜋' by adding noise sampled from a noise process N to the actor policy 𝜋(s):

Policy with exploration noise: 𝜋'(s_t) = 𝜋(s_t) + N

where N is the noise given by the Ornstein-Uhlenbeck correlated noise process. In the TD3 paper, the authors (Fujimoto et al., 2018) proposed to use classic Gaussian noise instead; this is the quote:

…we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action. Unlike the original implementation of DDPG, we used uncorrelated noise for exploration as we found noise drawn from the Ornstein-Uhlenbeck (Uhlenbeck & Ornstein, 1930) process offered no performance benefits.
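
A sketch of this kind of undirected exploration in Python, assuming an actor that returns actions in [-1, 1] (the names and bounds are illustrative):

import numpy as np

def noisy_action(actor, state, std=0.1, low=-1.0, high=1.0):
    # undirected exploration: add zero-mean Gaussian noise to the
    # deterministic action and clip it to the valid action range
    action = actor(state)                                 # deterministic pi(s)
    action = action + np.random.normal(0.0, std, size=np.shape(action))
    return np.clip(action, low, high)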

A common failure mode of DDPG is that the learned Q-function begins to overestimate Q-values; the policy (actor function) then exploits these errors and makes significant mistakes.

Exploration in TD3

The name TD3 stands for Twin Delayed Deep Deterministic policy gradient. TD3 retains the Actor-Critic architecture used in DDPG and adds three new properties that greatly help to overcome overestimation (see the sketch after this list):

  • TD3 maintains a pair of critics, Q1 and Q2 (hence the name “twin”), along with a single actor. For each time step, TD3 uses the smaller of the two Q-values to form the target.
  • TD3 updates the policy (and target networks) less frequently than the Q-function: one policy (actor) update for every two Q-function (critic) updates.
  • TD3 adds exploration noise to the target action. TD3 uses Gaussian noise, not the Ornstein-Uhlenbeck noise used in DDPG.
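
A minimal PyTorch-style sketch of how these three ideas appear in the TD3 target computation (the network names are illustrative, and terminal-state handling is omitted):

import torch

def td3_target(reward, next_state, actor_target, q1_target, q2_target,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    # reward is assumed to have shape [batch, 1], matching the critic outputs.
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    next_action = actor_target(next_state)
    noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
    next_action = (next_action + noise).clamp(-max_action, max_action)
    # Clipped double-Q: take the smaller of the two target critic values.
    q_next = torch.min(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    return reward + gamma * q_next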

Exploration noise in trials with PyBullet Hopper

PyBullet is a Python module for robotics and Deep RL based on the Bullet Physics SDK. Let us look at HopperBulletEnv, one of the PyBullet environments associated with articulated bodies:

Trained agent for HopperBulletEnv

The HopperBulletEnv environment is considered solved if the achieved score exceeds 2500. In TD3 trials with the HopperBulletEnv environment, I got, among others, the following results for std = 0.1 and std = 0.3:

Two trials for HopperBulletEnv with TD3, std of noise= 0.1 and 0.3

Here, std is the standard deviation of exploration noise in TD3. In both trials, threshold 2500 was not reached. However, I noticed the following oddities.

  • In the trial with std = 0.3, there are a lot of values near 2500 (though below 2500), and at the same time the average value decreases all the time. This is explained as follows: the number of small values prevails over the number of large values, and the difference between these numbers increases.
  • In the trial with std = 0.1, the average values reach large values but, in general, the values decrease. The reason, again, is that the number of small values prevails over the number of large values.
  • It seemed to me that the prevalence of very small values was associated with too large a noise standard deviation. I then decided to reduce std to 0.02, and this was enough to solve the environment.

HopperBulletEnv with TD3, std of noise = 0.02

Appendix. Hill-Climbing algorithm with adaptive noise

Forerunner of tensors

We illustrate the properties of the Hill-Climbing algorithm applied to the Cartpole environment. The neural network model here is so simple that it does not use tensors (no PyTorch, no TensorFlow); the neural network uses only a simple matrix of shape [4 x 2], the forerunner of tensors.

Class Policy in Hill-Climbing algorithm
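
A sketch of what such a Policy class may look like (a minimal version assuming a 4-dimensional state and 2 discrete actions, as in Cartpole; not necessarily the exact code of the project):

import numpy as np

class Policy:
    def __init__(self, s_size=4, a_size=2):
        # the whole "neural network" is a single weight matrix of shape [4 x 2]
        self.w = 1e-4 * np.random.rand(s_size, a_size)

    def forward(self, state):
        x = np.dot(state, self.w)
        return np.exp(x) / np.sum(np.exp(x))      # softmax over the two actions

    def act(self, state):
        probs = self.forward(state)
        return int(np.argmax(probs))              # deterministic (greedy) action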

The Hill-Climbing algorithm seeks to maximize a target function Go, which in our particular case is the cumulative discounted reward:

Cumulative discounted reward: Go = Σ_k γ^k · R_k

where γ is the discount factor, 0 < γ < 1, and R_k is the reward obtained at time step k of the episode. The target function Go looks in Python as follows:

discounts = [gamma**i for i in range(len(rewards))]   # one discount factor per reward
Go = sum(a * b for a, b in zip(discounts, rewards))   # discounted sum of episode rewards

As always in Deep RL, we try to cross a certain threshold. For Cartpole-v0, this threshold score is 195, and for Cartpole-v1 it is 475. Hill-Climbing is a simple gradient-free algorithm (i.e., it does not use gradient ascent or gradient descent). We try to climb to the top of the curve by changing only the argument of the target function Go using a certain adaptive noise. However, what is the argument of our target function?

The argument of Go is the weight matrix that determines the neural network underlying our model. Examples of the weight matrix for episodes 0–5 are presented here:

Weight vectors [4 x 2] of the neural network for episodes 0–5

Adaptive noise scale

The adaptive noise scaling for our model is realized as follows. If the current value of the target function is better than the best value obtained so far, we divide the noise scale by 2, and this noise is added to the weight matrix. If the current value of the target function is worse than the best obtained value, we multiply the noise scale by 2, and this noise is added to the best obtained weight matrix. In both cases, the noise scale is applied with a random factor that differs for each element of the matrix.
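
A minimal sketch of this adaptive scheme (run_episode is a hypothetical function that plays one episode with the given weight matrix and returns Go; the bounds 1e-3 and 2.0 on the noise scale are illustrative):

import numpy as np

noise_scale = 1e-2                        # initial noise scale
best_R = -np.inf                          # best return (Go) obtained so far
best_w = 1e-4 * np.random.rand(4, 2)      # best weight matrix, shape [4 x 2]
w = best_w.copy()

for episode in range(1000):
    R = run_episode(w)                    # hypothetical rollout returning Go
    if R >= best_R:
        # better episode: keep these weights and halve the noise scale
        best_R, best_w = R, w
        noise_scale = max(1e-3, noise_scale / 2)
    else:
        # worse episode: double the noise scale and restart from the best weights
        noise_scale = min(2.0, noise_scale * 2)
    # perturb element-wise with a fresh random factor for each matrix entry
    w = best_w + noise_scale * np.random.rand(*best_w.shape)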

Noise Scale and Score graphs by episodes

For Cartpole-v1, if the weight matrix is initialized to small non-zero values (see the top-left matrix above), the number of episodes needed is 112. Note that if the weight matrix is initialized to zeros, the number of episodes increases from 112 to 168. A similar effect is observed for Cartpole-v0.

For more information on Cartpole-v0/Cartpole-v1 with adaptive noise scaling, see the project on Github.

A more generic formula for the noise scale

As we saw above, the noise scale adaptively increases or decreases depending on whether the target function is lower or higher than the best obtained value. The scaling factor in this algorithm is 2. In the paper “Parameter Space Noise for Exploration”, the authors consider a more generic formula:

Adaptive noise scale: σ_{k+1} = α · σ_k if d(π, π̃) ≤ δ, and σ_{k+1} = σ_k / α otherwise

where σ is the noise scale, α > 1 is a scaling factor, d is a certain distance measure between the perturbed policy π̃ and the non-perturbed policy π, and δ is a threshold value. In Appendix C, the authors consider possible forms of the distance function d for the algorithms DQN, DDPG and TRPO.
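
In code, the rule above is a one-liner (a sketch; the factor 1.01 is only an example value):

def adapt_noise_scale(sigma, distance, delta, alpha=1.01):
    # grow the perturbation scale while the perturbed policy stays close
    # to the non-perturbed one, shrink it otherwise
    return sigma * alpha if distance <= delta else sigma / alpha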

References

[1] S. Thrun and A. Schwartz, Issues in Using Function Approximation for Reinforcement Learning (1993), Carnegie Mellon University, The Robotics Institute

[2] H. van Hasselt et al., Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461

[3] T. P. Lillicrap et al., Continuous control with deep reinforcement learning (2015), arXiv:1509.02971

[4] Yuxi Li, Deep Reinforcement Learning: An Overview (2018), arXiv:1701.07274v6

[5] S. Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods (2018), arXiv:1802.09477v3

[6] Better Exploration with Parameter Noise, OpenAI.com, https://openai.com/blog/better-exploration-with-parameter-noise/

[7] M. Plappert et al., Parameter Space Noise for Exploration (2018), ICLR 2018, arXiv:1706.01905v2

[8] B. Mahyavanshi, Introduction to Hill Climbing | Artificial Intelligence (2019), Medium

[9] Deep Deterministic Policy Gradient, OpenAI Spinning Up, https://spinningup.openai.com/en/latest/algorithms/ddpg.html

[10] What Does Stochastic Mean in Machine Learning? (2019), Machine Learning Mastery, https://machinelearningmastery.com/stochastic-in-machine-learning/

[11] C. Colas et al., GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms (2018), arXiv:1802.05054

[12] Ornstein–Uhlenbeck process, Wikipedia, https://en.wikipedia.org/wiki/Ornstein–Uhlenbeck_process

[13] E. Lindwurm, Intuition: Exploration vs Exploitation (2019), TowardsDataScience

[14] M. Watts, Introduction to Reinforcement Learning (DDPG and TD3) for News Recommendation (2019), TowardsDataScience

[15] T. Stafford, Fundamentals of learning: the exploration-exploitation trade-off (2012), https://tomstafford.staff.shef.ac.uk/?p=48

[16] Bullet Real-Time Physics Simulation (2020), https://pybullet.org/wordpress/

[17] R. Stekolshchik, A pair of interrelated neural networks in DQN (2020), TowardsDataScience

[18] R. Stekolshchik, How does the Bellman equation work in Deep RL? (2020), TowardsDataScience

Original post: https://towardsdatascience.com/three-aspects-of-deep-rl-noise-overestimation-and-exploration-122ffb4bb92b
