Three aspects of Deep RL: noise, overestimation and exploration

We touch on various sides of noise in Deep Reinforcement Learning models. Part 1 discusses overestimation, the harmful property resulting from noise. Part 2 deals with noise used for exploration; this is the useful noise. In the appendix, we look at one more example of noise: adaptive noise.

Part 1. We will see how researchers tried to overcome overestimation in models. The first step is decoupling action selection from action evaluation; it was realized in Double DQN. The second step relates to the Actor-Critic architecture: here we decouple the value neural network (the critic) from the policy neural network (the actor). DDPG and TD3 use this architecture.

Part 2. Exploration as a major challenge of learning. The main issue is exploration noise. We relate this to the models DQN, Double DQN, DDPG, and TD3. Neural network models that use noise parameters have more capabilities for exploration and are more successful in Deep RL algorithms.

Appendix. We consider Hill-Climbing, a simple gradient-free algorithm. This algorithm adds adaptive noise directly to the input variables, namely to the weight matrix determining the neural network.

Part 1. In efforts to overcome overestimation

The DQN and Double DQN algorithms turned out to be very successful for discrete action spaces. However, these algorithms are known to suffer from overestimation. This harmful property is much worse than underestimation, because underestimation does not accumulate through the max operator, while overestimation does. Let us see how researchers tried to overcome overestimation.

Overestimation in DQN.

The problem lies in the maximization operator used for the calculation of the target value `Gt`. Suppose the estimate `Q(S_{t+1}, a)` is already overestimated. Then, from the DQN key equations (see below), the error also accumulates in `Q(S_t, a)`.

Here, `Rt` is the reward at time `t`; `Gt` is the cumulative reward, also known as the TD-target; `Q(s, a)` is the Q-value table of shape `[states x actions]`.

Thrun and Schwartz in “Issues in Using Function Approximation for Reinforcement Learning” (1993) observed that using function approximators (i.e., neural networks) instead of just lookup tables (the basic technique of Q-learning) causes some noise on the output predictions. They gave an example in which the overestimation asymptotically leads to suboptimal policies.
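The effect Thrun and Schwartz described can be seen in a few lines of NumPy (a toy illustration of mine, not from the paper): even zero-mean noise on the Q-estimates becomes a positive bias once we take the max over actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 actions whose true Q-values are all 0; the estimates carry zero-mean noise
true_q = np.zeros(5)
noisy_q = true_q + rng.normal(0.0, 1.0, size=(10_000, 5))

# the max operator turns unbiased noise into a positive (over)estimate
bias = noisy_q.max(axis=1).mean()
print(bias > 0.0)  # True: E[max_a Q~(s, a)] > max_a Q(s, a) = 0
```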

Decoupling in Double DQN.

In 2015, van Hasselt et al. in “Deep Reinforcement Learning with Double Q-learning” showed that estimation errors can drive the estimates up and away from the true optimal values. They proposed a solution that reduces the overestimation: Double DQN.

The important thing that has been done in Double DQN is decoupling of the action selection from action evaluation. Let us make this clear.

• `Gt` formula for DQN: the Q-value used for the action selection and the Q-value used for the action evaluation are determined by the same neural network with weight vector `θ_t`.
• `Gt` formula for Double DQN: the Q-value used for the action selection and the Q-value used for the action evaluation are determined by two different neural networks, with weight vectors `θ_t` and `θ'_t`. These networks are called the current and target networks.
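To make the decoupling concrete, here is a small NumPy sketch of both targets for a single transition (the Q-values are toy numbers of my own, not from the paper):

```python
import numpy as np

gamma, reward = 0.99, 1.0
q_current = np.array([1.0, 2.5, 2.0])  # Q(S_{t+1}, ·) from the current network (θ_t)
q_target  = np.array([1.2, 1.8, 2.2])  # Q(S_{t+1}, ·) from the target network (θ'_t)

# DQN: one and the same network both selects and evaluates the action
g_dqn = reward + gamma * q_current.max()

# Double DQN: the current network selects, the target network evaluates
a_star = int(q_current.argmax())
g_double = reward + gamma * q_target[a_star]

print(g_dqn, g_double)  # Double DQN's target is not tied to the largest estimate
```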

However, due to the slowly changing policy, the value estimates of the current and target networks are still too similar, and this still causes a consistent overestimation.

Actor-Critic architecture in DDPG.

DDPG is one of the first algorithms that tried to use the Q-learning technique of DQN models for continuous action spaces. DDPG stands for Deep Deterministic Policy Gradient. In this case, we cannot apply the maximization operator to the Q-values over all actions; however, we can use a function approximator, a neural network representing the Q-values. We presume that there exists a certain function `Q(s, a)` which is differentiable with respect to the action argument `a`. However, finding `argmax(Q(S_t, a))` over all actions `a` for the given state `S_t` means that we must solve an optimization task at every time step. This is very expensive. To overcome this obstacle, a group of researchers from DeepMind, in the work “Continuous control with deep reinforcement learning”, used the Actor-Critic architecture. They used two neural networks: one, as before in DQN, is a Q-network representing the Q-values; the other is the actor function `𝜋(s)`, which provides `a*`, the action maximizing the value function `Q(s, a)`, as follows:
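A minimal sketch of the idea (my own toy critic, not the DeepMind implementation): if `Q(s, a)` is differentiable in `a`, the actor can be trained by gradient ascent on `Q(s, 𝜋(s))` instead of solving `argmax` at every step.

```python
import numpy as np

# Toy illustration (mine, not the DDPG code): a critic Q(s, a) = -(a - s)^2,
# differentiable in a, whose maximum over a is at a* = s; and a linear actor
# pi(s) = w * s trained by gradient ascent on Q(s, pi(s)).
def dq_da(s, a):
    return -2.0 * (a - s)        # dQ/da for Q(s, a) = -(a - s)^2

w, lr = 0.0, 0.05
states = np.linspace(-1.0, 1.0, 21)

for _ in range(500):
    for s in states:
        a = w * s                       # the actor's deterministic action
        w += lr * dq_da(s, a) * s       # chain rule: dQ/dw = dQ/da * da/dw

print(abs(w - 1.0) < 1e-6)  # True: the actor converged to argmax_a Q(s, a) = s
```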

Part 2. Exploration as a major challenge of learning

Why explore?

In addition to overestimation, there is another problem in Deep RL, no less difficult: exploration. We cannot unconditionally believe the maximum values of the Q-table or the value of `a* = 𝜋(s)`. Why not? Firstly, at the beginning of training, the corresponding neural network is still “young and stupid”, and its maximum values are far from reality. Secondly, it is perhaps not the maximum values that will lead us to the optimal strategy after hard training.

In life, we often have to solve the following problem: follow the beaten path, with little risk and little reward; or take a new, unknown path with great risk, where, with some probability, a big win is possible. Maybe it will turn out great; you don't know.

Exploration vs. exploitation

Exploitation means that the agent uses the accumulated knowledge to select the next action. In our case, this means that, for the given state, the agent finds the next action maximizing the Q-value. Exploration means that the next action will be selected randomly.

There is no rule determining which strategy is better: exploration or exploitation. The real goal is to find a true balance between these two strategies. As we will see, the balance changes during the learning process.

Exploration in DQN and Double DQN

One way to ensure adequate exploration in DQN and Double DQN is the annealing `ε`-greedy mechanism. For the first episodes, exploitation is selected with a small probability, for example `0.02` (i.e., actions are chosen almost entirely at random), and exploration is selected with probability `0.98`. Starting from a certain episode number `Mε`, exploration is performed with a minimal probability `ε_m`, for example `ε_m = 0.01`, and exploitation is chosen with probability `0.99`. The exploration probability `ε` can be realized as follows:

where `i` is the episode number. Let `Mε = 100`, `ε_m = 0.01`. Then the probability `ε` of exploration looks as follows:
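One plausible realization of such an annealing schedule (a sketch of mine: linear decay from `1.0` down to `ε_m` at episode `Mε`, then constant) is:

```python
def epsilon(i, m_eps=100, eps_min=0.01):
    """Exploration probability for episode i: decays linearly from 1.0
    to eps_min at episode m_eps, then stays at eps_min."""
    return max(eps_min, 1.0 - i * (1.0 - eps_min) / m_eps)

print(epsilon(0), epsilon(50), epsilon(150))
```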

Exploration in DDPG

In RL models with continuous action spaces, undirected exploration is applied instead of the `ε`-greedy mechanism. This method is used in DDPG, PPO, and other continuous control algorithms. The authors of DDPG (Lillicrap et al., 2015) constructed an undirected exploration policy `𝜋'` by adding noise sampled from a noise process `N` to the actor policy `𝜋(s)`:

where `N` is the noise given by the Ornstein-Uhlenbeck process, a correlated noise process. In the TD3 paper, the authors (Fujimoto et al., 2018) proposed using classic Gaussian noise instead; here is the quote:

…we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action. Unlike the original implementation of DDPG, we used uncorrelated noise for exploration as we found noise drawn from the Ornstein-Uhlenbeck (Uhlenbeck & Ornstein, 1930) process offered no performance benefits.
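For illustration, here is a minimal Ornstein-Uhlenbeck process next to TD3-style uncorrelated Gaussian noise (a sketch with typical parameter values, not the papers' exact code); the OU samples are visibly correlated in time, the Gaussian ones are not:

```python
import numpy as np

class OUNoise:
    """Minimal Ornstein-Uhlenbeck process: a mean-reverting random walk."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(size, mu)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1)
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * self.rng.normal(size=self.state.shape)
        self.state = self.state + dx
        return self.state

ou = OUNoise(size=1)
ou_samples = np.array([ou.sample()[0] for _ in range(1000)])

rng = np.random.default_rng(0)
gauss_samples = rng.normal(0.0, 0.1, size=1000)  # TD3-style N(0, 0.1) noise

# OU samples are correlated in time; Gaussian samples are not
print(np.corrcoef(ou_samples[:-1], ou_samples[1:])[0, 1] > 0.5)           # True
print(abs(np.corrcoef(gauss_samples[:-1], gauss_samples[1:])[0, 1]) < 0.2)  # True
```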

A common failure mode for DDPG is that the learned Q-function begins to overestimate Q-values; the policy (actor function) then exploits these errors, which leads to significant mistakes.

Exploration in TD3

The name TD3 stands for Twin Delayed Deep Deterministic policy gradient. TD3 retains the Actor-Critic architecture used in DDPG and adds three new properties that greatly help to overcome overestimation:

• TD3 maintains a pair of critics, Q1 and Q2 (hence the name “twin”), along with a single actor. For each time step, TD3 uses the smaller of the two Q-values.
• TD3 updates the policy (and target networks) less frequently than the Q-function: one policy (actor) update for every two Q-function (critic) updates.
• TD3 adds noise to the target action. TD3 uses Gaussian noise, not the Ornstein-Uhlenbeck noise used in DDPG.
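The three properties can be sketched in a few lines (toy critics and numbers of my own, not the TD3 implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, reward = 0.99, 1.0

# hypothetical critics (toy quadratic functions, just for illustration)
def q1(a): return 2.0 - (a - 0.4) ** 2
def q2(a): return 1.8 - (a - 0.6) ** 2

# (3) clipped Gaussian noise added to the target action
next_action = 0.5
noise = np.clip(rng.normal(0.0, 0.2), -0.5, 0.5)
a_smooth = np.clip(next_action + noise, -1.0, 1.0)

# (1) twin critics: the target uses the smaller of the two Q-values
target = reward + gamma * min(q1(a_smooth), q2(a_smooth))

# (2) delayed updates: in real code the actor update runs only
#     every second critic update (e.g. `if step % 2 == 0: update_actor()`)

print(target <= reward + gamma * q1(a_smooth))  # True: the min curbs overestimation
```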

Exploration noise in trials with PyBullet Hopper

PyBullet is a Python module for robotics and Deep RL based on the Bullet Physics SDK. Let us look at HopperBulletEnv, one of the PyBullet environments associated with articulated bodies:

The HopperBulletEnv environment is considered solved if the achieved score exceeds 2500. In TD3 trials with the HopperBulletEnv environment, I got, among others, the following results for `std = 0.1` and `std = 0.3`:

Here, `std` is the standard deviation of the exploration noise in TD3. In both trials, the threshold of 2500 was not reached. However, I noticed the following oddities.

• In the trial with `std = 0.3`, there are many values near 2500 (though below 2500), and at the same time, the average value decreases all the time. This is explained as follows: the number of small values prevails over the number of large values, and the difference between these numbers increases.
• In the trial with `std = 0.1`, the averages reach large values, but in general the values decrease. The reason, again, is that the number of small values prevails over the number of large values.
• It seemed to me that the prevalence of very small values was associated with a too-large noise standard deviation. I then decided to reduce `std` to `0.02`; this was enough to solve the environment.

App. Hill-Climbing algorithm with adaptive noise

Forerunner of tensors

We illustrate the properties of the Hill-Climbing algorithm applied to the Cartpole environment. The neural network model here is so simple that it does not use tensors (no PyTorch, no TensorFlow); the neural network uses only a simple matrix of shape `[4 x 2]`, the forerunner of tensors.
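The whole “network” is one matrix product (a sketch under the standard Cartpole setup of 4 observation values and 2 actions; the softmax-then-argmax policy is my assumption about the usual implementation):

```python
import numpy as np

w = np.random.default_rng(0).standard_normal((4, 2)) * 1e-4  # the [4 x 2] "network"

def act(state, w):
    """Forward pass: one matrix product, a softmax, and a deterministic argmax."""
    x = state @ w
    probs = np.exp(x) / np.sum(np.exp(x))
    return int(np.argmax(probs))

state = np.array([0.01, -0.02, 0.03, 0.04])  # a hypothetical Cartpole observation
action = act(state, w)
print(action in (0, 1))  # True: one of Cartpole's two discrete actions
```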

The Hill-Climbing algorithm seeks to maximize a target function `Go`, which in our particular case is the cumulative discounted reward:

where `γ` is the discount factor, `0 < γ < 1`, and `Rk` is the reward obtained at time step `k` of the episode. The target function `Go` looks in Python as follows:

```
discounts = [gamma**k for k in range(len(rewards))]  # [1, gamma, gamma**2, ...]
Go = sum(a * b for a, b in zip(discounts, rewards))  # sum of gamma**k * R_k
```

As always in Deep RL, we try to cross a certain threshold. For Cartpole-v0 this threshold score is `195`, and for Cartpole-v1 it is `475`. Hill-Climbing is a simple gradient-free algorithm (i.e., it uses neither gradient ascent nor gradient descent). We try to climb to the top of the curve by changing only the arguments of the target function `Go`, using a certain adaptive noise. However, what is the argument of our target function?

The argument of `Go` is the weight matrix determining the neural network that underlies our model. Examples of the weight matrix for episodes 0–5 are presented here:

The adaptive noise scaling in our model is realized as follows. If the current value of the target function is better than the best value obtained so far, we divide the noise scale by `2` and add this noise to the weight matrix. If the current value is worse than the best obtained value, we multiply the noise scale by `2` and add this noise to the best obtained weight matrix. In both cases, the noise scale is applied with a random factor that differs for each element of the matrix.
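The scheme above can be sketched as follows (with a toy quadratic function of my own standing in for the Cartpole score, rather than the real environment):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for the Cartpole return: Go is maximal (= 0) at w = w_opt
w_opt = np.array([[0.1, -0.2], [0.3, 0.4], [-0.1, 0.2], [0.0, 0.1]])

def Go(w):
    return -np.sum((w - w_opt) ** 2)

best_w = np.zeros((4, 2))          # weights initialized to zeros
best_Go = Go(best_w)
noise_scale = 1e-2

for episode in range(200):
    # perturb the best weights; the scale enters with a random factor per element
    w = best_w + noise_scale * rng.standard_normal(best_w.shape)
    if Go(w) > best_Go:            # better: keep w and halve the noise scale
        best_w, best_Go = w, Go(w)
        noise_scale = max(noise_scale / 2, 1e-3)
    else:                          # worse: double the noise scale, keep best_w
        noise_scale = min(noise_scale * 2, 2.0)

print(best_Go >= Go(np.zeros((4, 2))))  # True: the best value never gets worse
```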

For Cartpole-v1, if the weight matrix is initialized to small non-zero values (see the top-left matrix above), the number of episodes is `112`. Note that if the weight matrix is initialized to zeros, the number of episodes increases from `112` to `168`. The same holds for Cartpole-v0.

A more generic formula for the noise scale

As we saw above, the noise scale adaptively increases or decreases depending on whether the target function is lower or higher than the best obtained value. The scaling factor in that algorithm is `2`. In the paper “Parameter Space Noise for Exploration”, the authors consider a more generic formula:

where `α` is the noise scale, `d` is a certain distance measure between the perturbed and non-perturbed policy, and `δ` is a threshold value. In Appendix C, the authors consider possible forms of the distance function `d` for the algorithms DQN, DDPG, and TRPO.
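In code, the generic rule looks roughly like this (a sketch; `k` is the scaling factor, and the Hill-Climbing scheme above corresponds roughly to `k = 2`):

```python
def adapt_noise_scale(alpha, d, delta, k=1.01):
    """Adaptive noise scale: grow alpha while the perturbed policy stays
    closer than delta to the unperturbed one, shrink it otherwise."""
    return alpha * k if d < delta else alpha / k

grown = adapt_noise_scale(0.1, d=0.01, delta=0.05)   # policies too similar: grow
shrunk = adapt_noise_scale(0.1, d=0.10, delta=0.05)  # too different: shrink
print(grown > 0.1 > shrunk)  # True
```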

References

[1] S.Thrun and A.Schwartz, Issues in Using Function Approximation for Reinforcement Learning, (1993), Carnegie Mellon University, The Robotics Institute

[2] H. van Hasselt et al., Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461

[3] T. P. Lillicrap et al., Continuous control with deep reinforcement learning (2015), arXiv:1509.02971

[4] Yuxi Li, Deep Reinforcement Learning: An Overview (2018), arXiv:1701.07274v6

[5] S. Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods (2018), arXiv:1802.09477v3

[6] Better Exploration with Parameter Noise, OpenAI.com, https://openai.com/blog/better-exploration-with-parameter-noise/

[7] M. Plappert et al., Parameter Space Noise for Exploration (2018), OpenAI, arXiv:1706.01905v2, ICLR 2018

[8] B.Mahyavanshi, Introduction to Hill Climbing | Artificial Intelligence, Medium, 2019

[9] Deep Deterministic Policy Gradient, OpenAI, Spinning Up, https://spinningup.openai.com/en/latest/algorithms/ddpg.html

[10] What Does Stochastic Mean in Machine Learning? (2019), Machine Learning Mastery,
https://machinelearningmastery.com/stochastic-in-machine-learning/

[11] C. Colas et. al., GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithm (2018), arXiv:1802.05054

[12] Ornstein–Uhlenbeck process, Wikipedia, https://en.wikipedia.org/wiki/Ornstein–Uhlenbeck_process

[13] E.Lindwurm, Intuition: Exploration vs Exploitation (2019), TowardsDataScience

[14] M.Watts, Introduction to Reinforcement Learning (DDPG and TD3) for News Recommendation (2019), TowardsDataScience

[15] T.Stafford, Fundamentals of learning: the exploration-exploitation trade-off (2012), https://tomstafford.staff.shef.ac.uk/?p=48

[16] Bullet Real-Time Physics Simulation (2020), https://pybullet.org/wordpress/

[17] R.Stekolshchik, A pair of interrelated neural networks in DQN (2020), TowardsDataScience

[18] R.Stekolshchik, How does the Bellman equation work in Deep RL? (2020), TowardsDataScience
