Reinforcement learning stands to become the new gold standard for intelligent video game AI. The chief advantage of reinforcement learning (RL) over traditional game AI methods is that, rather than hand-crafting the AI’s logic with complicated behavior trees, one simply rewards the behavior one wishes the AI to exhibit, and the agent learns on its own the sequence of actions needed to produce it. In essence, this is how one might teach a dog to perform tricks using a food reward.
The RL approach to game AI can be used to train a variety of strategic behaviors, including pathfinding, NPC attack and defense, and almost any behavior a human can exhibit while playing a video game. State-of-the-art implementations include those used to defeat best-in-class human players at chess, Go, and multiplayer strategy video games. There are few limits on what strategic behaviors an RL algorithm can theoretically discover; in practice, however, computational expense and environment complexity constrain the behaviors one would want to implement with RL. It is therefore important to understand the basics of reinforcement learning, as well as its suitability for the behavior you wish to train, before beginning.
Reinforcement learning is based on an agent’s ability to make associations between things happening in its environment. Unlike other forms of learning such as classical conditioning, however, RL goes a step further: instead of merely explaining how associations are made, reinforcement learning examines how a behavior can be modified according to the quality and strength of those associations in order to achieve some desired reward (or avoid punishment). In other words, it involves a strategic goal that is optimized over time.
What follows is a catalog of the elements that play a role in reinforcement learning, both as it unfolds in humans and as it is implemented for video game agents.
First, there must be an environment: the setting in which learning unfolds. We can break this down further into the “global environment state,” which contains all the information that can possibly be known about a particular setting, and the agent’s personal environment state, which is the subset of the global environment the agent has access to. The environment of a poker game helps to illustrate this.
The global environment for a poker game includes all the cards: those face up on the table for the players to see and those face down and hidden from them. The position of every card is contained within the global environment. The agent’s environment, meanwhile, is characterized by the cards revealed to it through its senses, that is, the cards that have been turned up on the table and the cards in its own hand. This is a smaller subset of the global environment state. In the same way, each of us sees only a small fraction of the global environment state represented by planet Earth and the universe beyond it. In the reinforcement learning literature, such an environment belongs to what is called a Partially Observable Markov Decision Process: “partially observable” because the agent has access to some but not all of the information contained in the global environment, and “Markovian” because what the agent can observe still contains the information needed to learn the desired goal behavior. If I asked you to play poker against a space alien living on the planet Tao, without ever seeing your cards or knowing whether you had won or lost, it would clearly be a hopeless task and utterly un-Markovian.
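To make the distinction concrete, here is a minimal Python sketch of the poker example, with the global state holding every card and the agent observing only a subset. The `deal` function and the dictionary layout are invented for illustration, not taken from any poker library.

```python
import random

# Global state vs. agent observation in a simplified poker deal.
RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [r + s for r in RANKS for s in SUITS]

def deal(num_players=2, hand_size=2, community_size=3):
    cards = random.sample(DECK, num_players * hand_size + community_size)
    hands = [cards[i * hand_size:(i + 1) * hand_size]
             for i in range(num_players)]
    community = cards[num_players * hand_size:]
    # Global environment state: every card, including the hidden ones.
    global_state = {
        "hands": hands,
        "community": community,
        "undealt": [c for c in DECK if c not in cards],
    }
    # Agent 0's environment state: only the cards revealed to it.
    observation = {"my_hand": hands[0], "community": community}
    return global_state, observation

global_state, obs = deal()
```

The agent’s observation is a strict subset of the global state, which is exactly what makes the setting partially observable.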
A further subset of the global environment often used in reinforcement learning is the Markov environment state, which contains all the relevant information about the environment necessary to make an optimal decision about the future with regard to some particular goal. In essence, the Markov environment state summarizes all previous states of the environment such that no further information is needed to optimize decision making from that point onward.
If you were trying to decide how to avoid an oncoming baseball but could not see it or detect it in any way, the Markov environment state would be incomplete. However, if you caught sight of the baseball out of the corner of your eye and could calculate its trajectory from that position, you would not need to know anything about what happened before the ball was thrown in order to dodge it. As such, that glimpse would fulfill the definition of a Markov state. Similarly, in a game of checkers, if you were to arrive midway through the game and glance at the board, you would not need to know how the pieces got to their places in order to formulate an optimal strategy going forward. Simply seeing the board in its present state would be enough. The Markov environment crops up frequently in more formal mathematical treatments of reinforcement learning, but since our focus here is on intuition rather than mathematics, we will not dwell on it.
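The baseball example can be sketched in code: position and velocity together form a Markov state, because the next state can be computed from the current one alone, with no history required. The physics is simplified (no air resistance) and all names are invented for illustration.

```python
from dataclasses import dataclass

GRAVITY = -9.8  # m/s^2, downward

@dataclass
class BallState:
    """A Markov state for the dodging problem: position plus velocity."""
    x: float   # horizontal position (m)
    y: float   # height (m)
    vx: float  # horizontal velocity (m/s)
    vy: float  # vertical velocity (m/s)

def step(state: BallState, dt: float = 0.01) -> BallState:
    # The next state depends only on the current state, never on history.
    return BallState(
        x=state.x + state.vx * dt,
        y=state.y + state.vy * dt + 0.5 * GRAVITY * dt * dt,
        vx=state.vx,
        vy=state.vy + GRAVITY * dt,
    )

ball = BallState(x=0.0, y=1.5, vx=30.0, vy=2.0)
for _ in range(10):  # simulate a tenth of a second
    ball = step(ball)
```

Nothing about who threw the ball, or when, appears anywhere in `BallState`, yet the trajectory can be rolled forward indefinitely.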
The next actor in our reinforcement saga is the agent itself. This is the “brain” capable of learning via reinforcement. In computer science, this would be the robot or synthetic character that contains the reinforcement learning algorithm; in biology, humans are examples of reinforcement learning agents. There must be some means for the agent to take actions in its environment and thereby influence the rewards it receives. A brain inside a laboratory beaker could not do much with reinforcement learning because it could not affect its environment in any way. As such, there must be some channel of interaction with the environment such that the agent’s actions influence the state of the environment.
Besides the agent itself, we must also have a way of making observations about the environment after taking actions, so as to notice their effects. If you were to connect arms to our hypothetical brain-in-a-beaker, but arms without any sense of feeling, those arms might hypothetically be capable of a great deal, perhaps even of building a body for the brain to ride around in. However, the arms would remain useless without any way of observing the actions taken with them and gauging their effect. Consequently, the ability to make observations about the environment is another key component of reinforcement learning. The important thing to remember is that the agent’s observation space must include everything it needs to know to build a strategy for achieving the reward being sought. For instance, an AI agent that must take several actions in a particular sequence to achieve a reward must have access to a memory of those past actions, in the form of some array of variables. An agent that only ever knew its present action would not be able to optimize toward a goal that requires remembering a sequence of actions. In the same way, an agent rewarded for taking an action under a specific environmental circumstance must be able to observe each of the relevant environmental objects. A good rule of thumb is that whatever game variables appear in the agent’s reward function, discussed next, must also be part of the agent’s observation space so that it can learn to earn the reward.
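As a sketch of this rule of thumb, here is a hypothetical observation builder that carries a short memory of past actions alongside the reward-relevant game variables. All the names (`door_open`, `pull_lever`, and so on) are invented for illustration.

```python
from collections import deque

class ObservationBuilder:
    """Assembles the agent's observation, including recent actions."""
    def __init__(self, memory_length=3):
        # If the reward depends on a sequence of actions, the agent
        # needs a memory of them in its observation space.
        self.recent_actions = deque(maxlen=memory_length)

    def observe(self, game_variables, last_action=None):
        if last_action is not None:
            self.recent_actions.append(last_action)
        # Rule of thumb: every variable that appears in the reward
        # function must also appear here.
        return {
            "game": game_variables,
            "recent_actions": list(self.recent_actions),
        }

builder = ObservationBuilder()
obs = builder.observe({"door_open": False})
obs = builder.observe({"door_open": False}, last_action="pull_lever")
obs = builder.observe({"door_open": True}, last_action="push_door")
```

An agent fed only `game_variables`, without the action memory, could never learn a reward that requires pulling the lever before pushing the door.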
Which brings us to the topic of rewards. Without some reward, or its opposite, a punishment, there is nothing to motivate learning a behavior. Before there can be any kind of learning, there must be a stick or a carrot, something to motivate the behavioral change. If things carry no associations, positive or negative, there is nothing to propel a change in behavior, nothing to chase and nothing to avoid. All of this is contained in the agent’s reward function, and it is fundamental to reinforcement learning. If you are training an NPC within a video game, your reward can be built from any combination of variables in the game environment. We can summarize the entire schema of these components and how they relate to each other in a simple model, seen in figure 1.
Figure 1 — Flow Diagram of Reinforcement Learning Components and Interaction: Learner takes an action, observes the environment, receives a reward or not, and then updates its strategy accordingly. This process is repeated, gradually improving the agent’s strategy over time with successive actions.
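As a minimal sketch, a reward function for an NPC might combine game variables like this. The variable names and weights are invented for illustration, not taken from any particular engine.

```python
def npc_reward(state):
    """Combine game variables into a single scalar reward."""
    reward = 0.0
    if state["reached_goal"]:
        reward += 10.0   # the carrot: large reward for the goal behavior
    if state["took_damage"]:
        reward -= 1.0    # the stick: punishment for taking damage
    reward -= 0.01       # small per-step cost, nudging the agent to hurry
    return reward

success = npc_reward({"reached_goal": True, "took_damage": False})
failure = npc_reward({"reached_goal": False, "took_damage": True})
```

Every variable read here (`reached_goal`, `took_damage`) must also be visible in the agent’s observation space, per the rule of thumb above.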
Now that we have all the components of reinforcement learning in one place, let’s look at how they interact to create a strategic behavior. To summarize: we have an agent that can take actions, and depending on the state of the environment and its past actions, the agent is either rewarded or not. But for the magic of reinforcement learning to happen, when the agent is rewarded, that reward must be propagated back across the actions and environment states that led to it. This can be done in a variety of ways, and the formulas for propagating a reward backward across state-action pairs can get a bit hairy. For the purposes of creating video game AI, though, one can get by with this understanding: initially the agent must take random actions, and once it “accidentally” receives a reward through chance, that reward can be associated with the actions and observations that got it there. When this process is repeated many times, the agent gradually builds a model of which actions lead to rewards and which do not. Random trial and error is therefore always a key component of reinforcement learning and the method by which an agent learns to earn a reward.
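One common way to implement this backward propagation is tabular Q-learning. The sketch below trains an agent on an invented five-cell corridor with a reward only in the last cell; the hyperparameters are arbitrary, and the heavy exploration rate suits this tiny example.

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)                      # step left or right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.5   # learning rate, discount, exploration

for episode in range(300):
    s = 0
    for _ in range(1000):               # cap episode length
        # Early on the agent acts randomly; rewarded paths slowly dominate.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # This update propagates reward backward across state-action pairs.
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if s == GOAL:
            break
```

After training, moving right is valued above moving left in every cell, even though only the final step ever pays out directly; the intermediate values were filled in purely by backward propagation from that one reward.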
Most forms of reinforcement learning you will encounter in programming video game AI combine traditional RL, as depicted above, with another class of machine learning called neural networks. This is where the “deep” in deep reinforcement learning comes from: a deep neural network is combined with the RL model to make it more scalable and robust.
One of the early problems with implementing reinforcement learning for video game AI was combinatorial explosion. If an agent needs to observe and interact with its environment to learn to achieve a reward, how big a table would we need to keep track of all the objects in its environment that might affect its reward? For example, four moveable objects that can be lined up in different orders can be arranged in 24 different ways. Eight objects can be arranged in 40,320 ways! Beyond that we get into numbers that are all but meaningless from a human perspective: 12 objects can be arranged in roughly 479 million unique ways. That is a pretty large table, even for a computer. Yet humans regularly succeed at tasks involving such combinatorial explosions. At any given time while playing an Atari video game, for instance, a human is abstracting from 33,600 pixels down to just four or five objects worth keeping track of.
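These counts are simply factorials: n distinct objects can be ordered in n! different ways, which a couple of lines of Python confirm.

```python
import math

# n objects that can be lined up in any order have n! arrangements.
arrangements = {n: math.factorial(n) for n in (4, 8, 12)}
# 4 -> 24, 8 -> 40320, 12 -> 479001600
```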
How then do we abstract from all those pixel combinations down to just a handful of meaningful features? In humans, evolution did the hard work for us. While a computer could conceivably handle more features than a human, depending on its processing power, combinatorial explosion can tax even modern supercomputers. Fortunately for reinforcement learning, parallel developments in a field of AI called neural networks held out an answer.
Deep neural networks are most often associated with the branch of AI called “supervised learning,” which requires someone to provide labeled training data from which the algorithm learns patterns. The amazing thing about deep neural networks is that they can take noisy, non-linear, non-uniform data, like a picture of a cat, and abstract it down to the few features essential for categorizing it. This is how the spam classifier in your email inbox works, and how Netflix creates recommendations based on the movies you liked or disliked. Such classifiers are increasingly common in software and have received a shot in the arm thanks to deep neural networks. Neural networks are more powerful than earlier statistical methods like logistic regression and Bayesian classifiers because they excel at finding patterns buried beneath layers of complexity. They are the Hercules of classifiers, finding patterns in even the largest, most convoluted training data. Such classifiers are powerful tools indeed, but unlike reinforcement learning, they require labeled data to train on, and creating these training sets is often a laborious, painstaking process.
It was the idea of using deep neural networks to abstract from the huge number of combinations presented by an Atari screen down to just a handful of key features that proved revolutionary for reinforcement learning. Deep neural networks can take very large, noisy datasets and detect patterns within them, and the screen of an Atari video game can be thought of as exactly such a dataset. By using the Atari screen as the observation space for an RL algorithm combined with a neural network, researchers could reduce the complexity of all the pixel combinations to a number of outputs corresponding to the different moves a player could make, usually just four or five (Mnih et al., 2015). Were they simply lucky that this was possible? Not at all. Because these games were designed for humans, and humans have only a limited ability to track multiple features in their visual field, the games were built with those limitations in mind. An Atari video game programmed for space aliens capable of learning from 10,000 important feature combinations would be another matter entirely. But in situations where most of the game’s features are mere decoration with no significance for winning or losing, a deep neural network can reduce the complexity to something manageable by a reinforcement learning algorithm.
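The core idea can be sketched with a tiny, untrained network: a full Atari frame of 33,600 pixels goes in, and one value per possible action comes out. The layer sizes and random weights below are illustrative only, not the architecture from the DeepMind paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PIXELS = 210 * 160        # a 33,600-pixel Atari frame, flattened
HIDDEN = 64                 # illustrative hidden-layer width
N_ACTIONS = 5               # one output value per possible move

# Random, untrained weights; training would tune these so the outputs
# estimate the long-term reward of each action.
W1 = rng.standard_normal((N_PIXELS, HIDDEN)) * 0.01
W2 = rng.standard_normal((HIDDEN, N_ACTIONS)) * 0.01

def action_values(screen):
    """Compress 33,600 pixels down to five action values."""
    hidden = np.maximum(0.0, screen @ W1)   # ReLU hidden layer
    return hidden @ W2

frame = rng.random(N_PIXELS)                # a fake grayscale frame
values = action_values(frame)
action = int(np.argmax(values))             # act greedily on the estimates
```

Whatever the input, the agent only ever has to reason over five numbers, which is what tames the combinatorial explosion described above.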
What is true of Atari is also true of board games like Go. This ancient Chinese pastime was long thought to be beyond mastery by computers due to the inherent complexity of the game. As Go experts were fond of reminding us, there are more possible board configurations in a game of Go than quarks that have ever existed in the universe since the beginning of time. But as in Atari video games, many of the board positions in a Go game are not pertinent to play at any given turn. They are like that pixel in the far corner of the screen that isn’t important until it signals an enemy headed your way. Deep reinforcement learning, that is, the combination of deep neural networks and reinforcement learning, proved just as effective at mastering Go as it did at Atari video games. In 2016 and 2017, AlphaGo, a Go-playing deep reinforcement learning system developed by DeepMind, defeated the world’s leading human Go players, and its successors, AlphaGo Zero and AlphaZero, went on to surpass the best Go artificial intelligence by training purely through self-play.
One of the key fallacies in thinking about a game like Go is the assumption that complex games require a complex type of learning. In fractal geometry, bewildering patterns of seemingly infinite complexity can be derived from simple formulas. Evolution, which has produced myriad forms of life, is guided by an equally simple learning rule: the mistake. In truth, the same learning equation that allows for the mastery of Tic-Tac-Toe can produce mastery of a game like Go. In both games, reinforcement learning can discover the key associations pertinent to winning. Which isn’t to say there are not more complex ways to teach computers to master games. Deep Blue, the IBM supercomputer that defeated Garry Kasparov at chess in 1997, was a gargantuan program with thousands of hand-coded scenarios built into it by chess experts and programmers. But such complex programs are, in the end, far less robust and powerful than a simple algorithm like reinforcement learning. For one, they weave in the experiential bias of the humans who coded them. When DeepMind’s Atari deep reinforcement learning algorithm was developed, it discovered a way to rack up points in Breakout, tunneling behind the wall of bricks, a trick few human players had thought to exploit. A program built solely from human experience would likely never produce such “alien” moves. The strength of reinforcement learning is that, playing against itself, it can try millions of moves that nobody in the history of the game has ever thought to try. This can also make it difficult to calibrate: if one’s aim is to create a video game AI with a human level of expertise, it may be necessary to stop training early, before the agent reaches a superhuman level of play.
Many so-called experts looked at AlphaZero, the chess-playing reinforcement learning algorithm, saw a more advanced version of Deep Blue, and thus failed to realize they were looking at a completely different kind of AI, one with far different implications. Because reinforcement learning mimics one of the ways humans learn, the same algorithm that can master Go can be used to master cooking an omelet or folding the laundry. When you first start learning to fold laundry, you make mistakes: the sleeves don’t line up, your creases lack precision. Through repetition, or in the words of computer science, iteration, you slowly learn the moves necessary to reach the goal state, the perfectly folded shirt. In this manner, many human activities can be “gamified” and turned into reinforcement learning problems. This is both the promise and the peril of reinforcement learning. Deep Blue could only ever succeed at chess, whereas a reinforcement learning algorithm can readily be adapted to any task that can be gamified.
In the next article, we will walk through a demo of how to create your own custom RL agent using modern game development software.