This blog post will demonstrate how deep reinforcement learning (Deep Q-Learning) can be implemented and applied to play the CartPole game using Keras and Gym, in only 78 lines of code!
I’ll explain everything without requiring any prerequisite knowledge about reinforcement learning.
The code used for this article is on GitHub.
Reinforcement learning is a type of machine learning that allows you to create AI agents that learn from the environment by interacting with it. Just like how we learn to ride a bicycle, this kind of AI learns by trial and error. As seen in the picture, the brain represents the AI agent, which acts on the environment. After each action, the agent receives feedback, which consists of the reward and the next state of the environment. The reward is usually defined by a human. In the bicycle analogy, we can define the reward as the distance traveled from the starting point.
Google’s DeepMind published its famous paper Playing Atari with Deep Reinforcement Learning in 2013, in which they introduced a new algorithm called Deep Q Network (DQN for short). It demonstrated how an AI agent can learn to play games by just observing the screen, without any prior information about those games. The result turned out to be pretty impressive. This paper opened the era of what is called ‘deep reinforcement learning’, a mix of deep learning and reinforcement learning.
In the Q-learning algorithm, there is a function called the Q function, which is used to approximate the reward of taking an action in a given state. Similarly, in the Deep Q Network algorithm, we use a neural network to approximate that reward based on the state. We will discuss how this works in detail.
Usually, training an agent to play an Atari game takes a while (from a few hours to a day). So we will make an agent play a simpler game called CartPole, using the same idea as the paper.
CartPole is one of the simplest environments in OpenAI Gym (a game simulator). As you can see in the animation at the top, the goal of CartPole is to balance a pole that is connected by one joint to the top of a moving cart. Instead of pixel information, the state gives 4 kinds of information, such as the angle of the pole and the position of the cart. An agent moves the cart by performing a series of actions, each either 0 or 1, which push the cart left or right.
Gym makes interacting with the game environment really simple.
As we discussed above, action can be either 0 or 1. If we pass one of those numbers, env, which represents the game environment, will emit the results. done is a boolean value telling whether the game ended or not. The old state information, paired with the action, the next state, and the reward, is the information we need for training the agent.
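To make the interaction loop concrete, here is a minimal sketch. It does not use Gym itself: `ToyEnv` is a hypothetical stand-in that only mimics the classic `reset()` / `step(action)` interface, so the snippet runs without any dependencies. Its dynamics are made up. Note also that newer Gym and Gymnasium releases return five values from `step()` (splitting `done` into `terminated` and `truncated`), so the unpacking line may need adjusting there.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not Gym itself).

    It mimics the classic interface: reset() returns a state, and
    step(action) returns (next_state, reward, done, info).
    """

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]  # four numbers, like CartPole's state

    def step(self, action):
        assert action in (0, 1)  # 0 = push left, 1 = push right
        self.steps += 1
        next_state = [random.uniform(-1, 1) for _ in range(4)]  # made-up dynamics
        reward = 1.0  # one point for every step survived
        done = self.steps >= self.max_steps
        return next_state, reward, done, {}


env = ToyEnv()
state = env.reset()
transitions = []
done = False
while not done:
    action = random.choice([0, 1])  # random policy, no learning yet
    next_state, reward, done, info = env.step(action)
    # This tuple is exactly what the agent needs for training later.
    transitions.append((state, action, reward, next_state, done))
    state = next_state
```

With a real environment, the loop body stays the same; only the object bound to `env` changes.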
This post is not about deep learning or neural nets, so we will treat the neural net as a black box: an algorithm that learns from pairs of example inputs and outputs, detects some kind of pattern, and predicts the output for unseen input data. But we should understand which part of the DQN algorithm is the neural net.
Note that the neural net we are going to use is similar to the diagram above. We will have one input layer that receives 4 pieces of information, and 3 hidden layers. But we are going to have 2 nodes in the output layer, since there are two buttons (0 and 1) for the game.
Keras makes it really simple to implement a basic neural network. The code below creates an empty neural net model.
Settings such as optimizer are parameters that define the characteristics of the neural network, but we are not going to discuss them here.
In order for a neural net to understand and predict based on the environment data, we have to feed it the information. The fit() method feeds the state and target_f information to the model, which I explain below. You can ignore the rest of the parameters.
This training process makes the neural net predict the reward value (target_f) from a certain state. When you call the predict() function on the model, the model will predict the reward of the current state based on the data you trained it on.
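As a rough illustration of building the model and calling `fit()` and `predict()`, here is a minimal sketch using the Keras API bundled with TensorFlow. The layer width of 24, the learning rate, and the target values are assumptions chosen for illustration, not values taken from this article's code.

```python
import numpy as np
from tensorflow import keras

state_size, action_size = 4, 2  # CartPole: 4 state values, 2 actions

# Three hidden layers as described above; the width (24) is an assumed value.
model = keras.Sequential([
    keras.Input(shape=(state_size,)),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(action_size, activation="linear"),
])
# The learning rate here is an assumed value.
model.compile(loss="mse", optimizer=keras.optimizers.Adam(learning_rate=0.001))

state = np.zeros((1, state_size), dtype="float32")  # a batch holding one state
target_f = np.array([[0.5, 1.0]], dtype="float32")  # made-up target values

model.fit(state, target_f, epochs=1, verbose=0)  # one training pass on the pair
prediction = model.predict(state, verbose=0)     # shape (1, 2): one value per action
```

The output layer is linear because the network predicts reward values, not probabilities.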
The most notable features of the DQN algorithm are the remember and replay methods. Both are pretty simple concepts.
One of the challenges for DQN is that the neural network used in the algorithm tends to forget previous experiences as it overwrites them with new ones. So we need a list of previous experiences and observations to re-train the model with. We will call this list of experiences memory and use the remember() function to append state, action, reward, and next state to it.
In our example, the memory list will have a form of:
And the remember() function will simply store states, actions, and resulting rewards in the memory like below:
done is just a boolean that indicates if the state is the final state.
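A minimal sketch of the memory and of `remember()` could look like the following. It uses a `collections.deque` with a maximum length instead of a plain list, which is a substitution on my part so that old experiences are discarded automatically. The capacity of 2000 and the sample values are assumptions.

```python
from collections import deque

# Assumed capacity; once full, the oldest experiences drop off first.
memory = deque(maxlen=2000)


def remember(state, action, reward, next_state, done):
    """Store one transition as a (state, action, reward, next_state, done) tuple."""
    memory.append((state, action, reward, next_state, done))


# Two made-up transitions; the second one ends the game.
remember([0.0, 0.1, 0.0, -0.1], 1, 1.0, [0.0, 0.2, 0.0, -0.2], False)
remember([0.0, 0.2, 0.0, -0.2], 0, 1.0, [0.1, 0.1, 0.3, -0.4], True)
```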
A method that trains the neural net with experiences in the memory is called replay(). First, we sample some experiences from the memory and call them batches.
The above code will make batches a list of randomly sampled indexes into the memory. For example, if batches is [1,5,2,7], each number is the index of an experience in the memory: 1, 5, 2, and 7.
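The sampling step can be sketched with the standard library alone. This version draws distinct indexes (sampling without replacement), which may differ from the article's own code. The `batch_size` value and the placeholder memory contents are assumptions.

```python
import random

# Placeholder memory of ten made-up transitions.
memory = [("state%d" % i, i % 2, 1.0, "state%d" % (i + 1), False) for i in range(10)]
batch_size = 4  # assumed value

# Never ask for more samples than the memory holds.
n = min(batch_size, len(memory))
batches = random.sample(range(len(memory)), n)  # distinct random indexes
minibatch = [memory[i] for i in batches]        # the experiences to train on
```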
To make the agent perform well in the long term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. To do this, we use a ‘discount rate’ or ‘gamma’, which scales down the value of future rewards. This way the agent will learn to maximize the discounted future reward based on the given state.
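The usual way to turn this idea into a training target is to add the immediate reward to gamma times the best value predicted for the next state. Here is a small sketch. `q_target` is a hypothetical helper name, and `gamma=0.95` is an assumed value.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Return the training target for the action that was taken.

    next_q_values: the model's predicted values for each action in next_state.
    gamma: the discount rate (0.95 is an assumed value).
    """
    if done:
        return reward  # no future reward after the final state
    return reward + gamma * max(next_q_values)


target = q_target(reward=1.0, next_q_values=[0.5, 2.0], done=False)
final = q_target(reward=1.0, next_q_values=[0.5, 2.0], done=True)
```

A gamma close to 1 makes the agent care about rewards far in the future, while a gamma close to 0 makes it short-sighted.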
Our agent will at first select its action randomly with a certain probability, called the ‘exploration rate’ or ‘epsilon’. This is because at first, it is better for the agent to try all kinds of things before it starts to see the patterns. When it is not deciding the action randomly, the agent will predict the reward value based on the current state, and pick the action that will give the highest reward.
np.argmax() is the function that returns the index of the highest value among the elements of act_values. act_values looks like this: [0.67, 0.2], with each number representing the reward of picking action 0 and 1 respectively. In the example of [0.67, 0.2], argmax returns 0 because the value at index 0 is the highest.
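Put together, the action selection can be sketched as below. This is an illustrative version, not necessarily the article's exact method: here `act` takes the predicted values directly rather than calling a model, and `action_size=2` is assumed from the two CartPole actions.

```python
import random

import numpy as np


def act(act_values, epsilon, action_size=2):
    """Pick a random action with probability epsilon, else the best predicted one."""
    if random.random() < epsilon:
        return random.randrange(action_size)  # explore
    return int(np.argmax(act_values))         # exploit: index of the highest value


greedy = act([0.67, 0.2], epsilon=0.0)    # never explores, so picks action 0
explored = act([0.67, 0.2], epsilon=1.0)  # always explores: 0 or 1 at random
```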
There are some parameters that have to be passed to a reinforcement learning agent. You will see these over and over again.
episodes - the number of games we want the agent to play.
gamma - aka decay or discount rate, used to calculate the future discounted reward.
epsilon - aka exploration rate, the rate at which an agent randomly decides its action rather than using its prediction.
epsilon_decay - we want to decrease the number of explorations as the agent gets good at playing games.
epsilon_min - we want the agent to explore at least this amount.
learning_rate - determines how much the neural net learns in each iteration.
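To see how epsilon, epsilon_decay, and epsilon_min work together, here is a small sketch of one common schedule: multiply epsilon by the decay factor after each round and never let it fall below the minimum. All three numeric values are assumptions for illustration, and `decay` is a hypothetical helper name.

```python
def decay(epsilon, epsilon_min=0.01, epsilon_decay=0.995):
    """Shrink epsilon by a constant factor, but never below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)


epsilon = 1.0  # start fully random (assumed value)
schedule = []
for episode in range(1000):
    schedule.append(epsilon)  # exploration rate used in this episode
    epsilon = decay(epsilon)
```

Early entries in `schedule` are close to 1.0 (mostly random actions), and later entries settle at the floor, so the agent keeps exploring a little even once it plays well.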
I explained each part of the agent above. The code below implements everything we’ve talked about as a nice and clean class.
The training part is even shorter. I’ll explain in the comments.
In the beginning, the agent explores by acting randomly.
It goes through multiple phases of learning.
- The cart masters balancing the pole.
- But goes out of bounds, ending the game.
- It tries to move away from the bounds when it is too close to them, but drops the pole.
- The cart masters balancing and controlling the pole.
After several hundred episodes (which took 10 minutes), it starts to learn how to maximize the score.
The final result is the birth of a skillful CartPole game player!
The code used for this article is on GitHub. I added the saved weights for those who want to skip the training part.