Introduction to Reinforcement Learning (Part 03: Q-Learning in Python)
In the previous posts, we learned the Basic Concepts and Key Terminologies of reinforcement learning. In this post, we will learn how Q-learning works and how to implement it in Python.
Q-learning is a value-based reinforcement learning algorithm in which Q-values are used to determine the action to take in a particular state. The Q-values are updated iteratively, and the maximum Q-value defines the action policy at each state.
Q-value or Action-value
The Q-value is the state-action value, and the goal is to maximize $Q(s,a)$:
\[U(s) = \max_a Q(s,a)\]
Now, taking actions (exploration) may be required to discover the environment and thus learn $Q(s,a)$.
At each iteration, the Q-value is updated as follows:
\[Q(s,a) \leftarrow r(s,a) + \gamma \cdot \max_{a^\prime} Q(s^\prime, a^\prime)\]
where
- $r(s,a) \leftarrow$ immediate reward
- $\gamma \leftarrow$ discount factor
- $s^\prime \leftarrow$ next state after taking action $a$
Now, after a number of iterations, the selected action is
\[\pi(s) = {\arg\max}_a Q(s,a)\]
The algorithm looks like:
while True:
- select an action $a$
- receive the immediate reward, $r$
- observe the new state $s^\prime$
- update $Q(s,a) = r(s,a) + \gamma \cdot \max_{a^\prime} Q(s^\prime, a^\prime)$
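To make the update concrete, here is a minimal sketch of a single update step on a toy Q-table (the table size, state indices, reward, and discount factor below are made up purely for illustration):

import numpy as np

gamma = 0.9                  # discount factor (illustrative value)
Q = np.zeros((3, 2))         # toy Q-table: 3 states x 2 actions

s, a = 0, 1                  # current state and chosen action (made up)
s_next, r = 2, 1.0           # observed next state and immediate reward (made up)

# Q(s,a) <- r(s,a) + gamma * max over a' of Q(s',a')
Q[s, a] = r + gamma * np.max(Q[s_next])
print(Q[s, a])               # 1.0 here, since Q[s_next] is still all zeros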
### Temporal-Difference Learning
Now, to learn from the environment, the agent updates its Q-values as follows:
\[Q^{new}(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \cdot \Big( r_t + \gamma \cdot \max_a Q(s_{t+1},a) - Q(s_t,a_t) \Big)\]
You may have seen the same equation in a different form in a paper, as follows:
\[Q^{new}(s_t,a_t) \leftarrow \alpha \cdot \Big( r_t + \gamma \cdot \max_a Q(s_{t+1},a) \Big) + (1-\alpha) \cdot Q(s_t,a_t)\]
where
- $\alpha \in (0,1] \rightarrow$ learning rate
- $(1-\alpha) \cdot Q(s_t,a_t) \rightarrow$ current Q-value, weighted by $(1-\alpha)$
- $\alpha \cdot r_t = \alpha \cdot r(s_t,a_t) \rightarrow$ reward for action $a_t$ taken at state $s_t$
- $\gamma \in (0,1] \rightarrow$ discount factor for future rewards
Also, a few terminologies of temporal-difference learning (TD learning) to keep in mind (a short code sketch follows this list):
- TD target $\leftarrow r_t + \gamma \cdot \max_a Q(s_{t+1},a)$
- TD Delta $\leftarrow$ TD target $-$ $Q(s_t,a_t)$
- $Q^{new}(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \cdot$ TD Delta
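In code, one TD update step can be written directly from these three quantities. The sketch below uses a NumPy Q-table and made-up transition values purely for illustration:

import numpy as np

alpha, gamma = 0.1, 0.9          # learning rate and discount factor (illustrative)
Q = np.zeros((4, 2))             # toy Q-table: 4 states x 2 actions

s, a, r, s_next = 0, 1, 1.0, 3   # one observed transition (made-up values)

td_target = r + gamma * np.max(Q[s_next])   # TD target
td_delta = td_target - Q[s, a]              # TD Delta
Q[s, a] = Q[s, a] + alpha * td_delta        # new Q-value, weighted by alpha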
Choosing Action based on the $\epsilon$-greedy policy
The policy refers to choosing actions using the current Q-value estimates (see the sketch after this list):
- $1-\epsilon$ probability to choose the action that has the highest Q-value (exploitation)
- $\epsilon$ probability to choose a random action (exploration)
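As a sketch, $\epsilon$-greedy action selection over a row of the Q-table could look like the helper below (epsilon_greedy_action is a hypothetical helper name; q_table, state, and action_space_size match the training code later in this post):

import random
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, action_space_size):
    # with probability epsilon: explore (pick a random action)
    if random.uniform(0, 1) < epsilon:
        return random.randrange(action_space_size)
    # with probability 1 - epsilon: exploit (pick the best known action)
    return int(np.argmax(q_table[state]))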
$\epsilon$-decay
Exploration is most important when the agent first starts interacting with the environment. Over time, the agent gathers more information about the environment and needs to focus more on exploitation. $\epsilon \approx 0$ results in greater exploitation and less exploration. That is why we gradually move $\epsilon$ towards zero. This is called $\epsilon$-decay.
Q-learning Algorithm in Python
Environment
In this post, we will not cover how to build an environment in Python. We will only look at the basic functions an environment provides.
If you want to use OpenAI Gym environments, you can simply create an object of the environment. For example, for the FrozenLake game:

import gym
env = gym.make("FrozenLake-v0")

or for the CartPole game:

import gym
env = gym.make('CartPole-v0')
Now, we can check the size of the observation space and action space as follows:

state_space_size = env.observation_space.n
action_space_size = env.action_space.n

(Note that .n is only available for discrete spaces such as FrozenLake's; CartPole has a continuous observation space.)
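For example, with the FrozenLake-v0 environment created above (a 4x4 grid), these should come out to 16 states and 4 actions:

print(state_space_size)   # 16 for FrozenLake-v0 (4x4 grid of tiles)
print(action_space_size)  # 4 (left, down, right, up)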
Note that if state_space_size and action_space_size are small, we can use the simple Q-learning algorithm. If the state and action spaces are large, we should instead use a DQN, aka Deep Q-Network. I will cover DQN in a later post.
Training Q-agent
import numpy as np
import random

# define the initial Q-table (all zeros)
q_table = np.zeros([state_space_size, action_space_size])

# Hyperparameters
episodes = 1000
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration probability

# storing information for graph
reward_per_episode = []

for _ in range(episodes):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        # epsilon-greedy algorithm for selecting action
        if random.uniform(0, 1) < epsilon:
            # exploration
            action = env.action_space.sample()
        else:
            # exploitation
            action = np.argmax(q_table[state])
        next_state, reward, done, info = env.step(action)
        # temporal-difference update of Q(state, action)
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value
        # move to the next state
        state = next_state
        # calculate total reward
        total_reward += reward
    # epsilon decay (could vary, just an example)
    epsilon = DECAY_EPSILON(epsilon)
    # append total reward per episode to the list
    reward_per_episode.append(total_reward)
Here, the DECAY_EPSILON() function can vary from coder to coder. A simple decay could be as follows:
MIN_EPSILON = 0.0

def DECAY_EPSILON(epsilon, num_episodes=episodes, min_epsilon=MIN_EPSILON):
    # shrink epsilon towards MIN_EPSILON by a fraction of the remaining gap on each call;
    # the default arguments let the single-argument call in the training loop work
    delta = (epsilon - min_epsilon) / num_episodes
    return epsilon - delta
or define a decay rate and then do the following:
DECAY_RATE = 0.0005

def DECAY_EPSILON(epsilon, decay_rate=DECAY_RATE):
    # multiplicative decay: epsilon shrinks by a DECAY_RATE fraction on every call
    return epsilon * (1 - decay_rate)
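As a quick sanity check (with purely illustrative numbers), repeated multiplicative decay drives $\epsilon$ towards zero:

epsilon = 1.0
for _ in range(5000):
    epsilon = DECAY_EPSILON(epsilon)
print(epsilon)   # roughly 0.08, since (1 - 0.0005) ** 5000 is about 0.082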
Evaluating Q-Agent
The testing agent does not need exploration; it only exploits the learned Q-values at each state.
# Hyperparameters
episodes = 100

for _ in range(episodes):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        # always exploit: choose the action with the highest Q-value
        action = np.argmax(q_table[state])
        next_state, reward, done, info = env.step(action)
        # move to the next state
        state = next_state
        # calculate total reward
        total_reward += reward
    # append total reward per episode to the list
    reward_per_episode.append(total_reward)
Now, we can check whether we are getting the expected rewards per episode.
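For example, the per-episode rewards collected above can be plotted with matplotlib (a simple sketch; plotting is not covered in this post):

import matplotlib.pyplot as plt

plt.plot(reward_per_episode)
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Reward per episode")
plt.show()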