Introduction to Reinforcement Learning (Part 01: Basic Concepts)
Reinforcement learning (RL) has become a popular field in both academia and industry because it helps build intelligent agents and can solve automation-based problems.
In this tutorial, we will walk through the very basic concepts of RL. I plan to continue the tutorial as a series in which we will learn more about the theory and implementation of different algorithms in Python. So, stay tuned.
Reinforcement Learning (RL)
RL describes the learning process of an intelligent agent that perceives the state of its environment and learns to choose, in each state, the action that leads to the maximum cumulative reward by the end of a game.
In Python, the interaction loop looks like this:
# `env` and `agent` are assumed to exist already
obs = env.reset()                    # initial observation
done = False
while not done:
    action = agent.get_action(obs)   # the agent picks an action
    next_obs, reward, done, info = env.step(action)  # the environment responds
    obs = next_obs
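The loop above assumes that `env` and `agent` already exist. As a minimal, self-contained sketch, the snippet below defines a hypothetical `CoinFlipEnv` and `RandomAgent` (both invented here for illustration, not taken from any library) that follow the same `reset`/`step`/`get_action` convention, so the loop can actually be run.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # a single dummy observation

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # feedback from the environment
        self.steps += 1
        done = self.steps >= self.max_steps
        return 0, reward, done, {}


class RandomAgent:
    """Hypothetical agent that ignores the observation and guesses at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def get_action(self, obs):
        return self.rng.randint(0, 1)


def run_episode(env, agent):
    """Run the interaction loop once and return (total reward, number of steps)."""
    obs = env.reset()
    done = False
    total_reward = 0.0
    num_steps = 0
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, done, info = env.step(action)
        total_reward += reward
        num_steps += 1
        obs = next_obs
    return total_reward, num_steps


total_reward, num_steps = run_episode(CoinFlipEnv(), RandomAgent())
print(num_steps, total_reward)
```

A random agent learns nothing, of course; the point is only to show where the observation, action, and reward travel in each iteration.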
Reinforcement Learning Terminologies
Basic Terminologies

Environment – the world the agent interacts with (it could be a game or a game-like problem), in which the agent or player learns to choose a particular action in each state. Example: a Chess game.
Markov Decision Process (MDP) – the formalism typically used to define an environment. An MDP is represented as a 4-tuple ($S,A,P_a,R_a$), where $S$ is a set of states, $A$ is a set of actions, $P_a \big( s, s^{\prime} \big) = Pr \big( s_{t+1} = s^{\prime} \vert s_t = s, a_t = a \big)$ is the probability of reaching state $s^{\prime}$ if action $a$ is taken in state $s$, and $R_a \big( s, s^{\prime} \big)$ is the immediate reward received for that transition.
Agent – a learner whose goal is to maximize the cumulative reward collected over the time steps of a game. The agent looks for a policy that tells it the best action to take in a given state of the environment. Example: each player in a Chess game is an agent whose goal is to win the game with the best possible combination of moves.

Action ($a$) – one of the moves an agent can perform in a given state of the environment. Example: in the opening position of Chess, a player can move any of the pawns forward or move either of the Knights.

State ($s$) – the present condition of the agent/player in the environment. Example: The beginning condition of a Chess board where a player is yet to take an action.

Reward ($r$) – for each action taken by an agent, the environment returns a reward. It is usually a scalar value and is simply feedback from the environment.
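To make these terms concrete, here is a rough sketch of a tiny MDP written as plain Python dictionaries. The two states, two actions, probabilities, and rewards are all made up for illustration; only the ($S,A,P_a,R_a$) structure comes from the definition above.

```python
import random

# Hypothetical two-state MDP, invented for illustration.
STATES = ["s0", "s1"]        # S
ACTIONS = ["stay", "move"]   # A

# P[(s, a)] maps each next state s' to Pr(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a, s')] is the immediate reward; transitions not listed give 0.
R = {("s0", "move", "s1"): 1.0}


def step(state, action, rng):
    """Sample s' from P and look up the immediate reward."""
    next_states = list(P[(state, action)].keys())
    probs = list(P[(state, action)].values())
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    reward = R.get((state, action, next_state), 0.0)
    return next_state, reward


rng = random.Random(0)
state = "s0"
for _ in range(3):
    action = rng.choice(ACTIONS)  # a placeholder agent that acts at random
    next_state, reward = step(state, action, rng)
    print(state, action, reward, next_state)
    state = next_state
```

Each printed line is one (state, action, reward, next state) interaction between the agent and the environment.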
Additional Important Terminologies

Discount Factor ($\gamma$) – In an RL problem, the agent tries to maximize the cumulative reward collected from time step $t$ onward,
$total\ reward = \sum_{k=0}^T R_{t+k+1}$
However, not all rewards are equally important: rewards in the distant future usually matter less than immediate ones. We therefore discount future rewards by multiplying them by powers of a discount factor $\gamma \in [0,1)$. The cumulative discounted reward is then
$total\ reward = \sum_{k=0}^\infty \big[\gamma^k \cdot R_{t+k+1} \big]$ $= R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \dots$
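As a small worked example of the two sums above, the sketch below computes the return for a short, made-up list of rewards. The helper name `discounted_return` is ours, and the sum is truncated to the rewards that are actually given; passing `gamma=1.0` recovers the plain, undiscounted finite sum.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[k] plays the role of R_{t+k+1}."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


rewards = [1.0, 1.0, 1.0]  # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=1.0))  # undiscounted: 1 + 1 + 1 = 3.0
print(discounted_return(rewards, gamma=0.5))  # discounted: 1 + 0.5 + 0.25 = 1.75
```

With $\gamma = 0.5$ the third reward contributes only a quarter of its face value, which is exactly the "distant rewards matter less" effect.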

Policy ($\pi$) – defines the action strategy in a particular state (the current state). A deterministic policy maps each state directly to the action to take, $a_t = \pi(s_t)$. A stochastic policy outputs a probability distribution over actions, from which the action is sampled, $a_t \sim \pi(\centerdot \vert s_t)$. We will see more details later.

Value ($V$) – the expected long-term return with discount, as opposed to the short-term reward $R$. Here, $V_\pi(s)$ is defined as the expected long-term return from the current state $s$ under policy $\pi$.

Q-value or Action-value ($Q$) – the Q-value is similar to the value function, except that it takes an extra parameter, the current action $a$. Here, the Q-value function $Q_\pi(s, a)$ refers to the expected long-term return from the current state $s$ when taking action $a$ and following policy $\pi$ afterwards.

Trajectory ($\tau$, sometimes called an Episode) – a sequence of states, actions, and rewards, e.g., $\tau \rightarrow (s_2, a_2, r_3,s_3,a_3,r_4,s_4)$. How a trajectory unfolds depends on the initial state and on the state transitions of the environment:
 The initial state $s_0$ is sampled from an initial state distribution $\mu$, $s_0 \sim \mu(\centerdot)$
 deterministic state transition, $s_{t+1} = f(s_t,a_t)$
 stochastic state transition, $s_{t+1} \sim Pr(\centerdot \vert s_t,a_t)$
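The three sampling steps can be put together to generate a trajectory. The sketch below does this on a hypothetical 4-state chain; the initial distribution, the policy probabilities, the 10% chance of a failed move, and the reward of 1 at the goal are all assumptions chosen for illustration.

```python
import random

GOAL = 3  # terminal state of a hypothetical 4-state chain: 0, 1, 2, 3


def sample_initial_state(rng):
    # s_0 ~ mu(.): uniform over the two leftmost states
    return rng.choice([0, 1])


def policy(state, rng):
    # a_t ~ pi(. | s_t): move right with probability 0.8, left otherwise
    return "right" if rng.random() < 0.8 else "left"


def transition(state, action, rng):
    # s_{t+1} ~ Pr(. | s_t, a_t): the move fails with probability 0.1
    if rng.random() >= 0.9:
        return state
    delta = 1 if action == "right" else -1
    return min(max(state + delta, 0), GOAL)


def sample_trajectory(rng, max_steps=50):
    """Return a list [s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T]."""
    state = sample_initial_state(rng)
    trajectory = [state]
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = policy(state, rng)
        next_state = transition(state, action, rng)
        reward = 1.0 if next_state == GOAL else 0.0
        trajectory += [action, reward, next_state]
        state = next_state
    return trajectory


tau = sample_trajectory(random.Random(0))
print(tau)
```

Replacing `transition` with a function that always applies the move would give the deterministic case, $s_{t+1} = f(s_t,a_t)$.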
Final Objective of RL
Our target is to make the agent learn the best policy ($\pi^*$), the one that maximizes the expected cumulative reward:
\[\pi^* = \arg \max_\pi E_{\tau \sim \pi} \big[ R(\tau) \big]\] where $R(\tau)$ is the cumulative reward collected along trajectory $\tau$, and $\tau \sim \pi$ means
 $s_0 \sim \mu(\centerdot)$
 $a_t \sim \pi(\centerdot \vert s_t)$
 $s_{t+1} \sim Pr(\centerdot \vert s_t,a_t)$
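To get a feel for this objective, the sketch below estimates $E_{\tau \sim \pi} \big[ R(\tau) \big]$ by averaging sampled returns and then takes the $\arg \max$ over a tiny set of candidate policies. Everything specific here is an assumption for illustration: the 4-state chain, the reward of 1 at the goal, $\gamma = 0.9$, and the three candidate policies, each described by its probability `p_right` of moving right. Real RL algorithms search much larger policy spaces far more cleverly than this brute-force comparison.

```python
import random

GOAL, GAMMA = 3, 0.9  # hypothetical chain 0..3 and discount factor


def rollout_return(p_right, rng, max_steps=30):
    """Sample one trajectory tau ~ pi and return its discounted return R(tau)."""
    state, ret = 0, 0.0  # s_0 ~ mu: here mu puts all its mass on state 0
    for t in range(max_steps):
        # a_t ~ pi(. | s_t): go right with probability p_right
        move = 1 if rng.random() < p_right else -1
        # deterministic transition, clipped to the ends of the chain
        state = min(max(state + move, 0), GOAL)
        if state == GOAL:
            ret += GAMMA ** t  # reward 1 on reaching the goal, 0 elsewhere
            break
    return ret


def expected_return(p_right, episodes=500, seed=0):
    """Monte Carlo estimate of E_{tau ~ pi}[R(tau)]."""
    rng = random.Random(seed)
    return sum(rollout_return(p_right, rng) for _ in range(episodes)) / episodes


candidates = [0.2, 0.5, 0.9]  # a tiny, made-up policy class
scores = {p: expected_return(p) for p in candidates}
best = max(scores, key=scores.get)  # arg max over the candidates
print(scores, best)
```

The policy that moves right most often reaches the goal soonest, so its reward is discounted the least and it comes out on top.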
How RL works
Depending on the objective, RL takes different approaches to solving a particular problem. Here, we introduce the primary approaches to solving RL problems.

Value-based Approach – in this approach, an agent tries to maximize a value function $V(s)$, which is the cumulative reward the agent expects to gain in the future.
\[V_\pi(s) = E_\pi \big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \vert S_t = s \big]\]
Policy-based – in a policy-based RL approach, an agent tries to build a policy that outputs the optimal action to perform in each state so as to gain the maximum reward in the future. Here, the policy $\pi$ determines the next action $a$ in any state $s$ and can be represented as $a = \pi(s)$. There are two types of policy-based RL methods:

Deterministic – in the current state $s$, the policy $\pi$ outputs the action $a$ to take.

Stochastic – each action has a certain probability, given by the equation below:
\[\pi (a \vert s) = Pr \big(A_t=a \vert S_t=s \big)\]


Model-based – this approach requires an additional model of the environment, i.e., a way to predict the next state and reward for a given state and action.
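The difference between the two kinds of policy is easy to see in code. In the sketch below both policies are simple lookup tables over two hypothetical states; the state names, action names, and probabilities are made up for illustration.

```python
import random

# Deterministic policy: a = pi(s), stored as a state -> action table.
deterministic_pi = {"s0": "move", "s1": "stay"}

# Stochastic policy: pi(a | s) = Pr(A_t = a | S_t = s), one distribution per state.
stochastic_pi = {
    "s0": {"move": 0.7, "stay": 0.3},
    "s1": {"move": 0.1, "stay": 0.9},
}


def act_deterministic(state):
    """Always return the same action for a given state."""
    return deterministic_pi[state]


def act_stochastic(state, rng):
    """Sample an action according to pi(. | state)."""
    actions = list(stochastic_pi[state])
    probs = [stochastic_pi[state][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)
print(act_deterministic("s0"))  # always "move"
samples = [act_stochastic("s0", rng) for _ in range(1000)]
print(samples.count("move") / len(samples))  # close to 0.7
```

Calling `act_deterministic` twice on the same state always agrees, while `act_stochastic` only agrees with itself on average.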
Temporal Difference Learning
TD learning is a class of model-free (not model-based) reinforcement learning methods that learn by sampling from the environment and bootstrapping from the current estimate of the value function.
$V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} +\gamma \cdot V(S_{t+1}) - V(S_t) \big]$
Here,
 $V(S_t) \rightarrow$ previous estimate
 $\alpha \rightarrow$ learning rate
 $R_{t+1} \rightarrow$ reward received on moving to the next state
 $\gamma \cdot V(S_{t+1}) \rightarrow$ discounted value at the next step
 $R_{t+1} + \gamma \cdot V(S_{t+1}) \rightarrow$ TD Target
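The update rule can be sketched in a few lines of Python. The example below applies it to a hypothetical five-state random walk: states A to E in a row, a step left or right with equal probability, reward 1 only when the walk exits on the right. The environment, the step size `alpha=0.02`, and `gamma=1.0` (reasonable here only because every episode terminates) are all assumptions for illustration. For this particular walk the exact values are 1/6, 2/6, ..., 5/6, so the estimates should land near those numbers.

```python
import random

STATES = ["A", "B", "C", "D", "E"]  # non-terminal states of a hypothetical random walk


def td_update(V, state, reward, next_state, alpha, gamma):
    """Apply one TD(0) update to V[state] in place and return the TD error."""
    td_target = reward + gamma * V[next_state]  # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[state]             # TD target minus previous estimate
    V[state] += alpha * td_error
    return td_error


def run_td0(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = {s: 0.5 for s in STATES}  # arbitrary initial estimates
    # Terminal states have value 0 and are never updated.
    V["left_end"] = 0.0
    V["right_end"] = 0.0
    for _ in range(episodes):
        i = 2  # every episode starts in the middle state "C"
        while 0 <= i < len(STATES):
            j = i + rng.choice([-1, 1])  # random policy: step left or right
            if j < 0:
                next_state, reward = "left_end", 0.0
            elif j >= len(STATES):
                next_state, reward = "right_end", 1.0
            else:
                next_state, reward = STATES[j], 0.0
            td_update(V, STATES[i], reward, next_state, alpha, gamma)
            i = j
    return V


V = run_td0()
print({s: round(V[s], 2) for s in STATES})
```

Note that the update only ever looks one step ahead: it never waits for the end of the episode, which is what distinguishes TD learning from averaging complete returns.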
Workflow
 Problem Formulation and Model Buildup
 Create an Environment based on the model
 Define Actions and Observations for the agent(s)
 Define the Reward function of the agent(s)
 Create the agent(s)
 Train and validate the agent(s)
 Deploy the policy (policies in multi-agent games)
Reading/Watch Lists and Resources
 Awesome Reinforcement Learning
 Awesome Deep Reinforcement Learning
 Course in Deep Reinforcement Learning
 Deep RL Bootcamp
 Coursera - Practical Reinforcement Learning
 Udacity - Reinforcement Learning
Applications
 Industrial Robot Automation
 Video Games
 Self Driving Cars
 Drone Shipping
 Bots
 Cyber Security
In the next tutorial, we will learn some additional concepts and terminology that will strengthen our foundation for learning RL.