# What is Reinforcement Learning?

## In its literal sense
<strong><em>Reinforcement</em></strong>: A consequence that follows a behaviour and increases the likelihood of that behaviour occurring in the future<small>[[source](https://psychology.uiowa.edu/comparative-cognition-laboratory/glossary/reinforcement)]</small>.

There are two types of reinforcement, known as positive reinforcement and negative reinforcement. Positive reinforcement offers a reward when the desired behaviour is expressed, while negative reinforcement removes an undesirable element whenever the desired behaviour is achieved.

<strong><em>Learning</em></strong>: Acquiring knowledge and skills and having them readily available from memory so you can make sense of future problems and opportunities<small>[[source](https://pdfs.semanticscholar.org/4fc5/9769aa78266810ba1add6f0ed3e730e8efa6.pdf)]</small>.

## RL within ML paradigm
> "A computer program is said to learn from <b>experience</b> <i>E</i> with respect to some <b>task</b> <i>T</i> and some <b>performance</b> <i>P</i>, if its performance on <i>T</i>, as measured by <i>P</i>, improves with experience <i>E</i>."<p align = 'right'>- Mitchell 1997</p>

```{figure} Images/Types_of_ML.jpeg
---
width: 80%
align: center
name: types_of_ml
---
Broad types of ML
```

Fitting RL into this definition gives,

1. **Task** (T) ![\longrightarrow](https://latex.codecogs.com/svg.latex?\Large&space;\longrightarrow) Decision-making strategy
2. **Performance** (P) ![\longrightarrow](https://latex.codecogs.com/svg.latex?\Large&space;\longrightarrow) Cumulative reward
3. **Experience** (E) ![\longrightarrow](https://latex.codecogs.com/svg.latex?\Large&space;\longrightarrow) Interaction with the environment/system
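The performance measure (P), cumulative reward, can be sketched as a short function. The optional discount factor `gamma` is an assumption borrowed from the standard RL formulation rather than something introduced above; with `gamma = 1` it reduces to a plain sum:

```python
def cumulative_reward(rewards, gamma=1.0):
    """Return the (optionally discounted) sum of a reward sequence.

    gamma < 1 weights immediate rewards more heavily than distant ones.
    """
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(cumulative_reward([1, 0, 2]))        # undiscounted: 3
print(cumulative_reward([1, 0, 2], 0.9))   # discounted: 1 + 0 + 2*0.81 = 2.62
```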

## The fundamental block of RL
> Reinforcement Learning (RL) is the science of decision making in a given environment: mapping situations to actions so as to maximize a numerical reward signal.

```{figure} Images/RL_Model.png
---
width: 80%
align: center
name: rl_model
---
A fundamental RL block
```

A typical interaction cycle can be formulated as follows:

1. The agent observes the current state of the environment, ![S_t](https://latex.codecogs.com/svg.latex?S_t), and the current reward, ![R_t](https://latex.codecogs.com/svg.latex?R_t)
2. The agent decides on an action, ![A_t](https://latex.codecogs.com/svg.latex?A_t)
3. The agent performs the action by interacting with the environment
4. The environment transitions into a new state, ![S_{t+1}](https://latex.codecogs.com/svg.latex?S_{t+1})
5. The agent receives reward feedback, ![R_{t+1}](https://latex.codecogs.com/svg.latex?R_{t+1})
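The five steps above can be sketched as a single interaction loop. `CorridorEnv` here is a made-up toy environment (states 0 to 4, goal at 4), not a real library; the agent acts randomly purely to illustrate the loop structure:

```python
import random

class CorridorEnv:
    """Toy stand-in environment: states 0..4, reward +1 for reaching state 4."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: +1 (right) or -1 (left)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = CorridorEnv()
state, done, total_reward = env.reset(), False, 0.0   # 1. observe S_t
while not done:
    action = random.choice([-1, 1])                   # 2. decide on action A_t
    state, reward, done = env.step(action)            # 3-4. act; env moves to S_{t+1}
    total_reward += reward                            # 5. reward feedback R_{t+1}

print(total_reward)  # 1.0 once the goal state is reached
```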

## Pacman as an RL example
```{figure} Images/RL_example_pacman.png
---
width: 80%
align: center
name: rl_pacman
---
Pacman as an RL example
```

<b>Pacman</b> is one of the most commonly known arcade games, which almost everyone has played at least once. Using it as an RL example, 

1. <b>State Space</b> ![S](https://latex.codecogs.com/svg.latex?S): All possible configurations in the game, including the positions of the pellets, the ghosts, the player, etc.

2. <b>Action Space</b> ![A](https://latex.codecogs.com/svg.latex?A): A set of all <i>allowed actions</i> {<em>left</em>, <em>right</em>, <em>up</em>, <em>down</em>, or <em>stay</em>}.

3. <b>Policy</b> ![$\pi$ : $S \longrightarrow A$](https://latex.codecogs.com/svg.latex?\pi%20\text{:}S%20\longrightarrow%20A): A mapping from the state space to the action space.

The objective of the player is to take a series of optimal actions that maximize the total reward obtained by the end of the game.<small>[[source](http://cs229.stanford.edu/proj2017/final-reports/5241109.pdf)]</small>
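A policy, being a mapping from states to actions, can be as simple as a hand-crafted function. The state representation below is a deliberately simplified assumption (just the player's and nearest ghost's x-positions), nothing like the full Pacman state space:

```python
# The allowed actions, as listed above.
ACTIONS = ["left", "right", "up", "down", "stay"]

def flee_policy(state):
    """Hand-crafted policy pi: S -> A that moves away from the nearest ghost.

    `state` is a simplified (player_x, ghost_x) pair, assumed for illustration.
    """
    player_x, ghost_x = state
    if player_x < ghost_x:
        return "left"    # ghost is to the right, so move left
    if player_x > ghost_x:
        return "right"   # ghost is to the left, so move right
    return "up"          # same column: dodge vertically

print(flee_policy((3, 7)))  # player left of ghost -> "left"
```

A learned policy has exactly the same signature; RL replaces the hand-written rules with a mapping optimized from experience.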

## Using RL on OpenAI Gym Atari
```{figure} Images/RL_atari.gif
---
width: 80%
align: center
name: rl_atari
---
RL on Atari environments
```

RL algorithms can be applied to many different Atari game environments, where the environment definitions, player objectives, allowable actions, etc., are all distinct.
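All these Atari environments share the same `reset`/`step` interface, which is what lets one RL algorithm run across them unchanged. The sketch below uses a `FakeAtari` stand-in so it runs without Gym installed; with Gym available you would construct a real environment (e.g. `gym.make("Breakout-v4")`) instead. The loop shown follows the classic pre-v0.26 Gym API, where `step` returns four values:

```python
import random

class FakeAtari:
    """Stand-in mimicking the classic Gym interface for illustration only."""

    class _Space:
        n = 4                           # e.g. NOOP / FIRE / LEFT / RIGHT
        def sample(self):
            return random.randrange(self.n)

    action_space = _Space()

    def reset(self):
        self._steps_left = 10
        return "initial-frame"          # real envs return pixel observations

    def step(self, action):
        self._steps_left -= 1
        done = self._steps_left == 0
        return "frame", 1.0, done, {}   # obs, reward, done, info

env = FakeAtari()                       # with Gym: gym.make("Breakout-v4")
obs, done, total = env.reset(), False, 0.0
while not done:
    action = env.action_space.sample()  # random play, no learning yet
    obs, reward, done, info = env.step(action)
    total += reward

print(total)  # 10.0 for this stand-in (one reward per step for 10 steps)
```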

## RL Agent learns to walk
```{figure} Images/RL_Learn_to_walk.gif
---
width: 80%
align: center
name: rl_learn_to_walk
---
RL Agent learns to walk
```

Beyond the simple decision making of an arcade game, RL agents can learn even highly complex tasks, such as bipedal walking, though these may take much longer to train.

## Many Faces of RL
RL sits at the intersection of various disciplines and fields, as illustrated below:

```{figure} Images/many_faces_of_RL.PNG
---
width: 80%
align: center
name: rl_many_faces
---
Many Faces of RL
```

## Distinctive Characteristics of RL
What makes RL different from other ML paradigms?

* Not strictly <i>supervised</i>: instead of pre-labelled data, the <strong><em>data is generated through interaction with the environment</em></strong>.
* Unlike <i>unsupervised</i> learning, the goal isn't finding relationships within data but to <strong><em>maximize some reward function</em></strong>.
* Reward feedback determines the quality of performance.
* Time is a key component, as the process is <strong><em>sequential (non-i.i.d. data) with delayed feedback</em></strong>.
* Each action the agent takes affects the subsequent data it receives.

## Summarizing
* RL is the science of **learning to make decisions** from interaction with the environment.
* Solving any problem through RL makes us think about:
    * time
    * long-term action consequences
    * effective and efficient gathering of experience
    * predicting the future states
    * dealing with uncertainty