# Reward

* A reward ![R_t](https://latex.codecogs.com/svg.latex?R_t) is a scalar feedback signal. Often, ![R_t \in \mathbb{R}](https://latex.codecogs.com/svg.latex?R_t%20\in%20\mathbb{R}) or ![R_t \in \mathbb{Z}](https://latex.codecogs.com/svg.latex?R_t%20\in%20\mathbb{Z}).
* Indicates how well the agent is doing at step ![t](https://latex.codecogs.com/svg.latex?t).
* The agent's job is to maximize cumulative reward.

Reinforcement Learning is based on the <b>reward hypothesis</b>:

````{admonition} Reward Hypothesis
<i>All goals can be described by the maximisation of expected cumulative reward.</i>
```{math}
\max \, \mathbb{E} \left[ \sum_{i = 0}^{\infty} R_{t+i+1} \right]
```
````

Our **goal** is to sequentially perform actions which maximize expected cumulative reward.

However,
* Any action may have long-term consequences.
* Reward may be delayed.
* It may be better to sacrifice immediate reward to gain greater long-term reward.

```{figure} Images/RL_example_mouse_cheese.jpg
---
width: 80%
align: center
name: rl_eg_mouse_cheese
---
Short-term vs Long-term reward trade-off
```

Examples:
1. A financial investment (may take months to mature).
2. Refueling a helicopter (may prevent future crash).
3. Blocking opponent moves in chess (may improve chances of winning).

```{note}
<i>An RL agent's learning process is <strong>heavily linked with the reward distribution over time</strong>; however, there is no predefined way to design the <strong>best reward function</strong>.</i>
```

```{admonition} Tip: <i>Be careful what you wish for, for you might get it</i>
The agent learns the policy the reward function asked for, not the one that should have been asked for, nor the one that was intended.
```

## Discounted return 
Discounted return (![G_t](https://latex.codecogs.com/svg.latex?G_t)) is the discounted cumulative reward, defined as follows:
```{math}
G_t = \sum_{i = 0}^{\infty} \gamma^i \cdot R_{t+i+1}, \quad \text{where } \gamma \in \left[ 0, 1 \right]
```
* ![\gamma \longrightarrow Discount rate](https://latex.codecogs.com/svg.latex?\gamma\longrightarrow) Discount rate
* Larger ![\gamma \implies](https://latex.codecogs.com/svg.latex?\gamma\implies) Smaller discount. <b>Agent cares more about long-term reward</b>.
* Smaller ![\gamma \implies](https://latex.codecogs.com/svg.latex?\gamma\implies) Larger discount. <b>Agent cares more about short-term reward</b>.
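As a minimal sketch of this effect (using a hypothetical reward sequence, not one from the text), the return can be computed directly from the definition; a large delayed reward dominates for large ![\gamma](https://latex.codecogs.com/svg.latex?\gamma) but is almost ignored for small ![\gamma](https://latex.codecogs.com/svg.latex?\gamma):

```python
# Discounted return: G_t = sum_i gamma^i * R_{t+i+1}
def discounted_return(rewards, gamma):
    # rewards[i] plays the role of R_{t+i+1}
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Hypothetical episode: small immediate rewards, one large delayed reward.
rewards = [1.0, 1.0, 1.0, 10.0]

print(discounted_return(rewards, 0.9))  # large gamma: delayed reward contributes heavily
print(discounted_return(rewards, 0.1))  # small gamma: return is dominated by R_{t+1}
```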

## Task-dependent discounting
<strong><em>Episodic Task</em></strong>:
* Tasks that have a **terminal state**.
* Problem naturally breaks into **episodes**.
* The **return** becomes a finite sum.

<strong><em>Continuing Task</em></strong>:
* Tasks that have **no terminal state** but can go on infinitely until stopped.
* Problem lacks a natural end.
* The **return** should be discounted to prevent absurdly large numbers.
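To see why discounting is needed for continuing tasks: with a constant reward ![R](https://latex.codecogs.com/svg.latex?R) at every step and ![\gamma < 1](https://latex.codecogs.com/svg.latex?\gamma%20%3C%201), the infinite return is the geometric series ![R / (1 - \gamma)](https://latex.codecogs.com/svg.latex?R%20/%20(1%20-%20\gamma)), which stays bounded. A small sketch with hypothetical values:

```python
# Episodic task: the return is a finite sum up to the terminal step T.
def episodic_return(rewards, gamma):
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Continuing task with constant reward r at every step:
# the infinite discounted sum converges to r / (1 - gamma) when gamma < 1.
def continuing_constant_return(r, gamma):
    assert 0 <= gamma < 1, "gamma must be < 1 for the infinite sum to converge"
    return r / (1 - gamma)

print(episodic_return([1.0, 1.0, 1.0], 0.9))     # finite sum: 1 + 0.9 + 0.81
print(continuing_constant_return(1.0, 0.9))      # bounded despite the infinite horizon
```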

## Two ways of calculating return
<ol>
<li> <strong><em>Batch Learning</em></strong>

```{math}
G_t = \sum_{i = 0}^{\infty} \gamma^i \cdot R_{t+i+1}
```
</li>
<li> <strong><em>Online Learning</em></strong>

```{math}
G_{t} = R_{t+1} + \gamma \cdot G_{t+1}
```
</li>
</ol>
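The two formulations agree: unrolling the recursion ![G_t = R_{t+1} + \gamma G_{t+1}](https://latex.codecogs.com/svg.latex?G_t%20=%20R_{t+1}%20+%20\gamma%20G_{t+1}) reproduces the batch sum. A quick sketch (with a hypothetical reward sequence) computes the online form backwards from the end of an episode and checks it against the direct sum:

```python
def batch_return(rewards, gamma):
    # Batch form: G_t = sum_i gamma^i * R_{t+i+1}
    return sum(gamma**i * r for i, r in enumerate(rewards))

def recursive_returns(rewards, gamma):
    # Online/recursive form: G_t = R_{t+1} + gamma * G_{t+1},
    # computed backwards from the terminal step, where G_T = 0.
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))  # returns[t] holds G_t

rewards = [1.0, 0.0, 2.0, 3.0]
print(recursive_returns(rewards, 0.9)[0])  # G_0 via the recursion
print(batch_return(rewards, 0.9))          # same value via the direct sum
```

Computing returns backwards in a single pass is how returns are typically accumulated at the end of an episode, since each ![G_t](https://latex.codecogs.com/svg.latex?G_t) reuses ![G_{t+1}](https://latex.codecogs.com/svg.latex?G_{t+1}) instead of re-summing the tail.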