Dr. Mark Humphrys, School of Computing, Dublin City University.


# What is wrong with γ = 1?

i.e. Just sum all future rewards.
Consider this situation:

[Diagram: a start state with two actions. Action a gives reward 5 and enters an infinite loop; action b gives reward 0, then 100, then enters an infinite loop. The reward per step inside the loops is varied below.]

### State-space is not geographic-space

Note that state-space is not geographic-space.
Transitions can be one-way, not two-way: there may be no way back, ever, to the state you left.
e.g. In chess, state = your queen has been captured.

or e.g. You are Hitler. It is 1941. You have conquered Europe and are unthreatened.
So you decide to invade the Soviet Union and declare war on the US (neither of whom was at war with you so far).
There is no way back ever to that state before invasion. You have gone through a one-way door.


# r=0, γ = anything

If we take action a, the expected long-term reward is:
Ea(R) = 5 + γ 0 + γ² 0 + γ³ 0 + ..
= 5

Eb(R) = 0 + γ 100 + γ² 0 + γ³ 0 + ..
= 100 γ

1. If γ = 0, it picks a.

2. If γ = 1, it takes action b.

3. Exercise - What is the smallest γ for which it picks b?
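The comparison above can be checked numerically. A minimal sketch, assuming (as in the diagram) that action a gives reward 5 then a zero-reward loop, and action b gives reward 0, then 100, then a zero-reward loop:

```python
# r = 0 case: compare expected long-term rewards as gamma varies.

def E_a(gamma):
    # 5 now; every later reward is 0, so discounting changes nothing.
    return 5.0

def E_b(gamma):
    # 0 now, 100 one step later (discounted once), then 0 forever.
    return 100.0 * gamma

for gamma in (0.0, 0.1, 0.5, 1.0):
    best = "a" if E_a(gamma) > E_b(gamma) else "b"
    print(f"gamma = {gamma}: E_a = {E_a(gamma)}, E_b = {E_b(gamma)} -> picks {best}")
```

Running it shows the switch from a to b as γ grows, consistent with cases 1 and 2 above.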


# r=1, γ = anything

Imagine if the infinite loops scored 1 each time round instead of 0.

Ea(R) = 5 + γ + γ² + γ³ + ..

Eb(R) = 0 + γ 100 + γ² + γ³ + ..

1. If γ = 0, it still picks a.

2. If γ = 1:
Ea(R) = 5 + 1 + 1 + ..
Eb(R) = 0 + 100 + 1 + 1 + ...
Both are infinite, so it can't tell the difference and may pick either one. Not necessarily b.

Notes on infinity:
If γ = 1, then even if one infinite loop gives 1 forever, and the other gives 100 forever, it can't tell the difference. Both are infinite. In fact, all of the following sum to the same thing, +∞:

1 + 1 + 1 + 1 + ...
-1000 - 1000 - 1000 - 1000 - 1000 + 1 + 1 + 1 + ...
10 + 10 + 10 + 10 + ...
For any γ < 1 we do not have this problem: the discounted sums converge and can be compared.

3. Exercise - If γ < 1, for what γ will it pick b?
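For γ < 1 the sums converge and we can compare them directly. A sketch using the standard geometric series γ + γ² + γ³ + .. = γ/(1-γ) to get closed forms for the r = 1 case:

```python
# r = 1 case with gamma < 1: the infinite sums now converge.
# Geometric series: gamma + gamma^2 + gamma^3 + ... = gamma / (1 - gamma).

def E_a(gamma):
    # 5 now, then reward 1 on every subsequent step of the loop.
    return 5.0 + gamma / (1.0 - gamma)

def E_b(gamma):
    # 0 now, 100 next step, then reward 1 on every step after that.
    return 100.0 * gamma + gamma**2 / (1.0 - gamma)

for gamma in (0.0, 0.3, 0.9, 0.99):
    best = "a" if E_a(gamma) > E_b(gamma) else "b"
    print(f"gamma = {gamma}: E_a = {E_a(gamma):.2f}, E_b = {E_b(gamma):.2f} -> picks {best}")
```

Experimenting with γ values close to 0 is one way to approach the exercise above.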


# r=anything, γ = anything

If the infinite loops give reward r each time round:

Ea(R) = 5 + γ r + γ² r + γ³ r + ..
= 5 + γ r + r (γ² + γ³ + .. )

Eb(R) = 0 + γ 100 + γ² r + γ³ r + ..
= 100 γ + r (γ² + γ³ + .. )

• If r is greater than some value it will always pick action a, no matter what γ is. What is that value?
e.g. Imagine if loops give r = 1000 each time. Then reward of 100 is useless - we want to get to the loop as soon as possible - pick action a.
Similarly, if r = 100 always pick a.

So what is the point at which we always pick a?
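The general case can be explored numerically before answering. A sketch using the same geometric-series closed forms as before, now with the loop reward r as a parameter:

```python
# General case: the loops pay reward r each step, gamma < 1.
# Uses gamma^2 + gamma^3 + ... = gamma^2 / (1 - gamma).

def E_a(gamma, r):
    # 5 now, then r on every subsequent step.
    return 5.0 + r * gamma / (1.0 - gamma)

def E_b(gamma, r):
    # 0 now, 100 next step, then r on every step after that.
    return 100.0 * gamma + r * gamma**2 / (1.0 - gamma)

for r in (0.0, 1.0, 100.0, 1000.0):
    for gamma in (0.1, 0.5, 0.9):
        best = "a" if E_a(gamma, r) > E_b(gamma, r) else "b"
        print(f"r = {r}, gamma = {gamma}: picks {best}")
```

Note how large r makes action a win at every γ tried, as argued above; varying r by hand narrows down the crossover point.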
