Dr. Mark Humphrys, School of Computing, Dublin City University.


# D 3-reward (or more) reward functions

For 3-reward (or more) agents the relative sizes of the rewards do matter for the Q-learning policy. Consider an agent of the form:
```
reward:
    if (best event)       r_1
    else if (good event)  r_2
    else                  r_3
```
where $r_1 > r_2 > r_3$.

# D.1 Policy in Q-learning

We show by an example that changing one reward in this agent while keeping the others fixed can lead to a switch of policy. Imagine that currently actions a and b lead to the following sequences of rewards:

```
(x,a) leads to sequence  r_2, r_2, r_2, r_2  and then  r_2  forever

(x,b) leads to sequence  r_1, r_1, r_1, r_1  and then  r_3  forever
```

Currently action b is the best. We lose on the fifth step certainly, but we make up for it by receiving the payoff from $r_1$ in the first four steps. However, if we start to increase the size of $r_2$, while keeping $r_1$ and $r_3$ the same, we can eventually make action a the most profitable path to follow and cause a switch in policy.
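Under discounting, the switch can be checked numerically. A minimal sketch (my own, not from the thesis), assuming a discount factor $\gamma = 0.9$, hypothetical reward values, and a long horizon to stand in for "forever":

```python
# Discounted return of a reward sequence: four initial rewards, then a
# constant tail, truncated at a long horizon to approximate "forever".
def discounted_return(first_four, tail, gamma=0.9, horizon=1000):
    total = 0.0
    for t in range(horizon):
        r = first_four[t] if t < 4 else tail
        total += (gamma ** t) * r
    return total

r1, r3 = 1.0, 0.0   # hypothetical values, chosen so r1 > r2 > r3 below

for r2 in (0.2, 0.5, 0.8):
    q_a = discounted_return([r2] * 4, r2)   # (x,a): r2 forever
    q_b = discounted_return([r1] * 4, r3)   # (x,b): r1 four times, then r3 forever
    best = "a" if q_a > q_b else "b"
    print(f"r2={r2}: Q(x,a)={q_a:.3f}  Q(x,b)={q_b:.3f}  best: {best}")
```

With these numbers, b wins at $r_2 = 0.2$ but a wins by $r_2 = 0.5$, even though $r_1$ and $r_3$ never moved.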

# D.2 Strength in W-learning

Because increasing the gaps between rewards may switch policy, we can't say that in general it will increase W-values. In the example above, say the leader was suggesting (and executing) action a all along. By increasing the gaps between our rewards, we suddenly want to take action a ourself, so $W(x) \rightarrow 0$.
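To see the collapse concretely, here is a small sketch (mine, with assumed numbers, reusing the D.1 sequences) that takes W(x) as the Q-value the agent loses by obeying the leader's action a:

```python
# Q-value of a reward sequence: four initial rewards, then a constant
# tail, truncated at a long horizon to approximate "forever".
def q(first_four, tail, gamma=0.9, horizon=1000):
    return sum(gamma**t * (first_four[t] if t < 4 else tail) for t in range(horizon))

r1, r3 = 1.0, 0.0           # hypothetical rewards, r1 > r2 > r3

for r2 in (0.2, 0.8):       # before and after widening the gap r2 - r3
    q_a = q([r2] * 4, r2)   # the leader's action a
    q_b = q([r1] * 4, r3)   # our previously preferred action b
    w = max(q_a, q_b) - q_a # Q-value lost by obeying the leader
    print(f"r2={r2}: W(x)={w:.3f}")
```

At $r_2 = 0.2$ we prefer b, so W(x) is large; at $r_2 = 0.8$ we prefer the leader's own action, so W(x) = 0 despite the wider reward gap.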

Increasing the difference between its rewards may cause an agent $A_i$ to have new disagreements, and maybe new agreements, with the other agents about what action to take, so the progress of the W-competition may be radically different. Once a W-value changes, we have to follow the whole re-organisation to its conclusion.

What we can say is that multiplying all rewards by the same positive constant (see §D.4 shortly), and hence multiplying all Q-values by that constant, will increase or decrease the size of all W-values without changing the policy.

# D.3 Normalisation

Any agent with rewards $(r_1, r_2, r_3)$ can be normalised to one with rewards $(r_1 - r_3, r_2 - r_3, 0)$. The original agent can be viewed as a normalised one which also picks up $r_3$ every timestep no matter what.

The normalised agent will have the same policy and the same W-values: the extra constant stream of $r_3$ adds the same amount, $\frac{r_3}{1-\gamma}$, to every Q-value, so the differences between Q-values, and hence both the policy and the W-values, are unchanged.
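A quick numerical check (my own sketch, reusing the D.1 sequences with assumed values): subtracting $r_3$ from every reward shifts all discounted returns by the same constant stream, so Q-value differences survive intact:

```python
gamma = 0.9
horizon = 2000   # long enough that truncating "forever" is negligible

def ret(seq):
    """Discounted return of a (truncated) reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(seq))

r1, r2, r3 = 1.0, 0.5, 0.2                 # hypothetical, r1 > r2 > r3
seq_a = [r2] * horizon                     # action a: r2 forever
seq_b = [r1] * 4 + [r3] * (horizon - 4)    # action b: r1 four times, then r3

# Normalised agent: rewards become (r1 - r3, r2 - r3, 0)
normalise = lambda seq: [r - r3 for r in seq]

diff_original   = ret(seq_a) - ret(seq_b)
diff_normalised = ret(normalise(seq_a)) - ret(normalise(seq_b))

# The Q-value difference -- and hence the policy -- is unchanged.
print(abs(diff_original - diff_normalised) < 1e-9)
```

Both differences come out equal (up to floating-point noise), so the greedy choice between a and b is the same for the original and the normalised agent.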

# D.4 Exaggeration

Any agent $A_i$ can be exaggerated into an agent $A_i^c$ by multiplying all its rewards by the same constant $c$. Think of it as changing the "unit of measurement" of the rewards.

Proof: When we take action a in state x, let $P_{xa}(r)$ be the probability that reward $r$ is given to $A_i$ (and therefore that reward $cr$ is given to $A_i^c$). Then $A_i^c$'s expected reward is simply $c$ times $A_i$'s expected reward:

$$E^c(r \mid x,a) \;=\; \sum_r cr \, P_{xa}(r) \;=\; c \sum_r r \, P_{xa}(r) \;=\; c \, E(r \mid x,a)$$

It follows from the definitions in §2.1 that $Q^c(x,a) = c \, Q(x,a)$ and $W^c(x) = c \, W(x)$.

I should note this only works if $c > 0$. A negative $c$ would reverse the ordering of the Q-values and hence change the policy.

$A_i^c$ will have the same policy as $A_i$, but larger or smaller W-values (larger if $c > 1$, smaller if $0 < c < 1$).
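As a sanity check, here is a sketch (mine, on a made-up three-state deterministic MDP) that solves for Q-values by value iteration and confirms $Q^c = c\,Q$ with the greedy policy unchanged:

```python
GAMMA = 0.9

# Hypothetical deterministic MDP: next_state[s][a] and reward[s][a].
next_state = {0: {"a": 1, "b": 2}, 1: {"a": 0, "b": 2}, 2: {"a": 0, "b": 1}}
reward     = {0: {"a": 1.0, "b": 0.2}, 1: {"a": 0.5, "b": 0.0}, 2: {"a": 0.0, "b": 1.0}}

def value_iteration_q(rewards, sweeps=500):
    """Solve Q(s,a) = r(s,a) + GAMMA * max_a' Q(s',a') by repeated sweeps."""
    Q = {s: {a: 0.0 for a in acts} for s, acts in next_state.items()}
    for _ in range(sweeps):
        for s, acts in next_state.items():
            for a, s2 in acts.items():
                Q[s][a] = rewards[s][a] + GAMMA * max(Q[s2].values())
    return Q

c = 3.0   # any positive constant
scaled = {s: {a: c * r for a, r in acts.items()} for s, acts in reward.items()}

Q, Qc = value_iteration_q(reward), value_iteration_q(scaled)

for s in Q:
    for a in Q[s]:
        assert abs(Qc[s][a] - c * Q[s][a]) < 1e-6                # Q^c = c * Q
    assert max(Q[s], key=Q[s].get) == max(Qc[s], key=Qc[s].get)  # same greedy policy
print("Q-values scale by c; greedy policy unchanged")
```

Since W-values are differences of Q-values of the same agent, they scale by the same factor $c$.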
