School of Computing, Dublin City University.


reward r := r_b if (best event), r_g else if (good event), r_n elsewhere, where r_b > r_g > r_n.
(x,a) leads to the sequence r_n, r_n, r_n, r_n, r_b and then r_b forever
(x,b) leads to the sequence r_g, r_g, r_g, r_g, r_n and then r_n forever
Currently action b is the best. We certainly lose on the fifth step, but we make up for it by receiving the payoff from r_g in the first four steps. However, if we start to increase the size of r_b, while keeping r_g and r_n the same, we can eventually make action a the more profitable path to follow and cause a switch in policy.
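The switch can be checked numerically. Here is a minimal sketch with hypothetical values (not from the text): r_n = 0, r_g = 0.5, discount factor 0.5, and a variable r_b:

```python
GAMMA = 0.5  # hypothetical discount factor for illustration

def ret(prefix, tail):
    """Discounted return of a reward sequence: a finite prefix, then `tail` forever."""
    n = len(prefix)
    head = sum(GAMMA**t * r for t, r in enumerate(prefix))
    return head + GAMMA**n * tail / (1 - GAMMA)

q_b = ret([0.5] * 4, 0.0)        # action b: r_g for four steps, r_n forever after
q_a_small = ret([0.0] * 4, 1.0)  # action a with r_b = 1: b still wins
q_a_large = ret([0.0] * 4, 10.0) # action a with r_b = 10: a now wins

assert q_a_small < q_b < q_a_large  # increasing r_b switches the policy
```

With these numbers the returns are 0.125, 0.9375 and 1.25 respectively: raising r_b alone is enough to flip the preferred action.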
Increasing the difference between an agent's rewards may cause it to have new disagreements, and perhaps new agreements, with the other agents about which action to take, so the progress of the W-competition may be radically different. Once one W-value changes, we have to follow the whole reorganisation through to its conclusion.
What we can say is that multiplying all rewards by the same constant (see §D.4 shortly), and hence multiplying all Q-values by the constant, will increase or decrease the size of all W-values without changing the policy.
The normalised agent will have the same policy and the same W-values.
Think of it as changing the "unit of measurement" of the rewards.
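As a sanity check, here is a tiny value-iteration sketch, with a hypothetical two-state MDP and made-up numbers (not from the text), showing that scaling every reward by a constant scales every Q-value by the same constant and leaves the greedy policy unchanged:

```python
GAMMA = 0.9
# Hypothetical deterministic MDP: (state, action) -> (reward, next state)
mdp = {(0, 'a'): (1.0, 1), (0, 'b'): (0.0, 0),
       (1, 'a'): (0.5, 0), (1, 'b'): (2.0, 1)}

def q_values(scale):
    """Value iteration with every reward multiplied by `scale`."""
    Q = {sa: 0.0 for sa in mdp}
    for _ in range(500):
        Q = {(x, a): scale * r + GAMMA * max(Q[(nx, b)] for b in 'ab')
             for (x, a), (r, nx) in mdp.items()}
    return Q

Q1, Q5 = q_values(1.0), q_values(5.0)
# Every Q-value is multiplied by the constant...
assert all(abs(Q5[sa] - 5 * Q1[sa]) < 1e-6 for sa in mdp)
# ...and the greedy policy is unchanged in every state.
for x in (0, 1):
    assert max('ab', key=lambda a: Q1[(x, a)]) == max('ab', key=lambda a: Q5[(x, a)])
```

Only the "unit of measurement" changes: the ordering of actions within each state, and hence the policy, is preserved.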
Proof: Let A be the agent with rewards r, and let A' be an identical agent whose every reward r is replaced by cr. When we take action a in state x, let P_r be the probability that reward r is given to A (and therefore that reward cr is given to A'). Then A''s expected reward is simply c times A's expected reward:
E'(r) = Σ_r P_r (cr) = c Σ_r P_r r = c E(r)
It follows from the definitions in §2.1 that Q'(x,a) = c Q(x,a) and V'(x) = c V(x).
Note that this holds only if c > 0: the scaled agent will have the same policy as the original, but larger or smaller W-values.
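The c > 0 restriction is easy to see with a toy example (hypothetical Q-values, not from the text): a positive constant preserves the argmax over actions, while a negative constant reverses it:

```python
# Hypothetical Q-values for two actions in some state.
Q = {'a': 1.2, 'b': 3.4}

def greedy(q):
    """Return the action with the highest Q-value."""
    return max(q, key=q.get)

assert greedy(Q) == 'b'
assert greedy({a: 2.0 * v for a, v in Q.items()}) == 'b'   # c > 0: same policy
assert greedy({a: -1.0 * v for a, v in Q.items()}) == 'a'  # c < 0: policy flips
```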