School of Computing. Dublin City University.


reward: if (good event) r else s
where r > s.
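Proof: Let us fix r and s and learn the Q-values. In a deterministic world, given a state x, the Q-value for action a will be:
$$Q(x,a) = c\,r + d\,s$$
for some real numbers c, d. The Q-value for a different action b will be:
$$Q(x,b) = e\,r + f\,s$$
where e and f sum to the same total as c and d, since both Q-values spread the same discount weights over the two rewards. That is, e + f = c + d.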
So whichever one of c and e is bigger defines which is the best action (which gets the larger amount of the "good" reward r), irrespective of the sizes of r > s.
To be precise, if c > e, then Q(x,a) > Q(x,b).
Note that these numbers are not integers: it may not be simply a question of the worse action receiving s instead of r a finite number of times. The worse action may also receive r instead of s at some points, and the number of differences may in fact not be finite.
To be precise, noting that (c - e) = (f - d), the difference between the Q-values is:
$$Q(x,a) - Q(x,b) = (c-e)\,r + (d-f)\,s = (c-e)(r-s)$$
where the real number (c - e) is constant for the given two actions a and b in state x. (c - e) depends only on the probabilities of events happening, not on the specific values of the rewards r and s that we hand out when they do. Changing the relative sizes of the rewards r > s can only change the magnitude of the difference between the Q-values, but not the sign. The ranking of actions will stay the same.
For example, an agent with rewards (10,9) and an agent with rewards (10,0) will have different Q-values but will still suggest the same optimal action.
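As a quick numerical check of the (10,9) versus (10,0) example, the sketch below runs value iteration on a small deterministic world. The three-state world, its transition table, the discount factor of 0.9 and the helper names are all assumptions made for illustration; none of them come from the text.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic world: NEXT_STATE[x][a] is the state reached
# from x under action a, and GOOD[x][a] says whether that step counts as
# a "good event".
NEXT_STATE = [[1, 2], [1, 2], [2, 0]]
GOOD = [[False, True], [False, False], [True, False]]


def q_values(r, s, sweeps=2000):
    """Value iteration for the reward 'if (good event) r else s'."""
    q = [[0.0, 0.0] for _ in NEXT_STATE]
    for _ in range(sweeps):
        # Synchronous sweep: the new table is built entirely from the old one.
        q = [[(r if GOOD[x][a] else s) + GAMMA * max(q[NEXT_STATE[x][a]])
              for a in range(2)]
             for x in range(len(NEXT_STATE))]
    return q


def greedy(q):
    """Index of the highest-valued action in each state."""
    return [row.index(max(row)) for row in q]


q_close = q_values(10, 9)  # rewards (10, 9)
q_wide = q_values(10, 0)   # rewards (10, 0)
policy_close = greedy(q_close)
policy_wide = greedy(q_wide)
print(policy_close, policy_wide)
```

In this toy world the two Q-tables differ in every entry, yet the greedy action in each state is the same for both agents.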
In a probabilistic world, we would have:
$$E(r_{t+1}) = p\,r + q\,s$$
where p + q = 1, and:
$$E(r_{t+2}) = p'\,r + q'\,s$$
for some p' + q' = 1.
I think this should just read:
$$E(r_{t+1}) = \sum_{y} P_{xa}(y)\, r(x,y) = P_{xa}(y_1)\, r(x,y_1) + \cdots + P_{xa}(y_n)\, r(x,y_n) = p'\,r + q'\,s$$
Hence:
$$Q(x,a) = c\,r + d\,s$$
for some real numbers c, d as before.
Proof: From the proof of Theorem C.1:
$$Q(x,a) - Q(x,b) = (c-e)(r-s)$$
where (c - e) is a constant independent of the particular rewards.
Using our "deviation" definition, for the 2-reward agent in a deterministic world, with a the agent's own preferred action and b the action taken by the leader:
$$W(x) = Q(x,a) - Q(x,b) = (c-e)(r-s)$$
The size of the W-value that the agent presents in state x if another agent is the leader is simply proportional to the difference between its rewards. If the leader wants to take the same action as the agent, then the W-value is zero (that is, (c - e) = 0). If the leadership switches to a different agent, the constant (c - e) switches to the value for that new leader's action.
Increasing the difference between its rewards will cause the agent to have the same disagreements with the other agents about what action to take, but higher W-values: that is, an increased ability to compete. So the progress of the W-competition will be different.
For example, an agent with rewards (8,5) will be stronger (will have higher W-values and win more competitions) than an agent with the same logic and rewards (2,0). And an agent with rewards (2,0) will be stronger than one with rewards (10,9). In particular, the strongest possible 2-reward agent is:
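reward: if (good event) $r_{\max}$ else $r_{\min}$
where $r_{\max}$ and $r_{\min}$ denote the largest and smallest rewards available.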
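The strength ordering (8,5), then (2,0), then (10,9) can be checked numerically. The sketch below is only an illustration: the three-state deterministic world, the discount factor of 0.9, the choice of state 0 with a leader who insists on action 0, and the helper names are all assumptions, not taken from the text. The W-value is computed as the deviation between the agent's best Q-value and the Q-value of the leader's action.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic world: NEXT_STATE[x][a] is the successor of
# state x under action a; GOOD[x][a] marks the "good event" steps.
NEXT_STATE = [[1, 2], [1, 2], [2, 0]]
GOOD = [[False, True], [False, False], [True, False]]


def q_values(r, s, sweeps=2000):
    """Value iteration for the reward 'if (good event) r else s'."""
    q = [[0.0, 0.0] for _ in NEXT_STATE]
    for _ in range(sweeps):
        q = [[(r if GOOD[x][a] else s) + GAMMA * max(q[NEXT_STATE[x][a]])
              for a in range(2)]
             for x in range(len(NEXT_STATE))]
    return q


def w_value(r, s, x, leader_action):
    """Deviation: what the agent loses in state x if the leader's action is taken."""
    q = q_values(r, s)
    return max(q[x]) - q[x][leader_action]


# In state 0 the agent prefers action 1; suppose the leader insists on action 0.
w = {pair: w_value(*pair, x=0, leader_action=0)
     for pair in [(8, 5), (2, 0), (10, 9)]}

for (r, s), value in w.items():
    # The last column is the constant (c - e), which should not depend on r, s.
    print((r, s), round(value, 6), round(value / (r - s), 6))
```

Dividing each W-value by (r - s) gives the same constant for all three agents in this world, which is the proportionality the proof describes.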
Now consider the normalised version of the (r, s) agent:
reward: if (good event) (r - s) else 0
From Theorem C.1, this will have different Q-values but the same Q-learning policy. And from Theorem C.2, it will have identical W-values. You can regard the original agent as an (r - s), 0 agent which also picks up an automatic bonus of s every step no matter what it does. Its Q-values can be obtained by simply adding the following to each of the Q-values of the (r - s), 0 agent:
$$s + \gamma s + \gamma^2 s + \cdots = \frac{s}{1-\gamma}$$
where $\gamma$ is the discount factor.
We are shifting the same contour up and down the y-axis in Figure 8.1.
The same suggested action and the same W-values means that for the purposes of W-learning it is the same agent. For example, an agent with rewards (1.5,1.1) is identical in W-learning to one with rewards (0.4,0). The W=Q method would treat them differently.
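The (1.5,1.1) versus (0.4,0) example can be checked in the same way. This is a sketch under assumptions: the deterministic three-state world, the discount factor of 0.9 and the helper names are hypothetical, and the W-values are taken to be the deviations from each state's best Q-value.

```python
import math

GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic world: NEXT_STATE[x][a] is the successor of
# state x under action a; GOOD[x][a] marks the "good event" steps.
NEXT_STATE = [[1, 2], [1, 2], [2, 0]]
GOOD = [[False, True], [False, False], [True, False]]


def q_values(r, s, sweeps=2000):
    """Value iteration for the reward 'if (good event) r else s'."""
    q = [[0.0, 0.0] for _ in NEXT_STATE]
    for _ in range(sweeps):
        q = [[(r if GOOD[x][a] else s) + GAMMA * max(q[NEXT_STATE[x][a]])
              for a in range(2)]
             for x in range(len(NEXT_STATE))]
    return q


def deviations(q):
    """For each state, the loss of each action relative to the best action."""
    return [[max(row) - value for value in row] for row in q]


q_orig = q_values(1.5, 1.1)  # original agent
q_norm = q_values(0.4, 0.0)  # normalised agent, rewards (r - s, 0)

# Every Q-value should be shifted by the same per-step bonus s, discounted.
offset = 1.1 / (1 - GAMMA)
shifted_by_offset = all(
    math.isclose(q_orig[x][a] - q_norm[x][a], offset, abs_tol=1e-9)
    for x in range(3) for a in range(2))

# The deviations (W-values) should be unchanged by the shift.
same_w = all(
    math.isclose(w1, w2, abs_tol=1e-9)
    for row1, row2 in zip(deviations(q_orig), deviations(q_norm))
    for w1, w2 in zip(row1, row2))

print(shifted_by_offset, same_w)
```

The comparison uses a small tolerance because 1.5 - 1.1 is not exactly 0.4 in floating point.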
reward: if (good event) r else 0
where r > 0.
Proof: We have just multiplied all rewards by c, so all Q-values are multiplied by c. If this is not clear, see the general proof Theorem D.1.
I should note this only works if: c > 0
The scaled agent will have the same policy as the original agent, but different W-values.
We are exaggerating or levelling out the contour in Figure 8.1.
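A small numerical sketch of the scaling argument, including the c > 0 caveat. The three-state deterministic world, the discount factor of 0.9, the base reward r = 2, the scale factors 3 and -1, and the helper names are all illustrative assumptions rather than anything given in the text.

```python
import math

GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic world: NEXT_STATE[x][a] is the successor of
# state x under action a; GOOD[x][a] marks the "good event" steps.
NEXT_STATE = [[1, 2], [1, 2], [2, 0]]
GOOD = [[False, True], [False, False], [True, False]]


def q_values(r, s, sweeps=2000):
    """Value iteration for the reward 'if (good event) r else s'."""
    q = [[0.0, 0.0] for _ in NEXT_STATE]
    for _ in range(sweeps):
        q = [[(r if GOOD[x][a] else s) + GAMMA * max(q[NEXT_STATE[x][a]])
              for a in range(2)]
             for x in range(len(NEXT_STATE))]
    return q


def greedy(q):
    """Index of the highest-valued action in each state."""
    return [row.index(max(row)) for row in q]


q_base = q_values(2, 0)      # normalised agent with r = 2
q_scaled = q_values(6, 0)    # same agent with rewards multiplied by c = 3
q_flipped = q_values(-2, 0)  # rewards multiplied by c = -1

# With c = 3 every Q-value (and so every W-value) is multiplied by 3.
scaled_by_c = all(
    math.isclose(q_scaled[x][a], 3 * q_base[x][a])
    for x in range(3) for a in range(2))

same_policy = greedy(q_scaled) == greedy(q_base)
flipped_policy_differs = greedy(q_flipped) != greedy(q_base)
print(scaled_by_c, same_policy, flipped_policy_differs)
```

With a negative factor the "good event" becomes the worse outcome, so the agent in this world learns to avoid it and the policy changes, which is why the result needs c > 0.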
In particular, the strongest possible normalised agent is:
reward: if (good event) $r_{\max}$ else 0
where $r_{\max}$ denotes the largest reward available.