Dr. Mark Humphrys

School of Computing. Dublin City University.









D 3-reward (or more) reward functions

For 3-reward (or more) agents the relative sizes of the rewards do matter for the Q-learning policy. Consider an agent of the form:
  
$A_i$ reward: if (best event) $r_1$ else if (good event) $r_2$ else $r_3$
where $r_1 > r_2 > r_3$.



D.1 Policy in Q-learning

We show by an example that changing one reward in this agent while keeping others fixed can lead to a switch of policy. Imagine that currently actions a and b lead to the following sequences of rewards:

(x,a) leads to sequence $r_3, r_3, r_3, r_3, r_1$ and then $r_3$ forever

(x,b) leads to sequence $r_2, r_2, r_2, r_2, r_3$ and then $r_3$ forever

Currently action b is the best. We certainly lose $(r_1 - r_3)$ on the fifth step, but we make up for it by receiving the payoff from $(r_2 - r_3)$ in the first four steps. However, if we start to increase the size of $r_1$, while keeping $r_2$ and $r_3$ the same, we can eventually make action a the most profitable path to follow and cause a switch in policy.
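As a rough numerical sketch of this switch, assume that action b collects the good reward on each of the first four steps, that action a collects the best reward on the fifth step, and that both paths receive the lowest reward on every other step. The discount factor `gamma` is an assumption (this appendix does not fix one), and the function names are illustrative only.

```python
def discounted_return(prefix, tail_reward, gamma):
    """Discounted return of the rewards in `prefix`, then `tail_reward` forever."""
    head = sum(gamma ** t * r for t, r in enumerate(prefix))
    tail = gamma ** len(prefix) * tail_reward / (1.0 - gamma)
    return head + tail


def best_action(r1, r2, r3, gamma=0.9):
    """Compare the two assumed reward paths out of state x."""
    q_a = discounted_return([r3, r3, r3, r3, r1], r3, gamma)  # best reward on step five
    q_b = discounted_return([r2, r2, r2, r2, r3], r3, gamma)  # good reward on steps one to four
    return "a" if q_a > q_b else "b"


def switch_point(r2, r3, gamma=0.9):
    """Value of r1 above which path a overtakes path b (r2 and r3 held fixed)."""
    # Path b is ahead by (r2 - r3) on each of the first four steps;
    # path a is ahead by (r1 - r3) on the fifth, discounted by gamma^4.
    head_gain = sum(gamma ** t for t in range(4)) * (r2 - r3)
    return r3 + head_gain / gamma ** 4
```

With gamma = 0.9, r2 = 1 and r3 = 0, `best_action(2, 1, 0)` returns "b" and `best_action(6, 1, 0)` returns "a". The crossover given by `switch_point(1, 0)` lies a little above 5.2.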



D.2 Strength in W-learning

Because increasing the gaps between rewards may switch policy, we can't say that in general it will increase W-values. In the example above, say the leader $A_k$ was suggesting (and executing) action a all along. By increasing the gaps between our rewards, we suddenly want to take action a ourselves, so $W_i(x) \to 0$.

Increasing the difference between its rewards may cause $A_i$ to have new disagreements, and maybe new agreements, with the other agents about what action to take, so the progress of the W-competition may be radically different. Once a W-value changes, we have to follow the whole re-organisation to its conclusion.

What we can say is that multiplying all rewards by the same positive constant (see §D.4), and hence multiplying all Q-values by that constant, will increase or decrease the size of all W-values without changing the policy.



D.3 Normalisation

Any agent with rewards $r_1, r_2, r_3$ can be normalised to one with rewards $(r_1 - r_3), (r_2 - r_3), 0$. The original agent can be viewed as a normalised one which also picks up $r_3$ every timestep no matter what.

The normalised agent will have the same policy and the same W-values.
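A short derivation of why this holds, as a sketch under assumptions not stated in this appendix: returns are discounted with a factor $\gamma \in [0,1)$, $k$ is the constant subtracted from every reward, and primes mark the normalised agent's values.

```latex
\begin{align*}
Q'(x,a) &= E\Big[\sum_{t=0}^{\infty} \gamma^{t}\,(r_t - k)\Big]
         = E\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\Big] - \frac{k}{1-\gamma}
         = Q(x,a) - \frac{k}{1-\gamma}, \\
Q'(x,a) - Q'(x,b) &= Q(x,a) - Q(x,b).
\end{align*}
```

Every Q-value shifts by the same amount under every policy, so the ranking of actions in each state is unchanged, and any quantity that is a difference of two Q-values in the same state is unchanged as well.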



D.4 Exaggeration

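Theorem: Let agent $A$ have Q-values $Q(x,a)$ and W-values $W(x)$, and let agent $A'$ be identical except that every reward $r$ is replaced by $cr$ for a constant $c > 0$. Then $Q'(x,a) = c\,Q(x,a)$ and $W'(x) = c\,W(x)$.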

Think of it as changing the "unit of measurement" of the rewards.

Proof: When we take action a in state x, let $p(r)$ be the probability that reward $r$ is given to the original agent $A$ (and therefore that reward $cr$ is given to the scaled agent $A'$). Then the expected reward of $A'$ is simply c times the expected reward of $A$:

$$\sum_r c\,r\,p(r) = c \sum_r r\,p(r)$$

It follows from the definitions in §2.1 that $Q'(x,a) = c\,Q(x,a)$ and $W'(x) = c\,W(x)$. $\Box$

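Note that this only works if $c > 0$. For $c < 0$ the ordering of the rewards is reversed, so an agent maximising the scaled rewards is in effect minimising the original ones, and the policy is not preserved in general.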


The scaled agent $A'$ will have the same policy as the original agent $A$, but larger W-values (if $c > 1$) or smaller ones (if $c < 1$).
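The scaling result can be checked numerically on a toy problem. The sketch below is not taken from the thesis: the three-state deterministic MDP, the discount factor, the leader's actions in `LEADER_ACTIONS` and all function names are hypothetical. It also assumes that a W-value is the Q-value an agent loses when the leader's action is executed instead of its own preferred one.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: REWARDS[x][a] is the reward for action a in state x,
# TRANSITIONS[x][a] is the next state. Rewards are drawn from {5, 1, 0}.
REWARDS = [[0.0, 1.0], [5.0, 0.0], [1.0, 0.0]]
TRANSITIONS = [[1, 0], [2, 0], [2, 1]]
LEADER_ACTIONS = [1, 1, 0]  # action some other (leading) agent executes in each state


def q_values(rewards, transitions, gamma=GAMMA, sweeps=500):
    """Tabular Q-value iteration on a deterministic MDP, starting from zero."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        q = [[rewards[x][a] + gamma * max(q[transitions[x][a]])
              for a in range(n_actions)]
             for x in range(n_states)]
    return q


def policy(q):
    """Greedy action in each state."""
    return [row.index(max(row)) for row in q]


def w_values(q, executed):
    """Q-value lost in each state when `executed[x]` is taken instead of the best action."""
    return [max(row) - row[executed[x]] for x, row in enumerate(q)]


def scale(rewards, c):
    """Multiply every reward by the constant c."""
    return [[c * r for r in row] for row in rewards]


if __name__ == "__main__":
    c = 2.0
    q = q_values(REWARDS, TRANSITIONS)
    q_scaled = q_values(scale(REWARDS, c), TRANSITIONS)
    print("same policy:", policy(q) == policy(q_scaled))
    print("W-values:", w_values(q, LEADER_ACTIONS))
    print("scaled W-values:", w_values(q_scaled, LEADER_ACTIONS))
```

Under these assumptions the scaled agent's Q-values and W-values come out as c times the original ones, and the greedy policy is identical.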


