Dr. Mark Humphrys

School of Computing. Dublin City University.

Recall we are trying to maximise long-term discounted reward.
Taking action a in state x yields a long-term discounted reward:

r + γ r' + γ^2 r'' + ...

where r is the immediate reward and r', r'', ... are the rewards received on subsequent timesteps.

Q(x,a) values

The Q-learning agent builds up Quality-values (Q-values) Q(x,a) for each pair (x,a).

We tend to take the action with the highest Q-value:

max_b Q(x,b)
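
A small sketch in Python of "tending to" take the best action (the ε-greedy scheme, the dictionary form of Q, and the names here are illustrative assumptions, not part of these notes):

import random

def select_action(Q, x, actions, epsilon=0.1):
    # Mostly exploit: pick the action b with the highest Q(x,b).
    # With small probability epsilon, explore a random action instead.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda b: Q.get((x, b), 0.0))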

The Q-value will express the expected discounted reward if you take that action in that state:

Q(x,a) = E ( r + γ r' + γ^2 r'' + ... )

This can be built up recursively:

Q(x,a) = E ( r + γ max_b Q(y,b) )

where y is the state we find ourselves in after taking action a in state x.

Naive attempt at a learning algorithm

You are in state x, take action a, get reward r, new state y. Then update:

Q(x,a) := r + γ max_b Q(y,b)

Recall Assignment Notation

Problem: We allow a probabilistic world (an MDP), where taking the same action a in the same state x can have different results:
r may vary. y may vary.
We want the Q-value to average these results, instead of just reflecting the most recent one.

With the above rule, the Q-value will bounce back and forth:

Q(x,a) := 5
Q(x,a) := 10
Q(x,a) := 6

More sensible to build up an average:

Q(x,a) := 5
Q(x,a) := 1/2 (5 + 10)
Q(x,a) := 1/3 (5 + 10 + 6)

How do we do this? Do we need an ever-growing memory of all past outcomes?

Building up a running average
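
If A is the average of the first n-1 samples and d is the n-th sample, then:

A := A + (1/n) ( d - A )

makes A the average of all n samples seen so far. So we only need to store the current average and a count, not the whole history of samples. A minimal sketch in Python (the names are illustrative only):

def update_average(A, n, d):
    # A = average of the first n-1 samples, d = the n-th sample.
    # Returns the average of all n samples.
    return A + (1.0 / n) * (d - A)

A = 0.0
for n, d in enumerate([5, 10, 6], start=1):
    A = update_average(A, n, d)
    print(n, A)    # 5.0, then 7.5 = 1/2 (5+10), then 7.0 = 1/3 (5+10+6)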

Core Q-learning algorithm

In 1-step Q-learning, after each experience (take action a in state x, receive reward r, observe new state y) we update:

Q(x,a) := (1-α) Q(x,a) + α ( r + γ max_b Q(y,b) )

where α is the learning rate, 0 < α ≤ 1 (remember notation).

The Q-value is updated in the direction of the current sample.

Note that we are updating from the current estimate of Q(y,b), which is itself constantly changing.

If we store each Q(x,a) explicitly in lookup tables, we can let the learning rate decay per pair, e.g. α = 1/(1+n) where n is the number of times (x,a) has been visited so far, so that the Q-value builds up a running average of its samples. The update is then:

Q(x,a) := (1 - 1/(1+n)) Q(x,a) + (1/(1+n)) ( r + γ max_b Q(y,b) )
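
A minimal lookup-table sketch in Python (the dictionary storage, the 1/(1+visits) learning rate, and all names here are illustrative assumptions):

def q_update(Q, visits, x, a, r, y, actions, gamma=0.9):
    # Q and visits are dictionaries keyed by (state, action) pairs.
    n = visits.get((x, a), 0)
    alpha = 1.0 / (1 + n)                                   # decaying learning rate
    best_next = max(Q.get((y, b), 0.0) for b in actions)    # max_b Q(y,b)
    sample = r + gamma * best_next                          # current sample of Q(x,a)
    Q[(x, a)] = (1 - alpha) * Q.get((x, a), 0.0) + alpha * sample
    visits[(x, a)] = n + 1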

Q-values are bounded

Since the rewards are bounded (rmin ≤ r ≤ rmax), it follows that the Q-values are bounded.


Proof: In the discrete case, Q is updated by:

Q(x,a) := (1-α) Q(x,a) + α ( r + γ max_b Q(y,b) )

so by Theorem B.1 the Q-values stay within the bounds:

Qmin ≤ Q(x,a) ≤ Qmax

where Qmax = rmax / (1-γ) and Qmin = rmin / (1-γ).

This can also be viewed in terms of temporal discounting:

Qmax = rmax + γ rmax + γ^2 rmax + ... = rmax / (1-γ)
Qmin = rmin + γ rmin + γ^2 rmin + ... = rmin / (1-γ)



For example, if γ = 0, then Qmax = rmax. And (assuming rmax > 0) as γ → 1, Qmax → ∞.

Note that since γ < 1, it follows that Qmax and Qmin are finite.
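
A quick numerical check of the bound (the values rmax = 1 and γ = 0.9 are arbitrary):

rmax, gamma = 1.0, 0.9
# rmax + gamma*rmax + gamma^2*rmax + ... (truncated after 1000 terms)
print(sum(rmax * gamma**t for t in range(1000)))   # approximately 10.0
print(rmax / (1 - gamma))                          # 10.0 = Qmax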

Qmax and Qmin

Qmax = rmax / (1-γ)

i.e. If you take this action, you will get immediate reward rmax,
and you can get the next reward rmax (it is the best you can get in the next state; other actions may be worse),
and you can get the next reward rmax after that, and so on.

Qmin = rmin / (1-γ)

i.e. If you take this action, you will get immediate reward rmin,
and you have to get the next reward rmin (all actions in the next state lead to the worst reward rmin, otherwise the Q-value would be higher),
and you have to get the next reward rmin after that, and so on.

i.e. You are trapped in an attractor from which you cannot escape.
You can never again reach a state with rewards greater than rmin (and such states must exist).
This need not be a single absorbing state; it may be a family of states which you can never leave.

Note that if the expected reward E(r) = rmin, then you will definitely get rmin on this action (r is deterministic, though y can still be probabilistic).
For if any reward r > rmin had non-zero probability (1-p) > 0, then:

E(r) = p rmin + (1-p) r
     > p rmin + (1-p) rmin
     = rmin
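
For example (illustrative numbers), suppose rmin = 0 and a reward r = 1 occurs with probability 1-p = 0.1. Then E(r) = 0.9 * 0 + 0.1 * 1 = 0.1 > 0 = rmin.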

Wikipedia: Sometimes I link to Wikipedia. I have written something In defence of Wikipedia. It is often a useful starting point but you cannot trust it. Linking to it is like linking to a Google search. A starting point, not a destination. I automatically highlight in red all links to Wikipedia and Google search and other possibly-unreliable user-generated content.