Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

My big idea: Ancient Brain


CA114      CA170

CA668      CA669      Projects


In the discrete case, we store each Q(x,a) explicitly, and update:


for some learning rate tex2html_wrap_inline6315 which controls how much weight we give to the reward just experienced, as opposed to the old Q-estimate. We typically start with tex2html_wrap_inline6735 , i.e. give full weight to the new experience. As tex2html_wrap_inline6315 decreases, the Q-value is building up an average of all experiences, and the odd new unusual experience won't disturb the established Q-value much. As time goes to infinity, tex2html_wrap_inline6739 , which would mean no learning at all, with the Q-value fixed.

Remember that because in the short term we have inaccurate Q(y,b) estimates, and because (more importantly) in the long term we have probabilistic transitions anyway, the return from the same (x,a) will vary.

We use a different tex2html_wrap_inline6747 for each pair (x,a), depending on how often we have visited state x and tried action a there. When we are in a new area of state-space (i.e. we have poor knowledge), the learning rate is high. We will have taken very few samples, so the average of them will be highly influenced by the current sample. When we are in a well-visited area of state-space, the learning rate is low. We have taken lots of samples, so the single current sample can't make much difference to the average of all of them (the Q-value).

Convergence conditions

[Watkins and Dayan, 1992] proved that the discrete case of Q-learning will converge to an optimal policy under the following conditions.

The learning rate tex2html_wrap_inline6315 , where tex2html_wrap_inline6763 , should take decreasing (with each update) successive values tex2html_wrap_inline6765 , such that tex2html_wrap_inline6767 and tex2html_wrap_inline6769 .

The typical scheme is, where tex2html_wrap_inline6771 is the number of times Q(x,a) has been visited:


If each pair (x,a) is visited an infinite number of times, then Watkins and Dayan showed that for lookup tables Q-learning converges to a unique set of values tex2html_wrap_inline6777 which define a stationary deterministic optimal policy.

Notes on α

The learning rate α,
where tex2html_wrap_inline6763,
should take decreasing (with each update) successive values tex2html_wrap_inline6765,
such that tex2html_wrap_inline6767 and tex2html_wrap_inline6769.

  1. The harmonic series 1 + 1/2 + 1/3 + ... is an infinite sum.
    Proof: Oresme

  2. The infinite series: 1 + (1/2)p + (1/3)p + ... is a finite sum if p > 1 (and infinite for p < = 1).
    Proof: The sum is equal to:
    1   +   ( (1/2)p+(1/3)p )   +   ( (1/4)p+(1/5)p+(1/6)p+(1/7)p )   + ...
    < 1   +   ( (1/2)p+(1/2)p )   +   ( (1/4)p+(1/4)p+(1/4)p+(1/4)p )   + ...
    = 1   +   2 (1/2)p   +   4 (1/4)p   + ...
    = 1 + r + r2 + r3 + ...
    where r = (1/2)p-1
    p-1 > 0, i.e. r < 1


    and the geometric series 1 + r + r2 + r3 + ... with 0 < r < 1 is a finite sum. Proof.

Consider: tex2html_wrap_inline6795 = 1, (1/2)p, (1/3)p, ...
Need: 1 + (1/2)p + (1/3)p + .... to be infinite,
hence: p < = 1
Need: 1 + (1/2)2p + (1/3)2p + .... to be finite,
hence: 2p > 1
hence: tex2html_wrap_inline6797.

tex2html_wrap_inline6795 will satisfy the conditions for any tex2html_wrap_inline6797 .

Map of α(x,a)

Consider map of α(x,a)

At start, whole map is 1.
After a time, unvisited (x,a) pairs still have α(x,a) = 1.
Visited ones are down to 1/n (which means visited how many times?).
Unvisited states x have α(x,b) = 1 for all b.

The optimal policy

Q-learning is asynchronous and sampled - each Q(x,a) is updated one at a time, and the control policy may visit them in any order, so long as it visits them an infinite number of times. Watkins describes his algorithm as "incremental dynamic programming by a Monte Carlo method".

After convergence, the agent then will maximize its total discounted expected reward if it always takes the action with the highest tex2html_wrap_inline6781 -value. That is, the optimal policy tex2html_wrap_inline6529 is defined by tex2html_wrap_inline6785 where:




tex2html_wrap_inline6787 may be non-unique, but the value tex2html_wrap_inline6781 is unique.

Initial Q-values can be random. tex2html_wrap_inline6791 typically starts at 0, wiping out the initial Q-value in the first update. Although the tex2html_wrap_inline6793 term may factor in an inaccurate initial Q-value from elsewhere, these Q-values will themselves be wiped out by their own first update, and poor initial samples have no effect if in the long run samples are accurate (Theorem A.2 later).

Optimal policy found before convergence

Each Q-value Q(x,a) converges to Q*(x,a)
where the optimum action to take in x is the action a* s.t.
Q*(x,a*) = maxb Q*(x,b)

But what normally happens is:
Long before the Q-values finalise, we will have fixated on the best action. Often surprisingly early, we find that:
maxb Q(x,b) = Q(x,a*)
(note Q, not Q*) - that is, the leader now will still be the leader when the Q-values converge - the leader now is actually a* though we don't know it yet.

In practice, the competition rapidly settles down into a contest between 1 or 2 actions.

We usually find that a greedy policy (execute the action with the highest Q-value) is an optimal policy long before the Q-values have converged to their final values.

Need infinite time?

"Infinite no. of samples" seems like a very weak convergence result. But it is only demanded because in a probabilistic world there is always a non-zero probability of any pattern of unlucky samples - e.g. Could get 1000 heads in a row, even though the coin is not only not biased towards heads, but in fact is biased towards tails. In reality, we would be very unlucky not to be operating according to an optimal policy quite quickly.

Q. If coin is biased 75-25 towards tails, what is prob. of 1000 heads in a row?

Q. If you did one 1000-toss experiment a day, how long would you wait to see a single 1000-head event?

p(policy is optimal)

Another way of looking at it:

As we learn, we have a policy. Eventually this policy is optimal. We can definitely say:

As time goes to infinity, p(policy is optimal)   ->   1.

In practice, need to be very unlucky not to have optimal policy after short finite time.

e.g. After short finite time, p(policy is optimal) = 0.99999.

A perverse world

You could deliberately design a difficult world, in which learning the optimal policy took a long time.
Example: Pxa(y) = 0.01, Pyb(z) = 0.01.
No other way of getting to z other than taking b in y (even though most times it won't lead to z).
No other way of getting to y other than taking a in x (even though most times it won't lead to y).
z then has one action c with high rewards.


  1. z has only non-zero reward in system.
    In finite time: Will learn random policy. Never get into z. Never found any rewards anywhere.

  2. There are other rewards in the system, but z has a very high reward so that optimal policy is to go to z.
    Needs to be very high reward in z to make up for the low probability of getting it (high probability of going to other, worthless states if you take b in y).

    Q(y,b) = 0 + γ E(V(next state))

    V(z) is very high, but y,b only leads there with prob. 0.01.
    V(state) is very low for the states that y,b leads to with combined prob. 0.99.
    (0.99) low + (0.01) high
    can still be high if "high" is high enough.
    Hence Q(y,b) can be high if the r for c is high enough.

    Optimal policy is to take actions a in x, b in y, c in z.
    But will take a long time to learn this. Because it's hard to get to y at all. And even harder to get to z at all. And you have to get to both many times in order to find b and c.
    In finite time: Will learn sub-optimal policy of maximising the other rewards. Will think we've done well. Never find out about higher rewards available at z.

Can you learn optimal policy in finite n steps?

It depends what n is.

Given that a real-world experiment will end (has to end) after n timesteps, for some finite n, it is easy to design a MDP such that it is highly unlikely that you will learn an optimal policy in n steps. Can do this no matter what n is, so long as it is finite.

How do we know your problem world is not like this? You just have to hope your problem world is not like this. You just have to hope your n is big enough.

Convergence proof may be of limited use

Convergence proof is nice, but may be of limited use.

Q-learning is attractive because of the simplicity of the computational demands per timestep, and because of this proof of convergence to a global optimum, avoiding all local optima.

  1. World may not be MDP:
    One catch though is that the world has to be a Markov Decision Process. Many interesting worlds are liable to break the MDP model.

  2. Convergence only works with lookup tables:
    Another catch is that convergence is only guaranteed when using lookup tables, while the statespace may be so large as to make discrete lookup tables, with one cell for every combination of state and action, require impractically large amounts of memory. In large statespaces one will want to use some form of generalisation, like a neural network, but then the convergence proof no longer holds.

  3. Maximising r's may not solve the problem:
    Finally, even if the world can be approximated by an MDP, and our generalisation can reasonably approximate lookup tables, the policy that Q-learning finds may be surprising. The optimal solution means maximising the rewards, which may or may not actually solve the problem the designer of the rewards had in mind. The agent may find a policy that maximizes the rewards in unexpected and unwelcome ways. Still, the promise of RL is that designing reward functions will be easier than designing behavior.

    See discussion of designing reward functions here and here.

Maximising r's may not solve the problem:
My agent in this 1996 demo maximises rewards by jumping in and out of the "nest".

Maximising r's may not solve the problem:
An undergrad student of mine, Eoin Murphy, uses RL to play Super Mario in 2016.
His agent maximises r's by jumping up and down. (Starts at 1:08.)

Dynamic Programming - "For all x,a, try a in x"

Recall Dynamic Programming.
In DP - Don't actually visit any state. Don't actually take any action. Can do it all in your head. What would happen if we took action a in x.

RL - Have to actually be in state x and have to actually execute action a to find out what happens.
Problem - Can't systematically say "for all x,a, go to x, take a" because don't know how to get to x. That's the whole problem.
If we knew how to get to any x, no need to learn anything! Just say "start chess game. go to checkmate" and you're done!

DP doesn't know how to get to x either, but it doesn't have to actually get there. RL does have to actually get to x.

Use DP or RL?

If we know Pxa(y), then DP can construct the optimal policy without interacting with the world at all.

If we don't know Pxa(y), then Q-learning can repeatedly sample the world to construct an optimal policy, if it is a Markov world.

But of course if we don't know Pxa(y), how do we know it's a Markov world!


If we know it's an MDP, we don't need Q-learning!

If we don't know it's an MDP, we don't know if Q-learning works!

In practice, no one ever applies Q-learning to an MDP!

Q-learning is only ever applied to unknown worlds, that we hope can be approximated by an MDP (i.e. applying Q-learning will produce an alright policy).

ancientbrain.com      w2mind.org      humphrysfamilytree.com

On the Internet since 1987.

Wikipedia: Sometimes I link to Wikipedia. I have written something In defence of Wikipedia. It is often a useful starting point but you cannot trust it. Linking to it is like linking to a Google search. A starting point, not a destination. I automatically highlight in red all links to Wikipedia and Google search and other possibly-unreliable user-generated content.