Notes:
and the geometric series 1 + r + r2 + r3 + ... with 0 < r < 1 is a finite sum. Proof.
Consider:
At start, whole map is 1.
After a time, unvisited (x,a) pairs
still have α(x,a) = 1.
Visited ones are down to 1/n
(which means visited how many times?).
Unvisited states x have α(x,b) = 1 for all b.
Each Q-value Q(x,a) converges to Q*(x,a)
where the optimum action to take in x is the action a* s.t.
Q*(x,a*) = maxb Q*(x,b)
But what normally happens is:
Long before the Q-values finalise,
we will have fixated on the best action.
Often surprisingly early,
we find that:
maxb Q(x,b) = Q(x,a*)
(note Q, not Q*)
- that is, the leader now will still be the leader when the
Q-values converge
- the leader now
is actually a*
though we don't know it yet.
In practice, the competition rapidly settles down into a contest between 1 or 2 actions.
"Infinite no. of samples" seems like a very weak convergence result. But it is only demanded because in a probabilistic world there is always a non-zero probability of any pattern of unlucky samples - e.g. Could get 1000 heads in a row, even though the coin is not only not biased towards heads, but in fact is biased towards tails. In reality, we would be very unlucky not to be operating according to an optimal policy quite quickly.
Q. If you did one 1000-toss experiment a day, how long would you wait to see a single 1000-head event?
Another way of looking at it:
As we learn, we have a policy. Eventually this policy is optimal. We can definitely say:
As time goes to infinity, p(policy is optimal) -> 1.In practice, need to be very unlucky not to have optimal policy after short finite time.
e.g. After short finite time, p(policy is optimal) = 0.99999.
Q(y,b) = 0 + γ E(V(next state))V(z) is very high, but y,b only leads there with prob. 0.01.
Actions other than c in z are worthless.
Optimal policy is a in x, b in y, c in z.
But will take a long time to learn this.
Because it's hard to get to y at all.
And even harder to get to z at all.
And you have to get to both many times in order to find b and c.
Given that a real-world experiment will end (has to end) after n timesteps, for some finite n, it is easy to design a MDP such that it is highly unlikely that you will learn an optimal policy in n steps. Can do this no matter what n is, so long as it is finite.
How do we know your problem world is not like this? You just have to hope your problem world is not like this. You just have to hope your n is big enough.
To "forget" old stuff, reset
α = 1.
But in fact you don't have to:
α
may start anywhere along the sequence
and conditions for convergence satisfied.
See Theorem A.2.
So can just keep learning and old stuff is eventually wiped.
e.g. Say world changes from MDP1 to MDP2 after time t. Just keep going with Q-learning and will learn optimal policy for MDP2 (eventually) and will forget what it learnt for MDP1 (eventually). No need to change anything.
RL - Have to actually be in state x
and have to actually execute action a to find out what happens.
Problem - Can't systematically say "for all x,a, go to x, take a"
because don't know how to get to x.
That's the whole problem.
If we knew how to get to any x, no need to learn anything!
Just say "start chess game. go to checkmate" and you're done!
DP doesn't know how to get to x either, but it doesn't have to actually get there. RL does have to actually get to x.
If we don't know Pxa(y), then Q-learning can repeatedly sample the world to construct an optimal policy, if it is a Markov world.
But of course if we don't know Pxa(y), how do we know it's a Markov world!
If we know it's an MDP, we don't need Q-learning!
If we don't know it's an MDP, we don't know if Q-learning works!
In practice, no one ever applies Q-learning to an MDP!
Q-learning is only ever applied to unknown worlds, that we hope can be approximated by an MDP (i.e. applying Q-learning will produce an alright policy).