Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA249      CA318      CA425      CA651

w2mind.computing.dcu.ie      w2mind.org

Missing
DCU student

CASE3 student Paul Bunbury is missing since Thur 2 Feb 2012.
See appeals on crime.ie and garda.ie and facebook.

He is a great coder. See DCU page and boards.ie page.
He won major coding contests in 2010 and 2011.
He is author of the brilliant "FloodItWorld".
DCU can confirm that in Jan 2012 he passed all 6 modules comfortably.


The Control Policy


2.2.3 The Control Policy

Consider - In state x, choice of 50 actions. Pick 1, useless.
Eventually, 10,000 timesteps later, we're in x again, pick another 1 (action), useless.
Eventually, in x again and by chance we pick the future a*. Normally this gives r with p=0.8. Unfortunately this is one of the p=0.2 times so it returns 0. We can't tell difference between it and the other actions.
Even worse, next time we are in x, we try a worse action b which normally (p=0.9) gives reward 0, but just this once returns r = 1. So right now we think b is the best.
We have to try b multiple times in x to realise how good it is, and we have to try a* multiple times in x to see its probabilities.


Random learning

Random learning: Take random action in each state.
Do random learning for finite time, then exploit.

Problem - May only ever visit boring, mainstream states - only small part of the entire statespace.
Interesting area of statespace: Hard to visit. Need skilled sequence of actions to even get into these states. But high r's.


Heuristic

When can't try all a in all x repeatedly in the finite time available:

Non-random exploration policy to do "smart" exploration in the finite time available. Get focused on a small no. of actions quickly. Get increasingly narrow-minded. Need a heuristic rule: some rule to focus on promising actions and states.

Note however that if we have any policy other than try all a's endlessly in all x's, we may be unlucky and miss the optimal actions. A heuristic is "a rule that may not work" (could be unlucky). In a large state-space we may have no choice and have to take that chance.

Q. What other heuristics could we use?




Boltzmann "soft max" probability distribution

Instead of random actions, we tend to pick actions that have generated rewards before. As time goes on we only focus on one or two actions in each state, e.g. above we gradually focus on a* and b in state x, and eventually solve that tie-break.

Here the selection is between actions a, which have Q-values Q(x,a). In state x, the probability of picking action a, px(a), is:

Highest Q will always be strictly more probable.




Example

Say Q-values start at zero:

a        a1    a2    a3    a4   a5   a6
Q(x,a)   0     0     0     0    0    0
After some time, they are as follows (actions on 0 haven't been tried yet):
a        a1    a2     a3    a4    a5   a6
Q(x,a)   0     3.1   -3.3   0    -3.5  3.3
Now a high temperature would still pick actions randomly:
a      a1    a2    a3    a4    a5    a6
p(a)   1/6   1/6   1/6   1/6   1/6   1/6

A moderate temperature would pick actions with probability something like:

a      a1    a2    a3    a4    a5  a6
p(a)   0.2   0.3   0     0.2   0   0.3
A low temperature would pick actions with probability:
a      a1  a2    a3    a4  a5  a6
p(a)   0   0.1   0     0   0   0.9


Tried v. Untried actions

A good question is how does our policy deal with actions that haven't been tried yet. Q-values may start at any values. 0 is not necessarily neutral. 0 isn't necessarily small (or large). It might be a high or low Q-value.


  1. If we initialise all Q-values to Qmax (remember bounds), then every action will get tried at least once?
    Only if we have time. If not, we could finish after finite time with the best actions being all actions we have never tried!
    i.e. After learning, our policy of pick the best action yields a more-or-less random policy!

    Could change it to distinguish between tried and untried actions:
    e.g. After learning, policy is: "Take best Q-value of an action that has been tried.".


  2. If we initialise all Q-values to Qmin, then some actions might get dropped without ever having been taken.

    Could change it to distinguish between tried and untried actions:
    Be narrow-minded about a's we have tried, random about a's we haven't tried?
    Toss a coin and either pick random untried action or else pick from tried actions with Boltzmann.


Note we do know that the action hasn't been tried, because we are keeping a different α for each pair (x,a) (see here).


Q(x,a) = 0 isn't necessarily a neutral initial value

Starting Q-values at 0 isn't necessarily neutral. e.g. All r's in the system are negative.
But even if we have a "normal" reward function:
if (good thing) r = 1
else if (bad thing) r = -1
else r = 0
then 0 is a neutral reward, but that doesn't mean it's a neutral Q-value.
e.g. If (bad thing) happens all the time, no matter what you do, and if the best an optimal policy can manage is a few less (bad thing)'s than other policies, and a very rare (good thing), then the best Q-value might still be negative.


Animals v. Computers

Deterministic control policy - Do finally settle on one action with p=1, others p=0. - "Computers" - Does same thing every time - Brittle - Infinite loops - Crashes

Stochastic control policy - Still have non-zero probabilities at run-time. - "Animals" - Usually, but not totally predictable - Don't "crash" or get stuck in infinite loop - Will eventually do something different



Admittedly, animals not getting stuck in infinite loop is also to do with: world changes (no absorbing states), internal state changes (hunger), ongoing competition among multiple drives, etc.


I say in PhD that even when finished learning: "instead of using a deterministic strict highest Q I used a low temperature Boltzmann". - I am assuming I do not have an optimal policy (e.g. I have only approximated Q, perhaps not MDP anyway).

Makes no sense to do this if I have an optimal policy. - Then should use strict highest Q.





Feeds      HumphrysFamilyTree.com

Bookmark and Share           On Internet since 1987.