School of Computing. Dublin City University.


We normally store the Q-values in a lookup table.
Question: Why does this not work if x, a contain real-number components?
Answer: After some experience, our lookup table would look like this:
x          Q(x,a1)   Q(x,a2)   Q(x,a3)
313.4009   0         [0.244]   0
313.7562   [1.355]   0         0
313.7564   0         [0.155]   0

current input x:
313.3170   0         0         0

With real-numbered states, every state is new. x has not been visited, so we don't know what to do. Every state x will have precisely one Q-value in precisely one column, since we only visited it once and tried one action.
NOTE: In fact, technically, a digital computer system will never deal with actual real numbers. There will always be some limit to the no. of decimal places (because there is a limit to the no. of bytes the floating-point number is stored in). So technically, a lookup table is possible for any computer/robot system. But it could be massive, perhaps bigger than memory.
In practice, with a large no. of states / a large no. of decimal places, we get the same type of problems as if the states were actually real-valued: we never see the same x twice during even a very long experiment, the lookup table is huge (if not actually infinite), etc.
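A quick sketch of this problem (the real-valued state here is a hypothetical angle, and all the names are my own, not from the notes): a lookup table keyed on real-valued states gains one brand-new entry per step, so nothing learned about one state ever helps with another.

```python
# Illustration: a dict-based Q lookup table with real-valued states.
import random

Q = {}  # lookup table: (state, action) -> Q-value

for step in range(10000):
    x = random.uniform(0, 360)   # real-valued state, e.g. an angle
    a = random.randrange(3)      # one of 3 discrete actions
    if (x, a) not in Q:          # in practice, a brand-new state every time
        Q[(x, a)] = 0.0
    Q[(x, a)] += 1.0             # pretend we learned something here

# Every (x, a) entry was created and updated exactly once:
print(len(Q))                    # 10000 separate entries
print(all(v == 1.0 for v in Q.values()))  # True
```

Nothing generalises: the table grows linearly with experience and the current input x never matches a stored key.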
So in this situation
instead of using a lookup table to represent our state-space in RL,
we will use a neural network.
Input = (x,a).
Output = Q(x,a).
Need to transfer info from Q(x,a) to "nearby" (x,a) combinations
("nearby" x's, "nearby" a's).
Revision: Designing the Network
In particular, our choice of how to separate the input nodes will decide how effective our network can be in learning the Qvalues.
Might seem strange that we have to make these design decisions to get an effective network, but remember that our coarse-grained discrete lookup table is itself a generalisation. It is already a hand-designed initial approximation, deciding to separate directions into 8 compass points.
Consider x = (f,n,p,i), where f takes values 0..9, n and p also 0..9, i takes values 0 or 1, and a = 0..8.
No. of possible (x,a) combinations = 10 x 10 x 10 x 2 x 9 = 18,000.
Suppose we used a single input node, with the (x,a) combination coded as a number from 1 to 18,000. You might be able to separate the inputs if they were grouped together. e.g. The net could separate inputs 15,000 and greater (i.e. 15,000-18,000) by having a hidden node with weight 1/15000 and threshold 1.
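A sketch of that single hidden node (the function name is my own; the weight and threshold are the ones above):

```python
# One hidden node on the single-number input scheme:
# weight 1/15000, threshold 1 -> fires only when input/15000 > 1.
def fires(inp):
    return inp * (1.0 / 15000) > 1

print(fires(16000))  # True  (16000/15000 > 1)
print(fires(14000))  # False (14000/15000 < 1)
```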
The problem is, the inputs you want to group together
are likely to be scattered all over the 1-18,000 range.
e.g.
Separate (*,*,9,*,*) from the rest
Separate (*,9,0-7,*,*) from the rest
Remember that when you learn in a neural net, you backprop the error and change all weights, thus interfering with the output for all other inputs. We adjust the weights to make the net better for x. That may make it worse for x'. If inputs lead to different outputs, we need to learn to separate them.
Only when weights or inputs are zero along some connection is there absolutely no interference. Zero weights/inputs are our main trick in helping the net separate the inputs.
Say a hidden node has threshold t_{j} = 50,
and is connected to only 1 input,
with w_{ij} = 1
(i.e. weight zero on the other links).
Then it only fires for the region where input > 50.
If w_{ij} = -1 and threshold t_{j} = -50,
it only fires for input < 50.
In this way different parts of the net fire for different regions
of n-dimensional input space.
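The two cases above can be sketched as follows (a step-threshold node with a single nonzero weight; the function is my own illustration):

```python
# A hidden node connected to one input: fires iff (weight * input) > threshold.
def fires(inp, weight, threshold):
    return inp * weight > threshold

# w = 1, t = 50: fires only for input > 50
print(fires(60, 1, 50))    # True
print(fires(40, 1, 50))    # False

# w = -1, t = -50: fires only for input < 50
print(fires(40, -1, -50))  # True
print(fires(60, -1, -50))  # False
```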
But we must give it enough links and enough hidden nodes so it can specialise.
Noting that each component of the input takes only
a small number of discrete values, we can instead have 10 + 10 + 10 + 2 + 9
= 41 input nodes, each taking values 0 or 1.
(Every input will have precisely 5 1's and 36 0's.)
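A sketch of this encoding (the function name and the ordering of the blocks of nodes are my own assumptions):

```python
# Encode (f,n,p,i,a) as a 41-component 0/1 vector:
# one node per possible value of each component.
def encode(f, n, p, i, a):
    vec = [0] * 41
    vec[f] = 1          # f: nodes 0-9
    vec[10 + n] = 1     # n: nodes 10-19
    vec[20 + p] = 1     # p: nodes 20-29
    vec[30 + i] = 1     # i: nodes 30-31
    vec[32 + a] = 1     # a: nodes 32-40
    return vec

v = encode(3, 9, 0, 1, 4)
print(len(v))   # 41 input nodes
print(sum(v))   # precisely 5 1's (so 36 0's)
```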
We can rank all our schemes, from most interference to least: a single input node, then 41 binary input nodes, then one net per action (below), and finally the lookup table (no interference at all).
In fact, to separate the inputs further,
because we have only a small number of discrete actions,
we could have one net per action.
We have 9 separate nets.
Each takes a vector input x and produces a floating point output:
Q_{a}(x) = Q(x,a)
Now each net has 10 + 10 + 10 + 2 = 32 input nodes, each 0 or 1.
Q-value = Σ_{k} w_{k} y_{k} - t
i.e. the output node's weighted sum of the hidden-node outputs y_{k}, minus its threshold.
(No need to normalise it with a sigmoid. It goes from -infinity to +infinity.)
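A minimal sketch of this one-net-per-action architecture (the hidden-layer size, weight initialisation, and all names are my own assumptions; only the 9 nets, 32 binary inputs, sigmoid hidden layer and linear output come from the notes):

```python
import math
import random

class ActionNet:
    """One net per action: 32-component binary x -> unbounded Q_a(x)."""
    def __init__(self, n_inputs=32, n_hidden=8):
        rnd = lambda: random.uniform(-0.1, 0.1)
        self.w_hidden = [[rnd() for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.t_hidden = [rnd() for _ in range(n_hidden)]
        self.w_out = [rnd() for _ in range(n_hidden)]
        self.t_out = rnd()

    def q(self, x):
        # hidden layer: sigmoid units, y_k = sigmoid(sum_i w_ik x_i - t_k)
        y = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(ws, x)) - t)))
             for ws, t in zip(self.w_hidden, self.t_hidden)]
        # output node: weighted sum minus threshold, no sigmoid,
        # so the Q-value can be any real number
        return sum(w * yk for w, yk in zip(self.w_out, y)) - self.t_out

nets = [ActionNet() for _ in range(9)]   # Q_a(x) = nets[a].q(x)

x = [0] * 32
for pos in (3, 19, 20, 31):              # an encoded state: four 1's
    x[pos] = 1

q_values = [net.q(x) for net in nets]
best_action = max(range(9), key=lambda a: q_values[a])
```

To act, we query all 9 nets on the same encoded x and pick the action with the highest output; updating one net cannot interfere with the Q-values of the other 8 actions.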