School of Computing, Dublin City University.


A further reason why W-learning underperformed is that we still haven't found the ideal version of W-learning. Remember that using only subspaces for the W-values results in a loss of accuracy. Using the full space for W would allow a more sophisticated competition.
Consider the competition between the dirt-seeker A_d and the smoke-seeker A_f. For simplicity, let the global state be x = (d,f). A_d sees only the states (d), and A_f sees only (f). When the full state is x = (d,5), A_f simply sees all these as the single state (5), that is, smoke is in direction 5. Sometimes A_d opposes it, and sometimes, for no apparent reason, it doesn't. But A_f averages all these experiences together into one W-value. It is a crude form of competition, since A_f must present the same W-value in many different situations where its competition will want to do quite different things. The agents might be better able to exploit their opportunities if they could tell the real states apart and present different W-values in each one.
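The aliasing problem just described can be made concrete with a small sketch (illustrative Python, not the thesis code; the tables and numbers are invented):

```python
# Two agents whose W-values are keyed on subspaces of the global state
# x = (d, f): the dirt-seeker sees only d, the smoke-seeker only f.

W_d = {}          # dirt-seeker's W-table, keyed on d alone
W_f = {5: 0.7}    # smoke-seeker's W-table: ONE number for every state (d, 5)

def winner(x):
    """Strict highest-W competition on the global state x = (d, f)."""
    d, f = x
    # The smoke-seeker must present the same W in (0,5), (1,5), (2,5), ...
    # even though the dirt-seeker's opposition differs between those states.
    return "A_d" if W_d.get(d, 0.0) > W_f.get(f, 0.0) else "A_f"

# The smoke-seeker cannot tell these apart, so it bids identically in all:
print([winner((d, 5)) for d in range(3)])   # → ['A_f', 'A_f', 'A_f']
```

Keying W on the full state x would let each agent bid differently in (0,5), (1,5) and (2,5).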
If we are to make the x in W(x) refer to the full state, then each agent needs a single neural network to implement the function W(x). The agent's neural network takes a vector input x and produces a floating-point output W(x). The Q-values can remain as subspaces, of course. We are back basically to the same memory requirements as Hierarchical Q-learning: subspaces for the Q-values, and then n times the full state x.
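As a rough illustration of that memory comparison, here is a lookup-table-equivalent count (all sizes are made-up assumptions, not the testbed's actual dimensions):

```python
# Memory count for n agents, assuming a global state x = (d, f)
# with 10 values per component and 5 actions (illustrative numbers).
n = 2
subspace_size, num_actions = 10, 5
full_size = subspace_size ** 2              # size of the full space x

q_memory = n * subspace_size * num_actions  # Q_i(x_i, a) over subspaces
w_memory = n * full_size                    # W_i(x) over the full state
print(q_memory, w_memory)                   # → 100 200
```

The n structures over the full state dominate, just as the switch's Q(x, i) does in Hierarchical Q-learning.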
Recall that if the winner is to be the strict highest W, we start with the W-values random negative, and leave the leading W_k unchanged, waiting for it to be overtaken. This works for lookup tables, but will not work with neural networks. First, because trying to initialise W to random negative values is pointless, since the network's values will make large jumps up and down in the early stages while its weights are untuned. Second, because even if we do not update the leading W_k, it will still change as the other values learnt by the same network change. And if the net doesn't see a state for a while, it will forget its value for it.
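Why a "frozen" leading value cannot actually stay frozen can be seen even in a minimal linear model with shared weights (a sketch under assumed numbers, not the thesis architecture):

```python
# W(x) = w . x, one weight vector shared across all states: training on
# one state necessarily moves the prediction at another, untouched state.
import numpy as np

w = np.zeros(2)
x_lead  = np.array([1.0, 1.0])   # state where W_k leads; never trained on
x_other = np.array([1.0, 0.0])   # state the net keeps training on

before = w @ x_lead
for _ in range(100):             # gradient steps toward target 5.0 at x_other
    w += 0.1 * (5.0 - w @ x_other) * x_other
after = w @ x_lead

print(round(before, 2), round(after, 2))   # → 0.0 5.0
```

The "unchanged" W at x_lead has drifted from 0 to 5 without ever being updated there.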
We could think of various methods to try to repeatedly clamp the leading W_k, but it seems all would need extra memory to remember what value it should be clamped to.
The approach we took instead was: start with the W-values random. Do one run of 30000 steps with random winners, so that every agent experiences what it is like to lose, and remembers these experiences. Then each agent replays its experiences 10 times to learn from them properly. Note that when learning W-values in a neural network, we are just doing updates of the form:

W_i(x) := Q_i(x, a_i) - (r_i + γ max_b Q_i(y, b))

No W-value is referenced on the right-hand side, unlike the case of learning the Q-values. Hence there is no need for our concept of backward replay.
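The two-phase scheme can be sketched as follows (illustrative code, assuming the W target is the observed loss as above; the Q-table, states and rewards are invented):

```python
gamma = 0.9
actions = [0, 1]

def w_target(Q_i, x, a_i, r_i, y):
    """Loss to agent i when its preferred action a_i was not executed:
    what it predicted for a_i, minus what it actually received.
    No W appears on the right-hand side, so replay order is irrelevant."""
    return Q_i[(x, a_i)] - (r_i + gamma * max(Q_i[(y, b)] for b in actions))

# A tiny fixed Q-table for one agent (made-up numbers):
Q_i = {("s0", 0): 1.0, ("s0", 1): 0.4, ("s1", 0): 0.2, ("s1", 1): 0.1}

# Phase 1: one exploratory run with random winners, storing an experience
# (x, a_i, r_i, y) each time this agent was not obeyed:
experiences = [("s0", 0, 0.0, "s1")] * 5

# Phase 2: replay each stored experience 10 times; the targets the
# W-network is trained toward are fixed numbers, usable in any order:
targets = [w_target(Q_i, *e) for _ in range(10) for e in experiences]
print(round(targets[0], 2))   # → 0.82
```

Because each target is a plain number rather than a bootstrapped W-estimate, replaying forwards, backwards or shuffled all give the same training signal.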
With a neural network architecture similar to before, the best combination of agents found, scoring 14.871, was:
which is better than W-learning with subspaces, but still not as good as W=Q. A problem with this method of random winners is that it will actually build up each W_i(x) to be the average loss over all the other agents in the lead:

W_i(x) = (1 / (n-1)) Σ_{k≠i} W_i^k(x)

where W_i^k(x) is the expected loss for A_i when A_k is the leader, for k ≠ i. So what we are doing is in fact finding:

max_i Σ_{k≠i} W_i^k(x)
This sum doesn't really mean anything. For example, it is certainly not the loss that the current leader is causing for the agent.
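A small numeric check of why this average is uninformative (the loss figures are invented):

```python
# A_1's true expected loss in some state x under each possible leader:
loss_when_leads = {"A_2": 1.0, "A_3": 0.0}   # made-up numbers

# With random winners, each other agent leads equally often, so W_1(x)
# is trained toward the plain average of these losses:
W_1 = sum(loss_when_leads.values()) / len(loss_when_leads)
print(W_1)   # → 0.5
```

0.5 is neither the 1.0 that A_2 actually costs A_1 nor the 0.0 that A_3 costs it: whichever agent currently leads, W_1 misstates the loss it is causing.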
Using random winners is equivalent to a stochastic highest W strategy with a fixed high temperature. We would probably have got better results if we had used a more normal stochastic highest W, one with a declining temperature. This would have multiple trials, replay after each trial, and a temperature declining over time as in §4.3.2. But we have some confirmation that telling states apart is a good thing. In the next section, we find out what happens when we can tell states apart perfectly.
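A stochastic highest W with a declining temperature could be sketched like this (Boltzmann selection; the schedule and W-values are illustrative, not the thesis parameters):

```python
import math
import random

def pick_winner(W, T):
    """Choose the leader with probability proportional to exp(W_i / T)."""
    agents = list(W)
    weights = [math.exp(W[a] / T) for a in agents]
    return random.choices(agents, weights=weights)[0]

W = {"A_d": 0.9, "A_f": 0.3}
random.seed(1)
for trial in range(4):
    T = 2.0 * (0.5 ** trial)   # temperature halves after each trial's replay
    wins = sum(pick_winner(W, T) == "A_d" for _ in range(1000))
    print(trial, T, wins)      # the highest-W agent wins ever more often
```

At a high fixed T every agent leads about equally often, which is exactly the random-winners regime above; as T falls, the competition approaches strict highest W.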