Dr. Mark Humphrys School of Computing. Dublin City University. My big idea: Ancient Brain Search:

Help on displaying equations

```
```

```

```

# 17 Further work

Finally, we examine possible further work to consider where this approach to action selection is leading.

# 17.1 Further empirical work

This thesis may look like some kind of definitive empirical comparison of these methods but it is in fact far from the last word. The real contribution of this thesis is to invent these methods in the first place, show their motivation, their expected strengths and weaknesses, place them in context relative to each other, and show how the initial empirical results support what the theory would predict. There are a number of untested methods (though their expected performance has been discussed in the text):

• Untested methods that we expect to be useful
• The pure Minimize the Worst Unhappiness when actions are shared (§13).
• W-learning with full space with a declining Boltzmann distribution for W instead of random W winners (§10).

• Untested methods that we do not expect to be interesting
• Other static measures of W, such as W = importance (§5.4). We expect them to have the same problem as static W=Q, of not being able to take opportunities (see §9.1).
• Any scaled static or dynamic measures of W (§8.1.2, §15.4.5). These won't provide any advantage because we do not want to fix agents' inequalities relative to each other (see §8.1.2).
• The Collective methods (§12). [Whitehead et al., 1993] found that Maximize Collective Happiness performed not much different from W=Q (while both were better than monolithic Q-learning).
• The product method of Maximize Collective Happiness (§15.5.2) - once the correction is made that rewards (and hence Q-values) are all made > 0.
• The standard deviation method of Maximize Collective Happiness (§12.4).

The artificial world I used increased in complexity throughout the thesis, but as was revealed in §9.1, still proved too simple to properly separate the two approaches of Maximize the Best Happiness and Minimize the Worst Unhappiness. To properly contrast the two approaches, they need to be tested in an environment in which the rewards of opportunism (and the costs of not being opportunistic) are greater.

When testing the Collective methods, the world should also be one where some agents have non-recoverable states. This world probably doesn't have enough such states. If not obeyed at some point, an agent can almost always recover its position within a small number of steps.

As just discussed, the definitive empirical test probably needs more than just a global fitness function as a linear combination of rewards. It should also include an explicit penalty for solutions that ignore subgoals.

# 17.2 Scaling up Reinforcement Learning

In scaling up Reinforcement Learning to complex problems (such as robot control), ways of breaking up the problem and its statespace are clearly needed. The problem is normally phrased (e.g. see [Kaelbling et al., 1996, §6.3]) as that of task decomposition - breaking up the problem into subtasks that must be achieved in serial or in parallel.

It would be misleading to see this thesis as providing a solution to the general problem of decomposing problems into sub-tasks. Rather, its contribution is, given a collection of agents, find automatic ways of resolving their competition so that they can share the same creature in a sensible manner. Once put together, the agents in W-learning automatically find each other's weak spots, and discover the areas on which they can and cannot compromise.

Then the problem becomes one of designing a suitable collection of agents. This is not at all trivial, but it is hoped that it will be easier than designing the details of their action selection. We do have some rules of thumb to help in designing these collections - for example, to increase the influence of an agent in the creature's mind, slowly increase the differences between its rewards (recall §8.1.3).

This system has the property of being able to "absorb" change - increasing an agent's strength doesn't guarantee that it will do better in every single state. In some states, a change of leader might reawaken a dormant agent, who suddenly finds that the agent that used to fight its battles for it is no longer winning. In other words, an agent increasing its strength will be unlikely to suddenly ruin all the other agents. They will give up their ground slowly, giving up less important states but holding on to the end to their crucial states. This is all done automatically.

The promise of RL is that designing a reward function will be easier than designing the behavior itself. Similarly, in addressing the Action Selection problem, it is hoped that designing well-balanced collections (and letting the competition resolution be automatic) will be easier than designing the detailed action selection itself.

## 17.2.1 What problems can this be applied to?

In general, we cannot expect to apply these techniques to all multi-goal problems. If goals are so different that they effectively can't be interrupted or interleaved, the ideas in this thesis may not be of much help.

Note however that the W-learning algorithm does not force goal-interleaving. It only discovers whether goal-interleaving is possible. Competition can actually result in serial behavior in practice. Imagine agents that have almost no areas of agreement, where one agent's best action is always the other's worst action, and vice versa. As soon as one agent gets started (gets to take a few actions down its goal path) it will be too strong compared to the others that are not started, since it will have taken us into states where they can get no reward, yet at the same time they would cause it a considerable loss if they take over. Their loss by not getting their goal started cannot compete with its loss if it loses the lead. So it will fight them off and run until completion. Then another agent gets a chance, and runs to termination. And so on. In practice, goals are satisfied serially in this case. This is simply an extreme version of the persistence-enforcing positive-feedback mechanism.

## 17.2.2 Competition in Single Goal problems

This thesis has been concerned with the Action Selection problem - the problem of choice between multiple conflicting goals at run-time. I now turn all this on its head by applying it to the problems that Reinforcement Learning has been primarily interested in - problems having one single goal.

Nothing in our model forces the competing agents to be actually trying to solve different goals. To solve a single-goal problem, we put together a large number of agents with different reward functions but all roughly trying to solve the same thing - each perhaps taking radically different approaches, with statespaces and actionspaces that share nothing in common - and we let them struggle via W-learning to solve it. If there are multiple unexpressed behaviors, and lots of overlap, this may be a small price to pay if there is a robust solution.

This has some resemblance to Minsky's "accumulation" of multiple different ways of solving a problem in his Society of Mind [Minsky, 1986]. Some may be more efficient than others, but we keep them all for robustness. Or more subtly, it might be hard to rank them in a single order. Some may be more efficient some times (for some x) but less efficient other times (for other x).

Having a noisy collection of competing agents to solve the problem may seem a dangerous strategy, but remember that if the creature is looking as if it is going to make some serious error, individual agents will come in with large W-values and take it all over for a while. We positively want collections that single agents can take over and dominate (such as Minimize the Worst Unhappiness) if they feel strongly enough.

Note that we could try something like this in Hierarchical Q-learning too (but would pay the price of a large (x,i) statespace).

# 17.3 Different source of numbers

In this thesis, I introduced Reinforcement Learning first, and then showed how these numerical values could be propagated further to solve the action selection problem. But I could just as easily have introduced the action selection methods and left open where the numerical values came from. They could also be evolved randomly, designed explicitly, or come from some other learning method.

## 17.3.1 Symbolic W-learning

In particular, any system that builds up "fitness" values for actions F(x,a) can be used as the basis of W-learning type schemes. If a symbolic AI system can answer either the question before the fact: "You want to take action . If I force you to take action , how bad will that be?" or the question after the fact: "OK I ignored you and did something. How bad was it?" then a symbolic AI W-learning system can be built.

# 17.4 Parallel W-learning

As should have been obvious throughout, and as is illustrated by Figures 5.3 and 6.1, W-learning almost demands a parallel implementation. Agents can work out their suggested action and update their Q and W values independently, there being no serial links between them at all.

Most parallel multi-agent systems postulate low-bandwidth communications between agents anyway. With basic (not Negotiated) W-learning there is perfect parallelism, where there is no communication between agents at all. What communication there is takes place between agents and the switch. Similar to schemes such as Maes' Spreading Activation Networks [Maes, 1989, §7.2], no problem-specific agent language is used. Only numbers are communicated.

Consider again Watkins' motivation [Watkins, 1989] of finding learning methods that might plausibly occur in nature, in particular methods that impose limited computational demands per timestep. Once the Q-values have been learnt, Q-learning provides a fast control policy - ignore rewards, don't learn, and simply act on the best Q-value. A Q-learning-controlled creature is a pure stimulus-response machine. Parallel W-learning would retain all the speed of stimulus-response in a more sophisticated architecture.

How to parallelise is explained in more detail in §14. See the "Number of updates required per timestep per agent" and also the "Restrictions on Decentralisation".

# 17.5 The Bottom-up Animat approach

The whole motivation for the behavior-based AI project (see [Dennett, 1978]) is to understand simple complete creatures and gradually move on to more complex complete creatures (as opposed to the traditional AI approach of trying to understand sub-parts of already-complex creatures in the hope of combining these parts into an understanding of a whole complex creature).

The behavior-based AI approach involves a certain discipline in not moving on to more complex models until simpler ones have been exhausted and understood. There is a great temptation in the House Robot problem to start equipping the creature with internal working memory, context, world models, map-building ability, a concept of time, plans, explicit goals, representations, symbols, reasoning, and so forth. One ends up with a hand-designed ad-hoc system from which very few if any general principles can be learnt.

I take the broad approach of Wilson [Wilson, 1990] in pushing the limits of simple animat models, and resisting as long as possible the temptation to introduce more complex models:

For example, Todd et al. [Todd et al., 1994] show that even the possibilities of sensor-less(!) memory-less creatures have not yet been fully explored. Their creatures find (by evolution) the optimum p(a) , the probability of generating an action (independent of the state of the world).

Certainly it is clear from this thesis that the possibilities of stimulus-response have not remotely been exhausted. In particular, societies of stimulus-response agents. Instead of building more complex components, I'm heading in the other direction, of more complex societies of relatively simple components. Actually I call these relatively simple components because they are stimulus-response, but in fact making hierarchies or societies out of components as complex as entire learning behaviors has only been tried relatively recently.

Where all this is leading is towards ecosystem-like minds, with structure in their collections, multiple competitions, and dynamic creation and destruction of agents. But I have been building carefully from the bottom up, instead of jumping direct to such complex systems.

# 17.6 Dynamically changing collections of agents

## 17.6.1 The Ecosystem of Mind

This work was partly inspired by Edelman's vision of a "rainforest" mind, with life and death, growth and decay, in a dynamic ecosystem inside the head [Edelman, 1989, Edelman, 1992]. In his model, called Neural Darwinism, competition is between structures called neuronal groups.

The idea is appealing, but it is difficult to tell whether neuronal groups are meant to be action-producing agents or just pattern classifiers. Edelman presents no explicit algorithm to show how the model is implemented. In fact, presenting arguments only about symbolic AI, he draws the familiar conclusion that no computer algorithm[1] could implement his ideas. No mention is made of self-modifying, reinforcement-learning agents embedded in the world, to which his criticisms do not apply.

Here, with W-learning, we have the basis for a full living and dying ecosystem inside the creature, where what comes into existence, struggles, and perhaps dies is not mere connections (as Edelman may in fact mean) but entire autonomous goals. In standard W-learning, there is one competition between the agents, this is resolved, and the creature is then fixed. In a dynamic collection, agents would continually adjust their W-values as new agents came into existence, old agents died, and the nature of their competition changed.

## 17.6.2 Invasion of strong new agents

In the earlier version of this document [Humphrys, 1996a, §17.5.1] I briefly discussed the possibilities for constructing dynamically changing collections. The basic problem with a dynamic collection though, is, if there is an automatic generator of random new agents, that it can just generate a single new, extremely strong agent (large differences between its rewards) which immediately takes over the whole creature, turning a carefully-built creature into a zombie overnight.

In fact, there would be an "arms race" between such strong agents over the lifetime of the creature, as the agent-generator generated an even stronger agent which took over in its turn, and so on. The creature would change from being one type of zombie to another over its life. Our dilemma is caused by the fact that we want the mind to be driven by strong individuals in each state (we don't want woolly collective decision making). But then one strong individual can take over.

There are a number of possible ways of making collections hard to invade without going back to majority-rule, but at this point we must ask how useful this line of thinking really is from the point of view of engineering.

When thinking about animal minds, it is attractive to think of new agents (ideas, memes) coming into existence during the life history, and trying to claim some space in the already-crowded mind. But from the point of view of engineering, concepts like "individual" are only useful myths. If a collection is invaded and destroyed, we simply note that the new collection didn't work, reset it to what it used to be and try something else. Resisting invasion may be important in biology to preserve the integrity of the individual, but in engineering this is just another way of building things. Only if we sent our robot out into the world, with the ability to communicate, imagine, and form new goals unsupervised over long periods of time, might questions of maintaining individual integrity start to arise. Even then, an artificial system can restore a past backup in a way that a natural system cannot.

So for the moment, dynamic collections are not so interesting since in their basic form they are just another way of finding combinations of agents in the same collection. More interesting for now is to discuss whether the collection itself can start to have any structure. This is the subject of the final chapter.

[1] This may all be a matter of terminology. Edelman's co-worker Olaf Sporns says [Sporns, 1995] that they do not like to use the word algorithm for a self-modifying algorithm embedded in the real world. A computer scientist would say that it still is an algorithm, even if it has effectively become part of the world.

```
```