Dr. Mark Humphrys, School of Computing, Dublin City University.

# Specialisation

For the network to work, it is crucial that the hidden units do different things. They cannot all be the same. They must specialise on different aspects of the input-output mapping.

All we need do is start them off randomly different to get this process going.


# Initialisation

Note from the graph of the sigmoid function that for a large positive or negative summed input x, the slope dy/dx is very small.

dy/dx = y(1-y), and at either end, one of these terms is near zero.

Hence for large absolute xk, ∂yk/∂xk is near zero, and ∂E/∂wjk is near zero too.

A large absolute summed input x (caused by large absolute weights) therefore causes only a small change in the weights, and so very slow learning.

Large absolute weights cause slow learning.
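To see how quickly the slope collapses, here is a minimal sketch that tabulates y and dy/dx = y(1-y) for a few summed inputs. The function names and the sample values of x are chosen for illustration only.

```python
import math

def sigmoid(x):
    """Logistic sigmoid y = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """Slope dy/dx = y (1 - y) of the sigmoid at summed input x."""
    y = sigmoid(x)
    return y * (1.0 - y)

# The slope is largest at x = 0 and shrinks rapidly as |x| grows.
for x in (0, 1, 2, 5, 10, -10):
    print(f"x = {x:>3}   y = {sigmoid(x):.6f}   dy/dx = {sigmoid_slope(x):.6f}")
```

Since every weight change is multiplied by this slope, a node sitting at |x| = 10 learns several thousand times more slowly than one sitting near x = 0.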


# Initialisation Strategy

Initialise with small absolute weights (fast learning).

Question - How small is "small"?
We may have to graph the sigmoid function to make sure we are actually in the area of rapid change.

Obviously zero would be small enough.


# Initial w's = 0?

Let all weights and thresholds start at zero.

Forward pass:

• All yj's = 1/2.
All yk's = 1/2.

Backprop, output layer:

• yk's all same,
but (yk - Ok)'s are different at each k.
Let ek = (yk - Ok)
∂E/∂wjk = ek (1/2)(1/2)(1/2)
All weights leading into the same node k have the same ∂E/∂wjk, hence receive the same change, hence stay equal to each other.
We have different weights, but in groups:
w11=w21= ... =wj1=...
w12=w22= ... =wj2=...
...
w1k=w2k= ... =wjk=...
...

∂E/∂tk = ek (1/2)(1/2)(-1)
tk's changed by different amounts, become all different.

Backprop, previous layer:

• ∂E/∂yj = Σ k [ ek (1/2) (1/2) wjk ]
= (1/4) [ e1 wj1 + e2 wj2 + ... ]
= (1/4) [ e1 wp1 + e2 wp2 + ... ]
for any other hidden node p
Hence ∂E/∂yj same for all hidden nodes j.
∂E/∂wij = ∂E/∂yj (1/2)(1/2) Ii = c Ii, where c = (1/4) ∂E/∂yj is the same for all j
different for each i, same for all j
We have different weights, but in groups:
w11=w12= ... =w1j=...
w21=w22= ... =w2j=...
...
wi1=wi2= ... =wij=...
...

∂E/∂tj = c(-1)
tj's all same, stay same.

Hidden nodes are all the same, stay the same:

• For any two hidden nodes p and q
p has incoming weights w1p,w2p,..., threshold tp, outgoing weights wp1,wp2, ...
q has incoming weights w1q,w2q,..., threshold tq, outgoing weights wq1,wq2, ...
These are the same as p's weights and thresholds.
p and q are identical nodes, and stay identical.

We have a symmetrical network. The hidden units march locked in step. Each node stays identical to the others in the hidden layer. Hidden units don't specialise. The net can't work (as we saw when designing them). No point having n hidden units if they're all the same. You might as well only have 1 hidden unit.

This makes sense. How could the network pick a hidden node to specialise on some part of the problem? Surely whichever node it picked, it would make the same changes for the other nodes too.

Output nodes are different.
But hidden units are all the same, and stay the same.
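The argument above can be checked numerically. The following is a minimal sketch, assuming squared error E = ½ Σ (yk - Ok)², summed input minus threshold at each node, and plain gradient descent with learning rate C. The network size, the exemplars and the function name `train_from_zero` are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_from_zero(exemplars, n_in, n_hid, n_out, C=0.5, epochs=200):
    """Backprop on a one-hidden-layer net with all weights and thresholds starting at 0."""
    w_ij = [[0.0] * n_hid for _ in range(n_in)]    # input i -> hidden j
    t_j = [0.0] * n_hid
    w_jk = [[0.0] * n_out for _ in range(n_hid)]   # hidden j -> output k
    t_k = [0.0] * n_out
    for _ in range(epochs):
        for I, O in exemplars:
            # Forward pass.
            y_j = [sigmoid(sum(w_ij[i][j] * I[i] for i in range(n_in)) - t_j[j])
                   for j in range(n_hid)]
            y_k = [sigmoid(sum(w_jk[j][k] * y_j[j] for j in range(n_hid)) - t_k[k])
                   for k in range(n_out)]
            # Output layer: d_k = e_k y_k (1 - y_k), with e_k = y_k - O_k.
            d_k = [(y_k[k] - O[k]) * y_k[k] * (1 - y_k[k]) for k in range(n_out)]
            # Hidden layer: dE/dy_j = sum_k d_k w_jk, then times y_j (1 - y_j).
            d_j = [sum(d_k[k] * w_jk[j][k] for k in range(n_out)) * y_j[j] * (1 - y_j[j])
                   for j in range(n_hid)]
            # Gradient descent. Thresholds enter with a factor of -1.
            for j in range(n_hid):
                for k in range(n_out):
                    w_jk[j][k] -= C * d_k[k] * y_j[j]
            for k in range(n_out):
                t_k[k] -= C * d_k[k] * (-1)
            for i in range(n_in):
                for j in range(n_hid):
                    w_ij[i][j] -= C * d_j[j] * I[i]
            for j in range(n_hid):
                t_j[j] -= C * d_j[j] * (-1)
    return w_ij, t_j, w_jk, t_k

# Hypothetical exemplars: 2 inputs, 2 outputs.
exemplars = [((0.0, 1.0), (0.9, 0.1)),
             ((1.0, 1.0), (0.9, 0.1)),
             ((1.0, 0.0), (0.9, 0.9))]
w_ij, t_j, w_jk, t_k = train_from_zero(exemplars, n_in=2, n_hid=3, n_out=2)
print("incoming weights of each hidden node:", [[w_ij[i][j] for i in range(2)] for j in range(3)])
print("hidden thresholds:", t_j)
print("output thresholds:", t_k)
```

Each hidden node should print the same incoming weights and the same threshold as the others, while the output thresholds come out different.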


# Exercise

Multiple identical hidden units are useless. You can achieve the same effect with one hidden unit and different weights.
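One way to check the claim numerically is sketched below, for a single output node. Identical hidden units all produce the same y, so their contribution to the output's summed input is y times the sum of their outgoing weights. All numbers and names here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output(I, w_in, t_hid, w_out, t_out):
    """Single output of a one-hidden-layer net.
    w_in[j] lists the incoming weights of hidden unit j,
    w_out[j] is its outgoing weight to the output node."""
    y_hid = [sigmoid(sum(w * x for w, x in zip(w_in[j], I)) - t_hid[j])
             for j in range(len(w_in))]
    return sigmoid(sum(w_out[j] * y_hid[j] for j in range(len(y_hid))) - t_out)

incoming, threshold = [0.4, -0.7], 0.2

# Three identical hidden units, each with outgoing weight 0.5 ...
big = dict(w_in=[incoming] * 3, t_hid=[threshold] * 3, w_out=[0.5, 0.5, 0.5], t_out=0.1)
# ... versus one hidden unit whose outgoing weight is the sum, 1.5.
small = dict(w_in=[incoming], t_hid=[threshold], w_out=[1.5], t_out=0.1)

for I in ([0.0, 0.0], [1.0, 0.0], [0.3, 0.9]):
    print(I, output(I, **big), output(I, **small))
```

The two nets agree on every input (up to rounding), so the extra hidden units add nothing.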


# Conclusion

We can't initialise to zero. Small random positive or negative values are best. See Sample code for initialisation.
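The following is not the sample code referred to above, just a minimal sketch of the idea. It assumes weights and thresholds are drawn uniformly from [-r, r]. The default r = 0.1 and the function names are arbitrary choices for illustration.

```python
import random

def init_weights(n_from, n_to, r=0.1, rng=random):
    """n_from x n_to matrix of weights drawn uniformly from [-r, r]."""
    return [[rng.uniform(-r, r) for _ in range(n_to)] for _ in range(n_from)]

def init_thresholds(n, r=0.1, rng=random):
    """n thresholds drawn uniformly from [-r, r]."""
    return [rng.uniform(-r, r) for _ in range(n)]

# Example: 2 inputs, 3 hidden units, 2 outputs.
w_ij, t_j = init_weights(2, 3), init_thresholds(3)
w_jk, t_k = init_weights(3, 2), init_thresholds(2)
print(w_ij)
```

The values are small, so every node starts in the steep part of the sigmoid, and they are all different, so the hidden units are free to specialise.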

Also, exemplar outputs in training should lie in the range 0.1 to 0.9 rather than 0 to 1, or else very large weights develop. The network cannot actually output exactly 0 or 1 without at least one weight going to plus or minus infinity, which causes problems for other exemplars.
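To put numbers on this, we can invert the sigmoid: y = 1/(1+e^-x) gives x = ln(y/(1-y)). The sketch below (the function name is made up) shows the summed input a node needs in order to produce a given target output.

```python
import math

def required_summed_input(y):
    """Summed input x at which the sigmoid outputs y: x = ln(y / (1 - y))."""
    return math.log(y / (1.0 - y))

# The closer the target is to 1, the larger the summed input must be.
for y in (0.5, 0.9, 0.99, 0.999999):
    print(f"target y = {y}   needs x = {required_summed_input(y):.3f}")
```

A target of 0.9 needs only a modest x = ln 9, whereas the required x grows without bound as the target approaches 1. No finite x gives y = 1 exactly.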


# Over-learning

A network can start with random values and learn to get rid of these.

But of course that means it can learn to get rid of good values over time as well. It can't tell the difference.

If it doesn't see an exemplar for a while, it will forget it. For all it knows, it has just started learning, and the weights it has now are just a random initialisation! It keeps learning, wiping out anything too far in the past.

Learning = Forgetting!

Extreme case: We show it one exemplar repeatedly, e.g. show it "Input x leads to Output 1" 1 million times in a row. The "laziest" way for the network to represent this is to just send the weights to infinity (or minus infinity for negative Input), so Output = 1 no matter what the Input. i.e. Instead of "x -> 1" it learns "* -> 1".

If we show it "x -> 1" a million times, then all weights may be recruited to help "x -> 1". Normally, showing it "x -> 1" does have an effect on all weights, but this effect is countered by the effects of other exemplars. The net resolves this tension by specialisation, where some weights are more or less irrelevant in some areas of the input space. Since they have little effect on the error there (though never exactly zero if outputs are continuous), the backprop algorithm ensures they are hardly modified. Then when we show it "x -> 1" once, it does have an effect on such a weight, but the effect is negligible.
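A scaled-down sketch of the extreme case: a single sigmoid unit y = sigmoid(w x - t), shown the one exemplar "1 -> 1" ten thousand times. The single-unit setup, the learning rate C = 1 and the step count are assumptions made to keep the example small.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def overlearn(x, target=1.0, C=1.0, steps=10000):
    """Train one sigmoid unit on a single exemplar, over and over."""
    w, t = 0.0, 0.0
    for _ in range(steps):
        y = sigmoid(w * x - t)
        d = (y - target) * y * (1.0 - y)   # e * y(1-y)
        w -= C * d * x                     # dE/dw = d * x
        t -= C * d * (-1)                  # dE/dt = d * (-1)
    return w, t

w, t = overlearn(x=1.0)
print("w =", w, " t =", t)
# Probe with inputs the unit was never shown.
for probe in (1.0, 0.5, 0.0):
    print(f"input {probe}: output {sigmoid(w * probe - t):.3f}")
```

The weight keeps growing and the threshold keeps falling, so the output is pushed towards 1 for the other probe inputs too. With no competing exemplars, nothing stops the drift towards "* -> 1".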


# How does specialisation happen?

How does the process of specialising work? - As the net learns, it finds that for each weight, the weight has more effect on E for some exemplars than others. It is modified more as a result of the backprop from those exemplars, making it even more influential on them in the future, and making the backprop from other exemplars progressively less important.


# Strategies for Teaching

First, exemplars should have a broad spread. By all means show it "x -> y", but if you want it to learn that some things do not lead to y, you must show it explicitly that "(NOT x) -> (NOT y)". e.g. In learning behaviour, some actions lead to good things happening, some to bad things, but most actions lead to nothing happening. If we only show it exemplars where something happened, it will predict that everything leads to something happening, good or bad. We must show it the "noise" as well.

How do we make sure it learns and doesn't forget? - If exemplars come from a training set, we just make sure we keep re-showing it old exemplars. But exemplars may come from the world, where we have no control over what it sees, so it forgets old and rare experiences.

One solution is to have learning rate C decline over time. But then it doesn't learn new experiences.
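As a small illustration of the trade-off, here is one possible decay schedule. The formula C0 / (1 + n / tau) and its parameter values are assumptions, not something these notes prescribe.

```python
def learning_rate(n, C0=0.5, tau=1000.0):
    """Hypothetical schedule: C falls from C0 towards 0 as the exemplar count n grows."""
    return C0 / (1.0 + n / tau)

for n in (0, 1000, 10000, 1000000):
    print(f"after {n:>7} exemplars: C = {learning_rate(n):.6f}")
```

Early exemplars are protected because later updates are tiny, but for the same reason an exemplar seen late barely moves the weights at all.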

A possible solution if exemplars come from the world is internal memory and replay of old experiences. We do not remember exemplars as a lookup table, as we had before, but remember them in order to repeatedly push them through the network. But the same question arises as before - Does the list grow forever?

See Example of remembering exemplars and replaying without needing infinite memory.
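The sketch below is not the example linked above. It is one standard way to keep a bounded memory, namely reservoir sampling: however many exemplars arrive from the world, only `capacity` are stored, and each exemplar seen so far has an equal chance of being among them. The class and method names are made up.

```python
import random

class ReplayMemory:
    """Fixed-size store of past exemplars, filled by reservoir sampling."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = rng or random.Random()

    def remember(self, exemplar):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(exemplar)
        else:
            # Keep the newcomer with probability capacity / seen,
            # overwriting a uniformly chosen old entry.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = exemplar

    def replay(self, n):
        """Return up to n stored exemplars to push through the network again."""
        return self.rng.sample(self.items, min(n, len(self.items)))

memory = ReplayMemory(capacity=5)
for step in range(1000):
    memory.remember((step, step % 2))   # (input, output) pairs from "the world"
print(memory.replay(3))
```

After each new experience we would train on that exemplar plus a few replayed ones, so old and rare experiences keep being re-shown while memory use stays constant.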
