Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

My big idea: Ancient Brain


CA114      CA170

CA668      CA669      Projects

Machine Learning

What is Learning?

It must be more than just memory:

Human:		The capital of Ireland is Dublin.
Machine:	OK.
Human:		The capital of France is Paris.
Machine:	OK.
Human: 		What is the capital of Ireland?
Machine:	Dublin.
Human:		Brilliant.

The machine should learn something new.

Generalisation - Learning a function f

The problem of induction:

We have experience of a finite number of situations. We must generalise our experience to a new situation, where we have never been told the answer, and make a good guess, where we are not simply recovering a past answer from memory.

Human:		In situation x1, answer is a1.
Human:		In situation x2, answer is a2.
Human:		In situation x3, answer is a3.
Human:		In situation x4, what is answer?
Machine:	f(x4).
where the machine is "learning" the function f.

Each Input-Output pair (xi, ai) that we show to the machine is called an exemplar. Learning from exemplars like this is called supervised learning.

f(x4) might be an interpolation (if x4 is between 2 of the points x1,..,x3) or an extrapolation (if x4 is outside the previous points).
For example, the machine might progress as follows:
Human:		In situation x1, answer is a1.
Machine:	OK. I am building a function f(x). 
		Currently, f(x) = a1 for all x.

Human:		In situation x2, answer is a2.
Machine:	f is becoming a more complex mapping, 
		that gives different results for different x.
		e.g. Perhaps currently f((x1+x2)/2) = (a1+a2)/2.

Question - If we have 2 points so far, we may (so far) assume f is a straight line. What is the equation of f(x)?

Human:		In situation x3, answer is a3.
Machine:	f is getting more complex.
Maybe now f is no longer a straight line. We still may estimate f((x1+x2)/2) = (a1+a2)/2, but from now we restrict ourselves to interpolations rather than extrapolations.
Human:		In situation x4, what is answer?
Machine:	Using my current definition of f, my answer is f(x4).

If x=0, answer is 0.
If x=100, answer is 100.
If x=50, answer is 95.

If x=25, what is answer?

Representing a function

The idea of Representing a function from exemplars is that "nearby" Input should generate "nearby" Output.

But some functions defeat this simple idea:

Chaotic functions

Chaos Theory demo

Non-chaotic functions (can be approximated to good accuracy by finite data structure)

We do not expect that the learning method from exemplars will be able to learn to represent a chaotic function (or a function in a region where it is behaving chaotically).
To represent the chaotic area the function would have to be learnt to an impractically (perhaps infinitely) high level of granularity.

However we do hope that our unknown function is not actually perverse enough to be chaotic, at least in some local regions.

We expect that the learning method can cope with any reasonably well behaved non-linear, continuous function. For instance, the function:

f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)
Here is the kind of thing we want - a learning method approximating the above function after having seen 1 million exemplars (top) and 5 million (bottom):

See details later.


Fractals - The same pattern repeats at different scales. Repeatedly magnify it and it looks the same. Here we could possibly represent it to reasonable accuracy (not perfect) using a finite data structure, because the range decreases. A small change does not bounce across the global range, as in the chaotic functions above.

e.g. Mandelbrot set:

Public domain image from here.

Equation for f     v.   No equation for f

Note that if we have an explicit equation for f, as above, then we don't need supervised learning at all! - Represent f by equation.

Supervised learning from exemplars is entirely for situations where we have an unknown f - we simply postulate that some mapping from our inputs to our outputs must exist. e.g. Hand-writing recognition. - Represent f by adaptive data structure.

So why do I use explicit f? Because they make nice examples. Use them to imagine what an unknown f might look like. With explicit f, we can see how well the machine is doing. Use explicit f to debug your learning machine when first write it.

Unknown or postulated function

A typical application for which there is a learnable Input-Output relationship:

Character recognition:

Equation for f   v.   Can represent f

No equation for f, but can represent it:
It is interesting to think that even though we can never hope to explicitly define the function mapping a particular hand-drawn bitmap to a value 0-9, we can still learn this correlation.

Equation for f, but can't represent it:
Whereas with the Chaos Theory demo function, we actually know explicitly what the function is, and it is completely deterministic, yet we will not be able to learn to represent it.

It is not whether a mapping is knowable explicitly or not that makes it learnable or not.

Other applications

"Pattern recognition" can mean lots of things. It could mean learning a mapping from Input "state" to Output "behaviour".

e.g. Input state = (wall at certain angle, velocity at certain value), Output behaviour = one of (turn left, turn right, stop, keep going).

It means recognising a dangerous situation on the road. Medical diagnosis. Financial prediction. It means learning to walk, balance, run, jump, climb. Speech (both hearing and production). Vision. Anything where explicit rules are often hard to come by, but correct feedback is at least possible to construct.

Sample applications of Supervised learning

Neural Networks are one of the main types of machines that can implement supervised learning.

Optical character recognition (OCR)

Optical character recognition (OCR) of printed books should be easy (compared to handwriting) but is often surprisingly hard.

Page of an 18th century book.

A typically inaccurate OCR transcript of the above.
Note the points where it has trouble.

Discontinuous function

Some mappings of Input to Output - where there is only a disjoint, discrete mapping of certain inputs to certain outputs, and in no areas any continuous (or even approximation of continuous) mapping - do not lend themselves to learning, only memorisation.

e.g. Capitals:

Human:		The capital of Ireland is Dublin.
Machine:	OK.
Human:		The capital of France is Paris.
Machine:	OK.
Human: 		What is the capital of the UK 
		(the country between Ireland and France)?
Machine:	Emm, my guess is "Dublaris".
Human:		What is the capital of Iceland?
Machine:	Emm, "Dublin"?

There may be a function in that every argument has a unique value generated for it. But it is not continuous - no interpolation is possible.

Do we know a priori if our function f is of this sort? i.e. Is there anything to learn?

Have we listed all the inputs to construct a predictable function?

Remember f might be simply a postulated function - a function that is assumed to exist, but whose properties are unknown (and possibly unknowable):

With jockey of weight 8 stone, our horse won the race.
With jockey of weight 8.5 stone, our horse lost the race.
With jockey of weight 9 stone, our horse won the race.

With jockey of weight 8.25 stone, will our horse win or lose?
With jockey of weight 8.75 stone, will our horse win or lose?

Are these the correct inputs? Are we missing some inputs? How do we know?

Probabilistic / Stochastic / Non-deterministic functions

If x=0, answer is 0.
If x=100, answer is 100.
If x=0, answer is 10.
If x=100, answer is 30.
If x=100, answer is 100.
May be caused by missing inputs.

In fact, must be caused by missing inputs, unless the universe as a whole is probabilistic.

Multi-dimensional functions

If input x is multi-dimensional, how do you build up this function f? How do you represent f?

f could be a function of the series of inputs x1,x2,...,xn that led up to the current input xn, rather than just a function of the immediate input xn.

A simple prediction machine using Distance

No interpolation. No extrapolation. Just memory.

Given a new input x,
find the past exemplar with the smallest distance from x.
Return the output for that past exemplar.

Distance in Multi-dimensions

If x is multi-dimensional - Then what is "nearby"?

Here is one possibility. Let x = (x1,..,xn) and y = (y1,..,yn). Define the Hamming distance between x and y as:

If x and y are vectors of binary digits, then the Hamming distance is just equal to the number of points at which x and y differ.

Now, our algorithm could be:

Given a new input x,
find the past exemplar with the smallest Hamming distance from x.
Return the output for that past exemplar.

The problem with this can be seen in the following. Let the exemplars be:

Temperature 22 degrees, pressure 1000 millibars. No rain.
Temperature 15 degrees, pressure 990 mb. Rain.
Then the question is:
Temperature 17 degrees, pressure 1010 mb. Will there be rain?
The Hamming distance from the first exemplar is 5 + 10, the distance from the second is 2 + 20. So we say it is nearest to the first exemplar.

However, we have just added together two different quantities - degrees and millibars! We could just as easily have said the exemplars were:
Temperature 22 degrees, pressure 1 bar. No rain.
Temperature 15 degrees, pressure 0.990 bars. Rain.
and the question:
Temperature 17 degrees, pressure 1.010 bars. Will there be rain?
Then the distance is 5 + 0.010 from the first, 2 + 0.020 from the second. The second exemplar is closer.

Which is correct? Which one are we closer to?

The answer is really: If it rains, we must have been closer to the second exemplar. If it doesn't rain, we must be closer to the first. "Closer" is defined by observing the output, rather than pre-defined as in a distance metric.

The same applies to any other distance metric, such as the "real" distance metric - the distance between 2 points x = (x1,..,xn) and y = (y1,..,yn) in n-dimensional Euclidean space:

(e.g. consider n=2 or n=3). Again, we can change the scale of one axis without the other, and so change the definition of the "nearest" exemplar. It is just easier to show this with the Hamming distance.

Distance metric is very crude

Even if we do not have the above problem of heterogenous inputs on different scales, a distance metric is still a very primitive form of predictor.

e.g. Consider classifying a bitmap of 100x100 pixel images. The components of the input are all the same type of thing - pixels that are, say, 0 (black) or 1 (white). The distance metric is the number of pixels that are different. This strategy of looking at distance from exemplars is called template matching.

The basic problem with it is that it weighs every dimension (every pixel) as of equal importance.

Consider bitmaps of faces. We are trying to distinguish happy faces from sad faces. Say we introduce a new, happy face with a similar haircut to one of the exemplars with a sad face (i.e. we were told: "This face is sad"). The haircut area of the bitmap scores so many hits under the distance metric that this becomes the exemplar that matches, and we guess that the new input is a sad face.

In fact, matching the exemplar with the same haircut / background / ears is a reasonable first guess - after all, how does the machine know in advance that it should be looking at the mouth and eyes, and ignoring the other parts?

The problem is the system cannot learn this. The distance metric is fixed. It cannot learn that some parts of the input space are irrelevant.

It would be more useful, and could be applied to much more problems, if the machine could learn which are the most important inputs, and learn to weigh some higher than others because they have a stronger affect on output. We just feed inputs into it. If they are irrelevant it will learn to ignore them.

Compare to all previous exemplars?

Another issue, in the previous, is does our algorithm have to compare current input to all previous exemplars? It would be very laborious, and take a long time, to compare every pixel of the input with every pixel of a large number of previous exemplars (and it must be compared to them all, every time). And the list of exemplars to compare to keeps getting larger.

The Neural Network data structure

Neural Networks are a type of data structure that:

  1. can represent a function f from real-valued multi-dimensional input to real-valued multi-dimensional output,
  2. can learn a general non-linear, continuous, non-chaotic function
  3. can build up the representation of f in a variable-resolution manner, learning which components of the input are important and which are not,
  4. can build up the representation from scratch based only on seeing exemplars,

  5. can do this in a fixed-size data structure, that remains fixed in size as the representation improves, [Fixed memory cost]
  6. can return an answer quickly to a given input without having to examine every past exemplar (i.e. the past exemplars are effectively encoded in the current definition of f - instead of having a huge list of past exemplars growing without limit), [Low computational cost]
  7. can return an answer in a finite time (in fact, in exactly the same time for all inputs), [Fixed computational cost]
  8. can return an answer for inputs never seen before (i.e. can make predictions),
  9. is a behaviour-generating module from the start (i.e. we don't gather exemplars and then run a statistical analysis on them - instead we always have a definition of f, which can return an answer at any time).

You may have heard of Neural Networks because they have been advertised as "a type of program that can learn". But they are far more than that. Traditional programs can learn too. Neural Networks have some very specific features (above) that mean they can be used for this problem.

What is memory? How does the brain use its experience?

Example - How do you play chess?

  1. Reasoning. Applying rules. Lookahead search. Memory-free.

  2. Compare current state to explicit memory of all previous matches.
    1. Computational cost - Gets worse with experience. Have to compare to all previous experiences, each time. Takes increasing amount of time.
    2. Memory cost - Grows indefinitely. Need more and more memory as you have more experience. Data structure growing indefinitely in size.

  3. Run current state through a "neural machine" that you have been building for 20 years, changing its parameters (or "weights") with every match you play. You don't remember the matches, but they have left their legacy in the current set of weights.
    1. Computational cost - Input provokes an output after limited amount of computation. (In neural network, after same amount of computation for all inputs.)
    2. Memory cost - More limited. Machine may grow, but not by as much as when had to remember all experiences. (In neural network, data structure is always same size. Nothing changes except the values of the weights.)
Human playing chess: Computer playing chess:

Pattern recognition

Similar issue in pattern recognition (like vision) as in taking actions. Do we have explicit memories?

Consider this: You can easily recognise a 10 euro note (no.3 recogniser machine).
But you probably cannot draw a 10 euro note (no no.2 explicit memory).

What could a "no.3" program look like?

Imagine character recognition as a rule-based program:
IF pixel 0 = white AND pixel 1 = black AND ....
OR if pixel 0 = white AND pixel 1 = white AND ....
OR if ...
then return "3"

ELSE if pixel 0 = white AND ...
OR if ....
then return "5"

ELSE if ...
The neural network will, when finished, actually embody such a function.

ancientbrain.com      w2mind.org      humphrysfamilytree.com

On the Internet since 1987.

Wikipedia: Sometimes I link to Wikipedia. I have written something In defence of Wikipedia. It is often a useful starting point but you cannot trust it. Linking to it is like linking to a Google search. A starting point, not a destination. I automatically highlight in red all links to Wikipedia and Google search and other possibly-unreliable user-generated content.