Dr. Mark Humphrys

School of Computing. Dublin City University.


State-space control

Basic Idea: Learning from Rewards

Unlike supervised learning (learning from exemplars), we don't tell the learner the correct "class" / action. Instead we give sporadic, indirect feedback (a bit like "this classification was good/bad").

e.g. Move your muscles to play basketball. I can't articulate what instructions to send to your muscles / robot motors and in what order. But a child could sit there and tell you when you have scored a basket. In fact, even a machine could detect and automatically reward you when a basket is scored.

Robo-Hoops robot basketball competition (autonomous).

Rebound Rumble robot basketball competition (part autonomous, part remote-controlled).

Clocksin and Moore - Traffic Junction

Clocksin, William F. and Moore, Andrew W. (1989), Experiments in Adaptive State-Space Robotics, Proceedings of the 7th Conference of the Society for Artificial Intelligence and Simulation of Behaviour (AISB-89).

Translated into the terms we will be using:

  1. Observe state of the world x = (p,s)
    position and speed of car on main road
    p - 21 values
    s - 20 values
    x has 420 possible values

  2. Take action a = (c,n)
    c - which pedal - 2 values (accelerate, brake)
    n - how much (press pedal this hard) - 5 values
    10 possible actions a

  3. Observe if situation = not crossed, crossed, or collision.

Already we see typical things:

  1. Many more states than actions.
  2. Both are multi-dimensional.
  3. Definition of x and a is very much under our control. Could make it more coarse-grained / fine-grained.

If we tried out every possible action in every possible state, there would be 420 x 10 = 4200 experiments to carry out.
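The counts above can be checked by enumerating the spaces directly. This is only a sketch: the notes give the number of values for each variable, not how they are encoded, so the integer ranges and pedal names below are assumptions.

```python
from itertools import product

# Assumed encodings: the notes give only the counts (21, 20, 2, 5).
POSITIONS = range(21)               # p: 21 discretised positions
SPEEDS = range(20)                  # s: 20 discretised speeds
PEDALS = ("accelerate", "brake")    # c: which pedal
PRESSURES = range(1, 6)             # n: 5 levels of pedal pressure

STATES = list(product(POSITIONS, SPEEDS))      # x = (p, s)
ACTIONS = list(product(PEDALS, PRESSURES))     # a = (c, n)
EXPERIMENTS = list(product(STATES, ACTIONS))   # every (x, a) pair

print(len(STATES), len(ACTIONS), len(EXPERIMENTS))  # 420 10 4200
```

Making x or a more fine-grained (say 100 positions instead of 21) multiplies the number of experiments accordingly.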

Traditional Approach

Build model of Physics.
Take distance (p - junction)
Time for car to cover distance given speed s
Time it takes agent to cross road
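A minimal sketch of what such a hand-built model might look like. It assumes the car keeps a constant speed and that distances and times are in continuous units. The function names and the safety margin are hypothetical, not taken from the paper.

```python
def time_to_junction(distance_m: float, speed_mps: float) -> float:
    """Time for the car to reach the junction, assuming constant speed."""
    if speed_mps <= 0:
        return float("inf")  # a stationary car never arrives
    return distance_m / speed_mps


def safe_to_cross(distance_m: float, speed_mps: float,
                  crossing_time_s: float, margin_s: float = 1.0) -> bool:
    """Cross only if the agent clears the road before the car arrives."""
    arrival_s = time_to_junction(distance_m, speed_mps)
    return crossing_time_s + margin_s < arrival_s


print(safe_to_cross(distance_m=100.0, speed_mps=10.0, crossing_time_s=4.0))  # True
print(safe_to_cross(distance_m=30.0, speed_mps=10.0, crossing_time_s=4.0))   # False
```

Every number here (speed, crossing time, margin) has to be measured and kept accurate by the programmer, which is where the restrictions listed next come from.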

Problems / Restrictions:

  1. Need model in first place. Need a controlled world. e.g. Factory environment.

  2. Model must be accurate.
    e.g. Dynamics of a robot arm.

  3. World changes / Arm friction increases - Have to re-program.
    But programmer is long gone.

State-Space Approach

Look at consequences of actions.
"Let the world be its own model"
If action a worked, keep it.
If not, explore other action a2.
After many iterations, we learn the correct action patterns to any level of granularity.
And we never had to understand how the world worked!
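The loop above can be sketched in a few lines. This is a toy illustration, not the method of the paper: the `world` function is a hypothetical black box standing in for the real world, and the state and action sets are made up. The learner only ever sees whether an action worked.

```python
import random

STATES = range(5)    # hypothetical toy state set
ACTIONS = range(3)   # hypothetical toy action set


def world(x: int, a: int) -> bool:
    """Hypothetical black-box world: did action a work in state x?
    The learner never looks inside this function."""
    return a == x % 3


def learn_policy(trials: int = 200, seed: int = 0) -> dict:
    rng = random.Random(seed)
    policy = {}                           # x -> an action that worked there
    failed = {x: set() for x in STATES}   # actions already seen to fail in x
    for _ in range(trials):
        x = rng.choice(STATES)            # the world presents some state
        if x in policy:
            continue                      # an action worked here: keep it
        untried = [a for a in ACTIONS if a not in failed[x]]
        a = rng.choice(untried)           # explore another action
        if world(x, a):
            policy[x] = a
        else:
            failed[x].add(a)
    return policy


print(sorted(learn_policy().items()))
# [(0, 0), (1, 1), (2, 2), (3, 0), (4, 1)]
```

If `world` is replaced by one with different rules, the same learner finds the new working actions without any change to its code.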

We learn the mapping:

x, a -> y
initial state, action -> new state

  1. This approach will work whether we cross the road using wings or fins, or while viewing the world through reversing glasses.

  2. Can adjust (re-learn) as world changes.

  3. More plausible that evolution could have worked this way (fill in the "boxes") rather than building physics models.

  4. Another reason to use state-space (or other) learning is simply that the task is tedious to program. Tedious may mean expensive - programmers aren't free.

Learning adapts to the actual laws of physics and the actual body:
Faith, a dog born with no front legs. Learned to walk on two legs.

Can you do exhaustive search?

If you can do exhaustive search, you don't need RL or any complex learning.

More usual: Only have time to try some actions in some states.

Many mappings that we could learn:

x -> a

x,a -> y

Multiple y's:
e.g. If you are in state x and take action a
50% of the time you will end up in state y1
and 50% of the time you will end up in state y2

e.g. x = (7,5)
a = (1,5)
y1 = (6,5)
y2 = (7,5)
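One way to learn a mapping with multiple y's is to count what actually happened and turn the counts into estimated probabilities. A minimal sketch, using the x and a from the example above. The log of four observed outcomes is made up for illustration.

```python
from collections import Counter, defaultdict

# counts[(x, a)][y] = number of times action a in state x led to state y
counts = defaultdict(Counter)


def record(x, a, y):
    """Store one observed transition x, a -> y."""
    counts[(x, a)][y] += 1


def transition_probs(x, a):
    """Estimated P(y | x, a) from the observations so far."""
    seen = counts[(x, a)]
    total = sum(seen.values())
    return {y: n / total for y, n in seen.items()}


x, a = (7, 5), (1, 5)
for y in [(6, 5), (7, 5), (7, 5), (6, 5)]:   # hypothetical log of 4 trials
    record(x, a, y)

print(transition_probs(x, a))  # {(6, 5): 0.5, (7, 5): 0.5}
```

With few trials the estimates are rough. They approach the true probabilities only as the same (x, a) pair is tried many times.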

x,a -> quality or fitness

E(r) exists, E(y) doesn't exist

We can add rewards - to get "expected reward" (average reward you will get over many events).

Whereas adding states is meaningless:
"Expected next state = ½ (y1 + y2)"

In example above, ½ (y1 + y2) = (6 ½, 5)
Expected state?
If you take action a, do you ever go to this state?
Does this state even exist?
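The same point in code, using the example above. The rewards attached to the two outcomes are hypothetical (the notes do not give any). The states and probabilities are the ones from the example.

```python
# (probability, next state y, reward r) - rewards are made up for illustration
outcomes = [
    (0.5, (6, 5), 1.0),
    (0.5, (7, 5), 0.0),
]

# Averaging rewards is meaningful: this is the expected reward E(r).
expected_reward = sum(p * r for p, _, r in outcomes)

# Averaging states is arithmetic we can carry out, but the result
# need not be a state at all.
mean_state = tuple(sum(p * y[i] for p, y, _ in outcomes) for i in range(2))
reachable = {y for _, y, _ in outcomes}

print(expected_reward)           # 0.5
print(mean_state)                # (6.5, 5.0)
print(mean_state in reachable)   # False
```

The "expected state" (6.5, 5.0) is never visited, and on a grid of whole-number positions it is not a state at all.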

Clocksin and Moore paper

They mention the following without noting that both of these may be difficult:
  1. Find "neighbouring" states x.
  2. Get "average" of multiple actions a.

They make connections to what animals, children and adults do in:
  1. Play
  2. Sleep
  3. Dreaming

Writing a program to write a program

The machine can write a program x -> a only if we can think of a program that will write this program.

This may require restricting the domain. e.g. Below we will restrict ourselves to writing a stimulus-response program. This is a well-understood model, and our program will provably write an optimal solution.

Genetic Programming is a program to write any general-purpose program - Too far too fast?

Sample applications of Reinforcement Learning

Robot playing air hockey by Reinforcement Learning.
From Darrin C. Bentivegna.

Robot learning to flip pancakes by RL.
From Petar Kormushev.

Google DeepMind's Deep Q-learning (RL) playing Atari Breakout.

