Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

My big idea: Ancient Brain


CA114      CA170

CA668      CA669      Projects

RL - The world

A Reinforcement Learning (RL) agent senses a world, takes actions in it, and receives numeric rewards and punishments from some reward function based on the consequences of the actions it takes. By trial-and-error, the agent learns to take the actions which maximize its rewards.

The basic algorithm of the agent:

we build up estimates of Q(x,a) = how good action a is in state x
start with Q(x,a) random or zero

do for a long time
  observe x
  suggest a according to some algorithm
  execute a
  observe y
  receive reward r(y) or r(x,y)
  modify Q(x,a) according to some algorithm

a can be "no-op".

Note that having a reward every time step is actually no restriction. The classic delayed reinforcement problem is just a special case with, say, most rewards zero, and only a small number of infrequent rewards non-zero.

Sample definitions of x

Markov Decision Process (MDP)

A simple way of modelling an uncertain world is to model it as a Markov Decision Process (MDP).

The agent observes discrete states of the world x ( tex2html_wrap_inline6339 , a finite set) and can execute discrete actions a ( tex2html_wrap_inline6343 , a finite set). Each discrete time step, the agent observes state x, takes action a, observes new state y, and receives immediate reward r.

Transitions are probabilistic, that is, y and r are drawn from stationary probability distributions tex2html_wrap_inline6279 and tex2html_wrap_inline6281 , where tex2html_wrap_inline6279 is the probability that taking action a in state x will lead to state y and tex2html_wrap_inline6281 is the probability that taking action a in state x will generate reward r. We have tex2html_wrap_inline6377 and tex2html_wrap_inline6379 .

Probabilistic Transitions

Start in state x:

In state x,
taking action b gives a guaranteed reward of 1.5.
Taking action a normally leads to a reward of only 1, but sometimes leads to a reward of 5.

Pxa(y) = 0.8
Pxa(z) = 0.2
Pxa(w) = 0 for all w other than y,z
(probabilistic transition)

Pxb(s) = 1
Pxb(w) = 0 for all w other than s
(deterministic transition)

Markov and stationary

The transition probability can be viewed as a conditional probability. Where tex2html_wrap_inline6381 are the values of x,a at time t, the Markov property is expressed as:


and the stationary property is expressed as:


World may just look probabilistic

Deterministic world may just look probabilistic because of poor senses.
e.g. Imagine we sense direction but not distance.
Features of the world that are deterministic can actually appear probabilistic to the agent.
This is illustrated in Figure. When in the situation on the left, the agent will experience: tex2html_wrap_inline6893 , tex2html_wrap_inline6895 , tex2html_wrap_inline6897 . When in the situation on the right, the agent will experience: tex2html_wrap_inline6893 , tex2html_wrap_inline6895 , tex2html_wrap_inline6903 .

Probabilistic transitions tex2html_wrap_inline6279 .

Non-Markovian worlds

In fact the world above is not exactly an MDP.
P ( xt+1=bump   |   xt=st,   at=bt,   xt-1=st-1,   at-1=bt-1,   ... )

P ( bump | wall is N, move N, wall is N, move N, wall is N, move N )
is higher than:
P ( bump | wall is N, move N, wall not visible, move N, wall not visible, move N )

i.e. There isn't simply a constant:
P ( bump | wall is N, move N )

Non-stationary probabilistic transitions tex2html_wrap_inline6281 . The states' positions in the diagram correspond to their geographical positions. x is the state "food is EastNorthEast". y is the state "food is East". The agent sees both y1 and y2 as the same state y. The quantities on the arrows are the rewards for that transition.

tex2html_wrap_inline6907 leads to state y, and tex2html_wrap_inline6911 also leads to state y. But the probability of tex2html_wrap_inline6915 leading to an immediate reward depends on the agent's policy. If it tends to take action tex2html_wrap_inline6911 , the value of y will be higher than if it tends to take action tex2html_wrap_inline6907 . But it will not necessarily learn to take action tex2html_wrap_inline6911 as a result - that would just be normal learning, where it can tell the states apart. tex2html_wrap_inline6907 is just as likely a policy because tex2html_wrap_inline6907 is also rewarded when the value of y increases.

tex2html_wrap_inline6281 is non-stationary. In fact, the world is what is called a Partially-Observable MDP.

Q. What simple change would turn the above into a MDP (in fact, a deterministic MDP)?

Deterministic worlds

The simple deterministic world is a special case with all transition probabilities equal to 1 or 0. For any pair x,a, there will be a unique state tex2html_wrap_inline6391 and a unique reward tex2html_wrap_inline6393 such that:


Deterministic and probabilistic worlds

  1. Deterministic world that looks deterministic.
  2. Deterministic world that looks probabilistic.

  3. Probabilistic world that looks probabilistic.
  4. Probabilistic world that looks deterministic.

Action = "no-op"

The set of actions available may differ from state to state. Depending on what we mean by this, this may also possibly be represented within the model, as the special case of an action that when executed (in this particular state, not in general) does nothing:


for all actions a in the "unavailable" set for x .

Note this would not exactly be "do nothing" since you would still get the reward for staying in x. It could therefore be a useful action!

Expected Reward

In this:

How do we decide which action to take?
Look at expected reward:
Ea(r) = 0.2 (5) + 0.8 (1)
          = 1.8
Eb(r) = 1 (1.5)
          = 1.5
Action a is better.

What do the probabilities mean?
They mean that if you take a 100 times, you expect 20 5's and 80 1's. Total reward 180.
If you take b 100 times you expect 100 1.5's. Total reward 150.

Maximise E(r).

When we take action a in state x, the reward we expect to receive is:


rmin ≤ E(r) ≤ rmax

Rewards r are bounded by tex2html_wrap_inline6455 , where tex2html_wrap_inline6457 ( tex2html_wrap_inline6459 would be a system where the reward was the same no matter what action was taken. The agent would always behave randomly and would be of no use or interest). Hence for a given x,a, tex2html_wrap_inline6463 .

E(r) = p1 r1 + ... + pn rn
≥ p1 rmin + ... + pn rmin
= 1 . rmin

E(r) exists, E(y) doesn't exist

Imagine a state definition for a robot carrying blocks to put in a bin:
x = (x1,x2)
x1 = what you are carrying: 0=not carrying block, 1=carrying block
x2 = direction of bin, values 0-9

You are in state x = (1,5).
Imagine that if you take some action a, you go to state y with probabilities:

y               Pxa(y)          reward

(1,8)           0.4             10
(0,8)           0.3             5
(1,5)           0.2             1
(0,5)           0.1             0

You can calculate E(r).
Q. What is it?

But "E(y)" is meaningless.
Consider the sum of the y's times their probabilities
= (1,8) 0.4 + (0,8) 0.3 + (1,5) 0.2 + (0,5) 0.1
= (0.6, 7.1)

You are never in this state when you take a.
In fact, this state doesn't exist.
"0.6" doesn't have a meaning.
The state space is not a continuous space. It is a discrete space of points.


Typically, we don't reward for the correct action, r = r(x,a):

in state x
if took action b
  then reward 1
  reward 0
because we don't know the correct action! That's why we're learning.
Instead we reward on getting to some state (somehow, we don't know how), r = r(x,y):
in state x
if got to state z
  then reward 1
  reward 0

Writing r = r(x,y), the probability of a particular reward r is:

and the expected reward becomes:


That is, normally we never think in terms of tex2html_wrap_inline6281 - we only think in terms of tex2html_wrap_inline6279 .

e.g. In the following:

Taking action a in state x,
P(r=5) = 0.2
P(r=0) = 0.4
P(r=3) = 0.3 + 0.1
          = 0.4
Question - What is E(r)?

Reward on Continuous or Transition?

In some models, rewards are associated with states, rather than with transitions, that is, r = r(y). The agent is not just rewarded for arriving at state y - it is also rewarded continually for remaining in state y. This is obviously just a special case of r = r(x,y) with:


and hence:


Reward on transition to good state only: r = r(x,y).
e.g. x = not at waterhole, y = at waterhole. Problem - may leave state to come back in!

Reward for arriving and staying in good state: r = r(y).

ancientbrain.com      w2mind.org      humphrysfamilytree.com

On the Internet since 1987.

Wikipedia: Sometimes I link to Wikipedia. I have written something In defence of Wikipedia. It is often a useful starting point but you cannot trust it. Linking to it is like linking to a Google search. A starting point, not a destination. I automatically highlight in red all links to Wikipedia and Google search and other possibly-unreliable user-generated content.