Express the 3 x 3 Tic-Tac-Toe
(or "X's and O's") game
as a Reinforcement Learning problem.
The program will play against another copy of itself.
That is, at first you will be playing against
a random player.
As you learn, your opponent gets better.
- Define the space of states x.
e.g. Let x = (x1,..,x9) represent the 9 squares.
Each xi takes value:
0 (blank), 1 (me) or 2 (opponent).
Start at x = (0,0,0,0,0,0,0,0,0).
I make first move (one empty square becomes "1").
Then the opponent makes a move into an empty square (becomes "2").
Then I observe the new state.
And can take a new action, and so on.
Note from my viewpoint the world is probabilistic
(same action a in same x can lead to different y
after opponent moves).
- Define the space of actions a.
e.g. Let a = (a1).
a1 takes value 1 to 9 (meaning put "1" into one of the 9 squares).
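The state and action spaces above can be sketched as follows (a minimal sketch; the names `start_state` and `apply_action` are my own, not prescribed):

```python
# State x is a 9-tuple; each square is 0 (blank), 1 (me) or 2 (opponent).
# Action a is a number 1..9, meaning "put my mark into that square".

def start_state():
    """Empty board: x = (0,0,0,0,0,0,0,0,0)."""
    return (0,) * 9

def apply_action(x, a, player):
    """Return the new state after 'player' (1 or 2) marks square a (1..9)."""
    squares = list(x)
    squares[a - 1] = player
    return tuple(squares)

x = start_state()
x = apply_action(x, 5, 1)    # I move into the centre square
print(x)                     # (0, 0, 0, 0, 1, 0, 0, 0, 0)
```

Tuples are used (rather than lists) so states can later serve directly as dictionary keys.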
- How to deal with illegal actions? (e.g. Put "1" into non-empty square.)
In different states x, different actions a are illegal.
Rather complex to define all these in advance.
- The agent learns that an illegal action leaves you in the same state x,
ready to take an action again:
    repeat
        a = getSuggestedAction(x)
    until ( validAction(x,a) )
- Punish the action (e.g. lose game)
to make it learn not to take it.
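The first option (retry until legal) might look like this. `getSuggestedAction` is a stand-in here for whatever algorithm suggests moves; a random one is used just to make the sketch run:

```python
import random

def validAction(x, a):
    """Action a (1..9) is legal only if square a is blank."""
    return x[a - 1] == 0

def getSuggestedAction(x):
    """Placeholder: a random action, possibly illegal."""
    return random.randint(1, 9)

def getLegalAction(x):
    """Keep asking until the suggested action is legal."""
    a = getSuggestedAction(x)
    while not validAction(x, a):
        a = getSuggestedAction(x)
    return a

x = (1, 2, 0, 0, 0, 0, 0, 0, 0)
a = getLegalAction(x)
print(a)   # some square in 3..9
```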
- Define when a win happens.
Rather than define in advance all possible win states,
it might be easier to write a function "isWin(y)"
to check if any given state y is a win.
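A sketch of such a function: rather than listing all win states, check the 8 lines of the board. (An extra `player` argument is added here, beyond the `isWin(y)` signature above, so one function covers both sides.)

```python
# The 8 winning lines of the board (squares indexed 0..8, row by row).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def isWin(y, player):
    """True if 'player' (1 or 2) has three in a row in state y."""
    return any(all(y[i] == player for i in line) for line in LINES)

def isDraw(y):
    """True if the board is full and nobody has won."""
    return 0 not in y and not isWin(y, 1) and not isWin(y, 2)

y = (1, 1, 1,
     2, 2, 0,
     0, 0, 0)
print(isWin(y, 1))   # True
```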
- Set it up so you can do a run (until end game, maximum of 9 steps)
and are ready to plug in a learning algorithm.
- Play it first with an algorithm that suggests random actions.
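A minimal sketch of one such run, with both sides playing random legal moves until a win or the board is full (helper functions are redefined locally so the sketch is self-contained):

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def isWin(y, p):
    return any(all(y[i] == p for i in line) for line in LINES)

def randomAction(x):
    """Pick a random legal action (some empty square)."""
    return random.choice([a for a in range(1, 10) if x[a - 1] == 0])

def playGame():
    """One run: return 1 or 2 for the winner, 0 for a draw."""
    x = (0,) * 9
    player = 1                    # I make the first move
    for step in range(9):         # maximum of 9 steps
        a = randomAction(x)
        x = x[:a - 1] + (player,) + x[a:]
        if isWin(x, player):
            return player
        player = 3 - player       # swap 1 <-> 2
    return 0                      # board full: draw

print(playGame())
```

The learning algorithm can later be plugged in by replacing `randomAction` for player 1.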
The above is for 40 percent. For more:
- Next, suggest actions based on Q-values.
- Define an array to hold Q(x,a) values.
Coding the state-space as a lookup-table.
Sample code for lookup-table Q-learning.
3^9 = 19,683 states.
We need 19,683 x 9 boxes for the Q(x,a) table = 177,147 boxes.
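One way to code the state-space as a lookup-table: read the 9 squares as a base-3 number, which gives each state a unique index from 0 to 3^9 - 1 = 19,682 (a sketch; the name `stateIndex` is my own):

```python
def stateIndex(x):
    """Encode state x (9-tuple of 0/1/2) as a base-3 number: 0 .. 19682."""
    index = 0
    for xi in x:
        index = index * 3 + xi
    return index

N_STATES = 3 ** 9      # 19,683
N_ACTIONS = 9

# The Q(x,a) table: 19,683 x 9 = 177,147 boxes, initialised to 0.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

print(stateIndex((0,) * 9))   # 0
print(stateIndex((2,) * 9))   # 19682
```

Note that most of these indices correspond to unreachable boards (e.g. all squares "2"), but the table is small enough that this waste does not matter.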
- Define reward function.
Consider win, loss, draw, game not over.
Interim rewards? (e.g. For 2 in a row.)
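One possible reward function (the particular numbers here are assumptions, not prescribed by the assignment):

```python
# Outcomes of a step, from my viewpoint.
WIN, LOSS, DRAW, NOT_OVER = "win", "loss", "draw", "not over"

def reward(outcome):
    """Win = +1, loss = -1, draw and game-not-over = 0."""
    if outcome == WIN:
        return 1.0
    if outcome == LOSS:
        return -1.0
    return 0.0

print(reward(WIN))   # 1.0
```

Interim rewards (e.g. a small bonus for 2 in a row) would be extra cases in this function.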
- Start your Q-learning by modifying the Q-values as follows.
When you get a reward or punishment:
Q(x,a) := r
- Learn playing against itself.
- Demo it playing a random opponent.
Show how well it plays.
The above is for 80 percent. For more:
- Modify it to look into the future, so we can move towards a goal state
from far away:
Q(x,a) := r + γ c
where c = best Q-value over all actions in the next state
- Modify it to deal with the fact that sometimes you get a reward / punishment,
sometimes you don't (because opponent is unpredictable).
Build up a running average of all feedback:
Q(x,a) := (1-α) Q(x,a) + α ( r + γ c )
where α = 1/n goes 1, 1/2, 1/3, ...
where n is the number of times you have updated this particular pair Q(x,a).
i.e. You need to keep another array n(x,a) to remember all these.
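The two modifications above can be sketched together. Here Q and n are dictionaries keyed by (state index, action), and γ = 0.9 is an assumed value:

```python
GAMMA = 0.9        # discount factor (assumed value)

Q = {}             # Q(x,a) values, default 0
n = {}             # n(x,a): how many times each pair has been updated

def bestQ(y):
    """c = best Q-value over all 9 actions in next state y."""
    return max(Q.get((y, b), 0.0) for b in range(1, 10))

def update(x, a, r, y):
    """Running-average update: Q(x,a) := (1-α) Q(x,a) + α (r + γ c), α = 1/n."""
    n[(x, a)] = n.get((x, a), 0) + 1
    alpha = 1.0 / n[(x, a)]
    c = bestQ(y)
    old = Q.get((x, a), 0.0)
    Q[(x, a)] = (1 - alpha) * old + alpha * (r + GAMMA * c)

update(0, 5, 1.0, 1)   # first update: alpha = 1, so Q := r + γ c
print(Q[(0, 5)])       # 1.0 (next state has no Q-values yet, so c = 0)
```

On the first update α = 1, which reproduces the earlier rule Q(x,a) := r; later updates average in new feedback with weight 1/n.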
- Save Q-values to file, and read from file on startup,
so don't start blank each time program is run.
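A sketch of saving and restoring the Q-values, using a dictionary keyed by (state index, action); the filename and JSON format are my own choices:

```python
import json

def saveQ(Q, filename="qvalues.json"):
    """Write Q-values to file. JSON keys must be strings, so (x, a) -> "x,a"."""
    with open(filename, "w") as f:
        json.dump({"%d,%d" % k: v for k, v in Q.items()}, f)

def loadQ(filename="qvalues.json"):
    """Read Q-values back; start blank if the file does not exist yet."""
    try:
        with open(filename) as f:
            raw = json.load(f)
    except FileNotFoundError:
        return {}
    return {tuple(int(s) for s in k.split(",")): v for k, v in raw.items()}

Q = {(0, 5): 1.0}
saveQ(Q)
print(loadQ())   # {(0, 5): 1.0}
```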
Not this year: Use WWM World
An alternative would be to avoid writing the World,
and instead write an RL Mind for the World
by Szymon Zielinski
on the WWM server.
You cannot learn online because you would
need a persistent data structure
to hold Q-values,
so that you can learn from multiple runs.
See if you can learn your Q-values offline.
And then you can upload your final Mind with Q-values filled in
to run online.
To hand up:
- Write it in any language.
Hand up a commented printout of your program.
Print it in landscape mode
(the best way to print code).
- Hand up a report showing what your program does, with output and screen shots.
- Show Q-values half-way through learning.
Show Q-values at end.
Explain what it learned.
- Show stats of how well your program plays against a random player.
- Give your write-up direct to me or leave it for me in the school office.