CA425 Practical
Express the 3 x 3
noughts-and-crosses
(or "Tic-tac-toe"
or "X's and O's") game
as a Reinforcement Learning problem.
Your "opponent" will be a random AI player.
- Define the space of states x.
e.g. Let x = (x1,..,x9) represent the 9 squares.
Each xi takes value:
0 (blank), 1 (me) or 2 (opponent).
Start at x = (0,0,0,0,0,0,0,0,0).
I make first move (one empty square becomes "1").
Then the opponent makes a random move into an empty square (becomes "2").
Then I observe the new state.
And can take a new action, and so on.
Note from my viewpoint the world is probabilistic
(same action a in same x can lead to different y
after opponent moves).
- Define the space of actions a.
e.g. Let a = (a1).
a1 takes value 0 to 8 (meaning put "1" into one of the 9 squares).
When an action is illegal (put "1" into non-empty square)
you could define it as just leaving you in same state x,
ready to take new action.
Problem: Might prefer to stay in same state x forever if alternative is
opponent wins!
So you could punish the action instead (e.g. lose game)
to make it definitely learn not to take it.
In different states x, different actions a are illegal.
Perhaps too complex to define all these in advance.
We could just learn them all by trial and error.
- Define an array to hold Q(x,a) values.
See
Coding the state-space as a lookup-table.
See
Sample code for lookup-table Q-learning.
39 states = 19,683 states
We need 19,683 x 9 boxes for Q(x,a) table = 177,147 boxes.
Run
testEnumeration.
- Define the reward function.
r for getting to one of 3 classes of lines.
3 horizontal lines or 3 vertical lines or 2 diagonals.
Rather than define in advance all possible states y where can get r,
might be easier to just write a function "isGoal(state)"
to check if any given state y is goal.
Punishment if opponent wins.
Zero if draw.
- Set it up so you can do a run (until end game, maximum of 9 steps)
and can plug in any learning algorithm to modify the Q-values.
- Start by modifying the Q-values as follows.
When you get a reward or punishment:
Q(x,a) := r
- Demo it.
Show how well it plays.
The above is for 90 percent. For the last 10 percent do:
- First, modify it to deal with the fact that sometimes you get a reward / punishment,
sometimes you don't (because opponent is random).
Build up a running average of all feedback:
Q(x,a) := (1-α) Q(x,a)
+ α r
where α
goes 1, 1/2, 1/3, ...
i.e.
α
is 1/n,
where n is the number of times you have updated this particular pair Q(x,a).
i.e. You need to keep another array n(x,a) to remember all these.
- Then modify it to look into the future, so we can move towards a goal state
from far away:
Q(x,a) := (1-α) Q(x,a) +
α
( r + γ c )
where c = best Q-value in next state
- Demo it now.
Show how well it plays.
To hand up:
- Write it in any language.
-
Hand up a commented printout of your program.
(Hint: Colour printout in landscape mode
is normally the best way to print out code.)
- Hand up a report showing what your program does, with output and screen shots.