CA425 Practical

Express the 3 x 3 noughts-and-crosses (or "Tic-tac-toe" or "X's and O's") game as a Reinforcement Learning problem.

The program will play against another copy of itself. That is, at first you will be playing against a random player. As you learn, your opponent gets better.


  1. Define the space of states x.
    e.g. Let x = (x1,..,x9) represent the 9 squares.
    Each xi takes value: 0 (blank), 1 (me) or 2 (opponent).
    Start at x = (0,0,0,0,0,0,0,0,0).
    I make first move (one empty square becomes "1"). Then the opponent makes a move into an empty square (becomes "2"). Then I observe the new state. And can take a new action, and so on.
    Note from my viewpoint the world is probabilistic (same action a in same x can lead to different y after opponent moves).

    Observe x
    Take a
    (Opponent makes move)
    Observe y
    

  2. Define the space of actions a.
    e.g. Let a = (a1).
    a1 takes value 0 to 8 (meaning put "1" into one of the 9 squares).
    When an action is illegal (put "1" into non-empty square) you could define it as just leaving you in same state x, ready to take new action. Problem: Might prefer to stay in same state x forever if alternative is opponent wins!
    So you could punish the action instead (e.g. lose game) to make it definitely learn not to take it.
    In different states x, different actions a are illegal. Perhaps too complex to define all these in advance. We could just learn them all by trial and error.

  3. Define an array to hold Q(x,a) values.
    See Coding the state-space as a lookup-table.
    See Sample code for lookup-table Q-learning.
    39 states = 19,683 states
    We need 19,683 x 9 boxes for Q(x,a) table = 177,147 boxes.
    Run testEnumeration.

  4. Define the reward function.
    r for getting to one of 3 classes of lines.
    3 horizontal lines or 3 vertical lines or 2 diagonals.
    Rather than define in advance all possible states y where can get r, might be easier to just write a function "isGoal(state)" to check if any given state y is goal.
    Punishment if opponent wins.
    Zero if draw.

  5. Set it up so you can do a run (until end game, maximum of 9 steps) and can plug in any learning algorithm to modify the Q-values.

  6. Start by modifying the Q-values as follows. When you get a reward or punishment:
    Q(x,a) := r

  7. Demo it playing a random opponent. Show how well it plays.

The above is for 90 percent. For the last 10 percent do:

  1. First, modify it to deal with the fact that sometimes you get a reward / punishment, sometimes you don't (because opponent is unpredictable). Build up a running average of all feedback:
    Q(x,a) := (1-α) Q(x,a) + α r
    where α goes 1, 1/2, 1/3, ...
    i.e. α is 1/n, where n is the number of times you have updated this particular pair Q(x,a).
    i.e. You need to keep another array n(x,a) to remember all these.

  2. Then modify it to look into the future, so we can move towards a goal state from far away:
    Q(x,a) := (1-α) Q(x,a) + α ( r + γ c )
    where c = best Q-value in next state

  3. You could save Q-values to file, and read from file on startup, so don't start blank each time program is run.

  4. Demo it playing a random opponent. Show how well it plays.


To hand up:

  1. Write it in any language.
  2. Hand up a commented printout of your program.
    Colour, syntax-highlighted printout in landscape mode (the best way to print code.)
  3. Your program on CD.
  4. Hand up a report showing what your program does, with output and screen shots.
  5. Give it direct to me or leave it for me in the school office.