Q-learning to determine the best path to the goal state

longmen2022

Hi everyone, I am using Q-learning in Python to determine the best action for each square of a 4x4 board. I have living_reward = -0.1, goal_reward = 100, forbidden_reward = -100, discount rate gamma = 0.1, learning rate alpha = 0.5, greedy probability epsilon = 0.3, and max iterations = 10000. When several actions share the same maximum Q-value (for example up and right), the final policy is printed using a clockwise priority (up, right, down, left). I feed the following input to my code:

12 7 5 6 p

and it prints out the following output:

1 right
2 right
3 up
4 left
5 forbid
6 wall-square
7 goal
8 left
9 up
10 left
11 left
12 goal
13 right
14 left
15 left
16 left
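
Reading the input against this output, the first two numbers appear to be the goal squares (12 and 7), the third the forbidden square (5), and the fourth the wall square (6). A minimal sketch of that reading follows; the function name parse_board_input is hypothetical, and treating the trailing p as a "print policy" mode flag is an assumption, since the post does not define it.

```python
def parse_board_input(line):
    """Split a line such as '12 7 5 6 p' into its parts.

    Assumed layout (inferred from the sample outputs, not confirmed):
    goal, goal, forbidden, wall, mode flag.
    """
    parts = line.split()
    goal_a, goal_b, forbidden, wall = (int(p) for p in parts[:4])
    return {
        "goals": {goal_a, goal_b},
        "forbidden": forbidden,
        "wall": wall,
        "mode": parts[4],  # 'p' is assumed to mean "print the policy"
    }


print(parse_board_input("12 7 5 6 p"))
```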

However, I expect it to print this output:

1 right
2 right
3 up
4 up
5 forbid
6 wall-square
7 goal
8 up
9 up
10 up
11 up
12 goal
13 up
14 up
15 up
16 up
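
The squares where the two listings disagree can be found mechanically rather than by eye. The sketch below assumes each listing is available as a multi-line string in the "square action" format shown above; mismatched_squares is a hypothetical helper, not part of the posted code.

```python
def mismatched_squares(actual, expected):
    """Return the square numbers whose action differs between two listings."""

    def to_dict(text):
        # Each line looks like "4 left" or "6 wall-square".
        rows = (line.split(maxsplit=1) for line in text.strip().splitlines())
        return {int(num): action for num, action in rows}

    got, want = to_dict(actual), to_dict(expected)
    return [sq for sq in sorted(want) if got.get(sq) != want[sq]]


actual = "1 right\n2 right\n3 up\n4 left"
expected = "1 right\n2 right\n3 up\n4 up"
print(mismatched_squares(actual, expected))  # [4]
```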

It goes wrong at indices 4, 8, 10, 11, 13, 14, 15, and 16.

For the second input

15 12 8 6 p

it prints out this

1 up
2 right
3 up
4 left
5 up
6 wall-square
7 up
8 forbid
9 up
10 up
11 up
12 goal
13 right
14 right
15 goal
16 left

while I am looking for this output

1 up
2 right
3 up
4 left
5 up
6 wall-square
7 up
8 forbid
9 up
10 up
11 up
12 goal
13 right
14 right
15 goal
16 up

It goes wrong at index 16: instead of going up, it goes left. Could anyone advise what is wrong with my code? I am providing my code below. Any advice would be much appreciated! Thanks
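
Most of the mismatched squares print left where up is expected, so the tie-breaking step is one place to look. The following is a minimal sketch of the clockwise priority rule described above, assuming the four Q-values of a square are held in a dict keyed by action name; best_action, PRIORITY, and the tolerance tol are hypothetical names, not taken from the code below.

```python
# Clockwise priority from the post: up, right, down, left.
PRIORITY = ("up", "right", "down", "left")


def best_action(q_by_action, tol=1e-9):
    """Return the highest-valued action, breaking ties in PRIORITY order.

    q_by_action is assumed to map each action name to its Q-value.
    tol treats nearly equal floats as tied (an illustrative choice).
    """
    best = max(q_by_action[a] for a in PRIORITY)
    for action in PRIORITY:
        if abs(q_by_action[action] - best) <= tol:
            return action


# up, right, and left are tied, so the clockwise rule picks up.
print(best_action({"up": 2.0, "right": 2.0, "down": -1.0, "left": 2.0}))  # up
```

If the policy is instead read from a list indexed by the Direction values, note that left = 0 comes first there, so a first-maximum lookup such as list.index(max(...)) would prefer left on ties. Whether the posted code does this is not visible in the listing.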

import random
import numpy as np
import enum

EACH_STEP_REWARD = -0.1
GOAL_SQUARE_REWARD = 100
FORBIDDEN_SQUARE_REWARD = -100
DISCOUNT_RATE_GAMMA = 0.1 # Discount Rate
LEARNING_RATE_ALPHA = 0.3 # Learning Rate
GREEDY_PROBABILITY_EPSILON = 0.5 # Greedy Probability
ITERATION_MAX_NUM = 10000 # Will be 10,000
START_LABEL = 2
LEVEL = 4

class Direction(enum.Enum):
    up = 1
    right = 2
    down = 3
    left = 0
    exit = 4

class Node:
    def __init__(self, title, next, Goal=False, Forbidden=False, Wall=False, qValues=None, actions=None):
        self.title = title
        self.next = next
        self.qValues = [qValues] * 5
        self.move = [actions] * 5
        self.goal = Goal
        self.forbidden = Forbidden
        self.wall = Wall

def max_Q_value(self):
    if self.wall:
        return False
    max_q = []
    for q in self.qValues: