2017-11-25

FrozenLake Q-Learning Update Problem

I am learning Q-learning and trying to build a Q-learner for the FrozenLake-v0 problem in OpenAI Gym. Since the problem has only 16 states and 4 possible actions, it should be fairly easy, but it looks like my algorithm is not updating the Q-table correctly.

The following is my Q-learning algorithm:

import gym 
import numpy as np 
from gym import wrappers 


def run(
    env, 
    Qtable, 
    N_STEPS=10000, 
    alpha=0.2, # learning rate (weight given to the new estimate)
    rar=0.4, # random exploration rate 
    radr=0.97 # decay rate 
): 

    # Initialize pars:: 
    TOTAL_REWARD = 0 
    done = False 
    action = env.action_space.sample() 
    state = env.reset() 

    for _ in range(N_STEPS):
        if done:
            print('TW', TOTAL_REWARD)
            break

        s_prime, reward, done, info = env.step(action)
        # Update Q Table:
        Qtable[state, action] = (1 - alpha) * Qtable[state, action] + alpha * (reward + Qtable[s_prime, np.argmax(Qtable[s_prime, :])])

        # Prepare for the next step:
        # Next New Action (explore with probability rar, otherwise act greedily):
        if np.random.uniform(0, 1) < rar:
            action = env.action_space.sample()
        else:
            action = np.argmax(Qtable[s_prime, :])

        # Update new state:
        state = s_prime
        # Update Decay:
        rar *= radr
        # Update Stats
        TOTAL_REWARD += reward
        if reward > 0:
            print(reward)

    return Qtable, TOTAL_REWARD 
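
To make the update line easier to check by hand, here is a minimal pure-Python sketch of the same rule in isolation. The helper name `q_update` and the toy all-zero row are made up for illustration and are not part of the code above. Like the line in `run`, it uses no discount factor. It suggests that with an all-zero table, an entry can only change on a step that returns a nonzero reward.

```python
def q_update(q_row_next, q_old, reward, alpha=0.2):
    """One tabular update in the same form as the line in run().

    q_row_next: Q-values of the next state (hypothetical stand-in for Qtable[s_prime, :])
    q_old:      current value of Qtable[state, action]
    """
    # Qtable[s_prime, argmax(Qtable[s_prime, :])] is simply the row maximum.
    target = reward + max(q_row_next)
    return (1 - alpha) * q_old + alpha * target


zero_row = [0.0, 0.0, 0.0, 0.0]

# Zero reward and an all-zero next row: the entry stays at zero.
print(q_update(zero_row, q_old=0.0, reward=0.0))

# A reward of 1 moves the entry by alpha * reward.
print(q_update(zero_row, q_old=0.0, reward=1.0))
```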

The Q-learner is then run for 1000 iterations:

if __name__=="__main__": 
    # Required Pars: 
    N_ITER = 1000 
    REWARDS = [] 
    # Setup the Maze: 
    env = gym.make('FrozenLake-v0') 

    # Initialize Qtable: 
    num_actions = env.unwrapped.nA 
    num_states = env.unwrapped.nS 
    # Qtable = np.random.uniform(0, 1, size=num_states * num_actions).reshape((num_states, num_actions)) 
    Qtable = np.zeros((env.observation_space.n, env.action_space.n)) 

    for _ in range(N_ITER):
        res = run(env, Qtable)
        Qtable = res[0]
        REWARDS.append(res[1])
    print(np.mean(REWARDS)) 
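
As a rough way to see how quickly exploration fades inside a single call to `run`, the decay can be tabulated on its own. This is only a sketch: `exploration_rate` is a hypothetical helper, and its defaults are copied from the `rar` and `radr` defaults above. Since `rar` is a parameter of `run`, every call in the loop starts again from that default.

```python
def exploration_rate(step, rar=0.4, radr=0.97):
    """Exploration probability used at a given step of one run() call.

    run() multiplies rar by radr once per step, so the value used at
    step k (counting from 0) is rar * radr**k.
    """
    return rar * radr ** step


# Print the exploration probability at a few step counts.
for step in (0, 10, 50, 100):
    print(step, round(exploration_rate(step), 4))
```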

Any advice would be appreciated!

Answer