Solving a Maze Game with the Sarsa Reinforcement Learning Algorithm: A Line-by-Line Code Walkthrough

2023-11-01

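This article is based on my notes from Baidu's 7-day introductory reinforcement learning course.
Thanks to Li Kejiao (李科浇) of the Baidu PARL team for teaching the course.

Solving a Maze Game with the Sarsa Reinforcement Learning Algorithm

1. Install dependencies

Install Gym, the environment library used in reinforcement learning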

pip install gym

2. Import dependencies

import gym
import numpy as np
import time # used to pause the program so the rendered frames are easier to follow

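3. The agent's algorithm: Sarsa

  • The agent is the entity that interacts with the environment. It:
    • observes the current state
    • chooses an action based on the current state
    • updates the Q table based on the outcome of that action
  • predict(): takes an observation (that is, the state) and returns the "predicted" action, meaning the best action according to the current Q table
    • look up the Q values of all actions available in the current state
    • collect every action whose Q value equals the maximum into a list
    • this list holds the candidate best actions
    • pick one action from that list at random
  • sample(): adds ε-greedy exploration on top of predict() and returns the action that is "actually" taken
    • uses the epsilon-greedy strategy
    • with probability 1 - ε (90% when e_greed = 0.1), choose the best action
    • with probability ε (10%), choose a uniformly random action
  • learn(): takes one transition of training data and performs one update of the Q table
    • the value being updated is the Q value of the action taken in the previous state obs
    • if the episode has ended, the target Q value is just the reward
    • if the episode has not ended, the target combines the reward with the discounted Q value of the next state and next action
    • the learning rate lr limits how far each update moves the old Q value toward the target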
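The learn() rule described in the list above is the standard Sarsa temporal-difference update. Written out, with α standing for the learning rate lr and γ for gamma (the symbols are chosen here for illustration; the course material may use different notation):

```latex
\begin{aligned}
\text{target} &=
\begin{cases}
r_t, & \text{if the episode ended at step } t \\
r_t + \gamma\, Q(s_{t+1}, a_{t+1}), & \text{otherwise}
\end{cases} \\
Q(s_t, a_t) &\leftarrow Q(s_t, a_t) + \alpha \bigl(\text{target} - Q(s_t, a_t)\bigr)
\end{aligned}
```

Because a_{t+1} is the action the agent will really take next (chosen by the same ε-greedy rule), the update follows the behaviour policy, which is why Sarsa is called on-policy.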
class SarsaAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # number of available actions
        self.lr = learning_rate # learning rate
        self.gamma = gamma      # discount factor: how much later Q values affect earlier ones
        self.epsilon = e_greed  # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n))

    # Sample an action for the given observation (with 10% exploration)
    def sample(self, obs):
        if (np.random.uniform(0, 1) < 1 - self.epsilon): # 90% of the time
            action = self.predict(obs) # take the best action
        else: # 10% of the time
            action = np.random.choice(self.act_n) # take a random action
        return action

    # Predict the best action for the given observation
    def predict(self, obs):
        Q_list = self.Q[obs, :] # Q values of all actions in the current state
        maxQ = np.max(Q_list) # largest value in that list
        action_list = np.where(Q_list == maxQ)[0] # actions with the largest Q value are the best actions
        action = np.random.choice(action_list) # pick one of the best actions at random
        return action

    # Learning step, i.e. the Q-table update
    def learn(self, obs, action, reward, next_obs, next_action, done):
        """ on-policy
            obs: observation before the interaction, s_t
            action: action chosen for this interaction, a_t
            reward: reward received for this action, r
            next_obs: observation after the interaction, s_t+1
            next_action: action that will be taken in next_obs according to the current Q table, a_t+1
            done: whether the episode has ended
        """
        predict_Q = self.Q[obs,action] # Q value of the chosen action in the state before the interaction
        if (done): # episode ended
            target_Q = reward # the target Q value is just the reward
        else: # episode continues
            target_Q = reward + self.gamma * self.Q[next_obs, next_action]
            # combine the reward with the Q value of the next action in the next state to get the target
        self.Q[obs,action] += self.lr * (target_Q - predict_Q) # lr scales the size of the correction

    # Save the Q table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')
    
    # Load the Q table from a file
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')
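To make the learn() arithmetic concrete, here is a small worked sketch of two updates on an otherwise empty Q table. The standalone function sarsa_update and the starting value 0.5 are made up for illustration; the update rule itself is the same as in learn(), with lr = 0.1 and gamma = 0.9.

```python
import numpy as np

def sarsa_update(Q, obs, action, reward, next_obs, next_action, done, lr=0.1, gamma=0.9):
    """Apply one Sarsa update in place and return the new Q[obs, action]."""
    predict_q = Q[obs, action]
    if done:
        target_q = reward  # terminal step: nothing to bootstrap from
    else:
        target_q = reward + gamma * Q[next_obs, next_action]
    Q[obs, action] += lr * (target_q - predict_q)
    return Q[obs, action]

Q = np.zeros((16, 4))
Q[14, 2] = 0.5  # illustrative value: pretend "right" in cell 14 was already learned

# Non-terminal step: move right from cell 13 to cell 14, reward 0
mid = sarsa_update(Q, 13, 2, 0.0, 14, 2, done=False)
# Terminal step: move right from cell 14 into the goal, reward 1
last = sarsa_update(Q, 14, 2, 1.0, 15, 0, done=True)
print(mid, last)
```

The first update moves Q[13, 2] from 0 toward 0.9 * 0.5, giving about 0.045; the second moves Q[14, 2] from 0.5 toward 1, giving about 0.55 (both up to floating-point rounding). This is how the reward at the goal gradually propagates back toward the start.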

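4. Training and testing functions

For every episode, record the number of steps total_steps and the cumulative reward total_reward

The Q table is updated at every step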

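def run_episode(env, agent, render=False):
    total_steps = 0 # number of steps taken in this episode
    total_reward = 0 # cumulative reward of this episode

    obs = env.reset() # reset the environment and start a new episode
    action = agent.sample(obs) # choose an action with the algorithm

    while True:
        next_obs, reward, done, _ = env.step(action) # interact with the environment by executing the action
        next_action = agent.sample(next_obs) # choose the next action with the algorithm
        # Train the Sarsa agent
        agent.learn(obs, action, reward, next_obs, next_action, done)
        # obs (state before the action) and action give the predicted Q value
        # reward, next_obs (state after the action) and next_action give the target Q value
        # done tells learn() whether the episode has ended

        action = next_action # move on to the next action
        obs = next_obs  # move on to the next state
        total_reward += reward # accumulate the reward
        total_steps += 1 # count the step
        if render: # render only if requested
            env.render() # draw a new frame
        if done: # episode ended
            break # leave the loop, i.e. finish this episode
    return total_reward, total_steps # return cumulative reward and step count

def test_episode(env, agent):
    total_reward = 0 # cumulative reward
    obs = env.reset() # reset the environment; obs is the initial observation, i.e. the initial state
    while True:
        action = agent.predict(obs) # greedy: always choose the best action
        next_obs, reward, done, _ = env.step(action) # get the new state, the reward, and whether the episode ended
        total_reward += reward # accumulate the reward
        obs = next_obs # move on to the next state
        time.sleep(0.5) # pause so the rendered frames can be followed
        env.render() # draw the frame
        if done: # episode ended
            break # leave the loop
    return total_reward # return the final cumulative reward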
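The two functions above use the older Gym interface, where reset() returns only the observation and step() returns four values. To my knowledge, Gym 0.26 and later (and Gymnasium) return (obs, info) from reset() and five values (obs, reward, terminated, truncated, info) from step(). The helpers below are a hedged sketch for readers on such a version; reset_env, step_env and the two stub classes are hypothetical names that are not part of the course code or of Gym.

```python
def reset_env(env):
    """Return only the observation, whichever reset() convention env uses."""
    result = env.reset()
    # Newer API returns (obs, info). FrozenLake observations are plain ints,
    # so a tuple here can only be that pair. Do not reuse this check for
    # environments whose observations are themselves tuples.
    if isinstance(result, tuple):
        return result[0]
    return result

def step_env(env, action):
    """Return (next_obs, reward, done), whichever step() convention env uses."""
    result = env.step(action)
    if len(result) == 5:  # newer API: obs, reward, terminated, truncated, info
        next_obs, reward, terminated, truncated, _ = result
        return next_obs, reward, terminated or truncated
    next_obs, reward, done, _ = result  # older API: obs, reward, done, info
    return next_obs, reward, done

# Stand-in environments, only to show both conventions side by side
class OldStyleStub:
    def reset(self):
        return 0
    def step(self, action):
        return 4, 0.0, False, {}

class NewStyleStub:
    def reset(self):
        return 0, {}
    def step(self, action):
        return 15, 1.0, True, False, {}

print(reset_env(OldStyleStub()), step_env(OldStyleStub(), 1))
print(reset_env(NewStyleStub()), step_env(NewStyleStub(), 2))
```

With these helpers, `obs = env.reset()` would become `obs = reset_env(env)` and `next_obs, reward, done, _ = env.step(action)` would become `next_obs, reward, done = step_env(env, action)`.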

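5. Create the environment, instantiate the agent, and run training and testing

Use the Gym library to create the environment we need

Instantiate the SarsaAgent class to create an agent object, setting the hyperparameters at the same time

Train for 500 episodes and look at the result of each one

Run a test once training has finished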

# Create the maze environment with gym; is_slippery=False makes the environment easier
env = gym.make("FrozenLake-v0", is_slippery=False)  # 0 left, 1 down, 2 right, 3 up
# make() creates the requested environment

# Create an agent instance and pass in the hyperparameters
agent = SarsaAgent(
        obs_n=env.observation_space.n, # 16 states, one per cell of the 4*4 grid
        act_n=env.action_space.n, # 4 actions: 0 left, 1 down, 2 right, 3 up
        learning_rate=0.1, # learning rate
        gamma=0.9, # discount applied to the next step's Q value
        e_greed=0.1) # probability of a random action


# Train for 500 episodes and print the score of each episode
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training finished; check how well the algorithm performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))
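The per-episode printout gets long, so a success rate per block of episodes can be easier to read. In FrozenLake the reward is 1 only when the goal is reached, so the average reward over a block equals the fraction of successful episodes. The sketch below assumes the ep_reward values were collected in a list, which the training loop above does not do; windowed_success_rate and the toy data are illustrative only.

```python
def windowed_success_rate(rewards, window=100):
    """Average reward over consecutive, non-overlapping blocks of episodes."""
    rates = []
    for start in range(0, len(rewards), window):
        chunk = rewards[start:start + window]
        rates.append(sum(chunk) / len(chunk))
    return rates

# Toy data only: 4 failed episodes followed by 6 successful ones
rewards = [0.0] * 4 + [1.0] * 6
rates = windowed_success_rate(rewards, window=5)
print(rates)
```

To use it with the real run, one would append ep_reward to a list inside the training loop and call the function once after the loop.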

Output:

Episode 0: steps = 6 , reward = 0.0
Episode 1: steps = 17 , reward = 0.0
Episode 2: steps = 9 , reward = 0.0
Episode 3: steps = 2 , reward = 0.0
Episode 4: steps = 8 , reward = 0.0
Episode 5: steps = 8 , reward = 0.0
Episode 6: steps = 14 , reward = 0.0
Episode 7: steps = 7 , reward = 0.0
Episode 8: steps = 7 , reward = 0.0
Episode 9: steps = 2 , reward = 0.0
Episode 10: steps = 3 , reward = 0.0
Episode 11: steps = 8 , reward = 0.0
Episode 12: steps = 3 , reward = 0.0
Episode 13: steps = 8 , reward = 0.0
Episode 14: steps = 6 , reward = 0.0
Episode 15: steps = 5 , reward = 0.0
Episode 16: steps = 5 , reward = 0.0
Episode 17: steps = 7 , reward = 0.0
Episode 18: steps = 2 , reward = 0.0
Episode 19: steps = 7 , reward = 0.0
Episode 20: steps = 2 , reward = 0.0
Episode 21: steps = 7 , reward = 0.0
Episode 22: steps = 6 , reward = 0.0
Episode 23: steps = 3 , reward = 0.0
Episode 24: steps = 4 , reward = 0.0
Episode 25: steps = 4 , reward = 0.0
Episode 26: steps = 17 , reward = 0.0
Episode 27: steps = 11 , reward = 0.0
Episode 28: steps = 4 , reward = 0.0
Episode 29: steps = 9 , reward = 0.0
Episode 30: steps = 3 , reward = 0.0
Episode 31: steps = 11 , reward = 0.0
Episode 32: steps = 7 , reward = 0.0
Episode 33: steps = 3 , reward = 0.0
Episode 34: steps = 16 , reward = 0.0
Episode 35: steps = 10 , reward = 0.0
Episode 36: steps = 2 , reward = 0.0
Episode 37: steps = 9 , reward = 0.0
Episode 38: steps = 9 , reward = 0.0
Episode 39: steps = 19 , reward = 1.0
Episode 40: steps = 6 , reward = 0.0
Episode 41: steps = 6 , reward = 0.0
Episode 42: steps = 7 , reward = 0.0
Episode 43: steps = 4 , reward = 0.0
Episode 44: steps = 4 , reward = 0.0
Episode 45: steps = 5 , reward = 0.0
Episode 46: steps = 4 , reward = 0.0
Episode 47: steps = 22 , reward = 1.0
Episode 48: steps = 2 , reward = 0.0
Episode 49: steps = 2 , reward = 0.0
Episode 50: steps = 2 , reward = 0.0
Episode 51: steps = 17 , reward = 0.0
Episode 52: steps = 14 , reward = 0.0
Episode 53: steps = 6 , reward = 0.0
Episode 54: steps = 8 , reward = 0.0
Episode 55: steps = 18 , reward = 0.0
Episode 56: steps = 5 , reward = 0.0
Episode 57: steps = 2 , reward = 0.0
Episode 58: steps = 8 , reward = 0.0
Episode 59: steps = 4 , reward = 0.0
Episode 60: steps = 10 , reward = 0.0
Episode 61: steps = 2 , reward = 0.0
Episode 62: steps = 11 , reward = 0.0
Episode 63: steps = 21 , reward = 0.0
Episode 64: steps = 4 , reward = 0.0
Episode 65: steps = 2 , reward = 0.0
Episode 66: steps = 3 , reward = 0.0
Episode 67: steps = 3 , reward = 0.0
Episode 68: steps = 18 , reward = 1.0
Episode 69: steps = 6 , reward = 0.0
Episode 70: steps = 8 , reward = 0.0
Episode 71: steps = 8 , reward = 0.0
Episode 72: steps = 4 , reward = 0.0
Episode 73: steps = 13 , reward = 0.0
Episode 74: steps = 3 , reward = 0.0
Episode 75: steps = 7 , reward = 0.0
Episode 76: steps = 8 , reward = 0.0
Episode 77: steps = 3 , reward = 0.0
Episode 78: steps = 7 , reward = 0.0
Episode 79: steps = 8 , reward = 0.0
Episode 80: steps = 7 , reward = 0.0
Episode 81: steps = 10 , reward = 1.0
Episode 82: steps = 6 , reward = 1.0
Episode 83: steps = 9 , reward = 1.0
Episode 84: steps = 6 , reward = 0.0
Episode 85: steps = 6 , reward = 1.0
Episode 86: steps = 3 , reward = 0.0
Episode 87: steps = 7 , reward = 1.0
Episode 88: steps = 6 , reward = 1.0
Episode 89: steps = 7 , reward = 1.0
Episode 90: steps = 6 , reward = 1.0
Episode 91: steps = 6 , reward = 1.0
Episode 92: steps = 10 , reward = 1.0
Episode 93: steps = 6 , reward = 1.0
Episode 94: steps = 8 , reward = 1.0
Episode 95: steps = 6 , reward = 1.0
Episode 96: steps = 7 , reward = 1.0
Episode 97: steps = 6 , reward = 1.0
Episode 98: steps = 6 , reward = 1.0
Episode 99: steps = 8 , reward = 1.0
Episode 100: steps = 6 , reward = 1.0
Episode 101: steps = 8 , reward = 1.0
Episode 102: steps = 6 , reward = 1.0
Episode 103: steps = 6 , reward = 1.0
Episode 104: steps = 6 , reward = 1.0
Episode 105: steps = 8 , reward = 1.0
Episode 106: steps = 6 , reward = 1.0
Episode 107: steps = 6 , reward = 1.0
Episode 108: steps = 6 , reward = 1.0
Episode 109: steps = 6 , reward = 1.0
Episode 110: steps = 4 , reward = 0.0
Episode 111: steps = 6 , reward = 1.0
Episode 112: steps = 6 , reward = 1.0
Episode 113: steps = 6 , reward = 1.0
Episode 114: steps = 6 , reward = 1.0
Episode 115: steps = 7 , reward = 1.0
Episode 116: steps = 7 , reward = 1.0
Episode 117: steps = 10 , reward = 1.0
Episode 118: steps = 5 , reward = 0.0
Episode 119: steps = 6 , reward = 1.0
Episode 120: steps = 3 , reward = 0.0
Episode 121: steps = 6 , reward = 1.0
Episode 122: steps = 6 , reward = 1.0
Episode 123: steps = 9 , reward = 1.0
Episode 124: steps = 6 , reward = 1.0
Episode 125: steps = 5 , reward = 0.0
Episode 126: steps = 6 , reward = 1.0
Episode 127: steps = 6 , reward = 1.0
Episode 128: steps = 8 , reward = 1.0
Episode 129: steps = 6 , reward = 1.0
Episode 130: steps = 6 , reward = 1.0
Episode 131: steps = 8 , reward = 1.0
Episode 132: steps = 8 , reward = 1.0
Episode 133: steps = 6 , reward = 1.0
Episode 134: steps = 6 , reward = 1.0
Episode 135: steps = 6 , reward = 1.0
Episode 136: steps = 6 , reward = 1.0
Episode 137: steps = 6 , reward = 1.0
Episode 138: steps = 6 , reward = 1.0
Episode 139: steps = 4 , reward = 0.0
Episode 140: steps = 6 , reward = 1.0
Episode 141: steps = 6 , reward = 1.0
Episode 142: steps = 6 , reward = 1.0
Episode 143: steps = 9 , reward = 1.0
Episode 144: steps = 6 , reward = 1.0
Episode 145: steps = 6 , reward = 1.0
Episode 146: steps = 6 , reward = 1.0
Episode 147: steps = 7 , reward = 1.0
Episode 148: steps = 7 , reward = 1.0
Episode 149: steps = 6 , reward = 1.0
Episode 150: steps = 6 , reward = 1.0
Episode 151: steps = 6 , reward = 1.0
Episode 152: steps = 7 , reward = 1.0
Episode 153: steps = 6 , reward = 1.0
Episode 154: steps = 6 , reward = 1.0
Episode 155: steps = 7 , reward = 1.0
Episode 156: steps = 7 , reward = 1.0
Episode 157: steps = 7 , reward = 1.0
Episode 158: steps = 6 , reward = 1.0
Episode 159: steps = 6 , reward = 1.0
Episode 160: steps = 6 , reward = 1.0
Episode 161: steps = 4 , reward = 0.0
Episode 162: steps = 6 , reward = 1.0
Episode 163: steps = 5 , reward = 0.0
Episode 164: steps = 6 , reward = 1.0
Episode 165: steps = 6 , reward = 1.0
Episode 166: steps = 6 , reward = 1.0
Episode 167: steps = 6 , reward = 1.0
Episode 168: steps = 9 , reward = 1.0
Episode 169: steps = 6 , reward = 1.0
Episode 170: steps = 8 , reward = 1.0
Episode 171: steps = 6 , reward = 1.0
Episode 172: steps = 6 , reward = 1.0
Episode 173: steps = 6 , reward = 1.0
Episode 174: steps = 6 , reward = 1.0
Episode 175: steps = 6 , reward = 1.0
Episode 176: steps = 6 , reward = 1.0
Episode 177: steps = 6 , reward = 1.0
Episode 178: steps = 8 , reward = 1.0
Episode 179: steps = 6 , reward = 1.0
Episode 180: steps = 6 , reward = 1.0
Episode 181: steps = 3 , reward = 0.0
Episode 182: steps = 6 , reward = 1.0
Episode 183: steps = 6 , reward = 1.0
Episode 184: steps = 6 , reward = 1.0
Episode 185: steps = 8 , reward = 1.0
Episode 186: steps = 10 , reward = 1.0
Episode 187: steps = 8 , reward = 1.0
Episode 188: steps = 6 , reward = 1.0
Episode 189: steps = 6 , reward = 1.0
Episode 190: steps = 6 , reward = 1.0
Episode 191: steps = 6 , reward = 1.0
Episode 192: steps = 7 , reward = 1.0
Episode 193: steps = 6 , reward = 1.0
Episode 194: steps = 6 , reward = 1.0
Episode 195: steps = 8 , reward = 1.0
Episode 196: steps = 6 , reward = 1.0
Episode 197: steps = 4 , reward = 0.0
Episode 198: steps = 5 , reward = 0.0
Episode 199: steps = 6 , reward = 1.0
Episode 200: steps = 6 , reward = 1.0
Episode 201: steps = 6 , reward = 1.0
Episode 202: steps = 4 , reward = 0.0
Episode 203: steps = 8 , reward = 1.0
Episode 204: steps = 8 , reward = 1.0
Episode 205: steps = 7 , reward = 1.0
Episode 206: steps = 6 , reward = 1.0
Episode 207: steps = 6 , reward = 1.0
Episode 208: steps = 6 , reward = 1.0
Episode 209: steps = 8 , reward = 1.0
Episode 210: steps = 7 , reward = 1.0
Episode 211: steps = 6 , reward = 1.0
Episode 212: steps = 6 , reward = 1.0
Episode 213: steps = 10 , reward = 1.0
Episode 214: steps = 6 , reward = 1.0
Episode 215: steps = 6 , reward = 1.0
Episode 216: steps = 6 , reward = 1.0
Episode 217: steps = 6 , reward = 1.0
Episode 218: steps = 6 , reward = 1.0
Episode 219: steps = 6 , reward = 1.0
Episode 220: steps = 6 , reward = 1.0
Episode 221: steps = 7 , reward = 1.0
Episode 222: steps = 6 , reward = 1.0
Episode 223: steps = 6 , reward = 1.0
Episode 224: steps = 6 , reward = 1.0
Episode 225: steps = 6 , reward = 1.0
Episode 226: steps = 6 , reward = 1.0
Episode 227: steps = 6 , reward = 1.0
Episode 228: steps = 7 , reward = 1.0
Episode 229: steps = 6 , reward = 1.0
Episode 230: steps = 6 , reward = 1.0
Episode 231: steps = 10 , reward = 1.0
Episode 232: steps = 6 , reward = 1.0
Episode 233: steps = 6 , reward = 1.0
Episode 234: steps = 6 , reward = 1.0
Episode 235: steps = 8 , reward = 1.0
Episode 236: steps = 6 , reward = 1.0
Episode 237: steps = 6 , reward = 1.0
Episode 238: steps = 6 , reward = 1.0
Episode 239: steps = 8 , reward = 1.0
Episode 240: steps = 6 , reward = 1.0
Episode 241: steps = 6 , reward = 1.0
Episode 242: steps = 8 , reward = 1.0
Episode 243: steps = 2 , reward = 0.0
Episode 244: steps = 6 , reward = 1.0
Episode 245: steps = 6 , reward = 1.0
Episode 246: steps = 6 , reward = 1.0
Episode 247: steps = 6 , reward = 1.0
Episode 248: steps = 6 , reward = 1.0
Episode 249: steps = 6 , reward = 1.0
Episode 250: steps = 7 , reward = 1.0
Episode 251: steps = 6 , reward = 1.0
Episode 252: steps = 2 , reward = 0.0
Episode 253: steps = 6 , reward = 1.0
Episode 254: steps = 6 , reward = 1.0
Episode 255: steps = 6 , reward = 1.0
Episode 256: steps = 8 , reward = 1.0
Episode 257: steps = 6 , reward = 1.0
Episode 258: steps = 6 , reward = 1.0
Episode 259: steps = 7 , reward = 1.0
Episode 260: steps = 6 , reward = 1.0
Episode 261: steps = 6 , reward = 1.0
Episode 262: steps = 7 , reward = 1.0
Episode 263: steps = 6 , reward = 1.0
Episode 264: steps = 6 , reward = 1.0
Episode 265: steps = 6 , reward = 1.0
Episode 266: steps = 6 , reward = 1.0
Episode 267: steps = 7 , reward = 1.0
Episode 268: steps = 6 , reward = 1.0
Episode 269: steps = 6 , reward = 1.0
Episode 270: steps = 6 , reward = 1.0
Episode 271: steps = 6 , reward = 1.0
Episode 272: steps = 6 , reward = 1.0
Episode 273: steps = 7 , reward = 1.0
Episode 274: steps = 3 , reward = 0.0
Episode 275: steps = 8 , reward = 1.0
Episode 276: steps = 7 , reward = 1.0
Episode 277: steps = 4 , reward = 0.0
Episode 278: steps = 6 , reward = 1.0
Episode 279: steps = 4 , reward = 0.0
Episode 280: steps = 7 , reward = 1.0
Episode 281: steps = 6 , reward = 1.0
Episode 282: steps = 6 , reward = 1.0
Episode 283: steps = 6 , reward = 1.0
Episode 284: steps = 6 , reward = 1.0
Episode 285: steps = 7 , reward = 1.0
Episode 286: steps = 8 , reward = 1.0
Episode 287: steps = 6 , reward = 1.0
Episode 288: steps = 5 , reward = 0.0
Episode 289: steps = 8 , reward = 1.0
Episode 290: steps = 7 , reward = 1.0
Episode 291: steps = 8 , reward = 1.0
Episode 292: steps = 4 , reward = 0.0
Episode 293: steps = 6 , reward = 1.0
Episode 294: steps = 9 , reward = 1.0
Episode 295: steps = 6 , reward = 1.0
Episode 296: steps = 6 , reward = 1.0
Episode 297: steps = 6 , reward = 0.0
Episode 298: steps = 6 , reward = 1.0
Episode 299: steps = 6 , reward = 1.0
Episode 300: steps = 6 , reward = 1.0
Episode 301: steps = 5 , reward = 0.0
Episode 302: steps = 6 , reward = 1.0
Episode 303: steps = 7 , reward = 1.0
Episode 304: steps = 6 , reward = 1.0
Episode 305: steps = 8 , reward = 1.0
Episode 306: steps = 6 , reward = 1.0
Episode 307: steps = 6 , reward = 1.0
Episode 308: steps = 6 , reward = 1.0
Episode 309: steps = 6 , reward = 1.0
Episode 310: steps = 4 , reward = 0.0
Episode 311: steps = 7 , reward = 1.0
Episode 312: steps = 8 , reward = 1.0
Episode 313: steps = 7 , reward = 1.0
Episode 314: steps = 6 , reward = 1.0
Episode 315: steps = 6 , reward = 1.0
Episode 316: steps = 7 , reward = 1.0
Episode 317: steps = 6 , reward = 1.0
Episode 318: steps = 6 , reward = 1.0
Episode 319: steps = 6 , reward = 1.0
Episode 320: steps = 6 , reward = 1.0
Episode 321: steps = 6 , reward = 1.0
Episode 322: steps = 7 , reward = 1.0
Episode 323: steps = 6 , reward = 1.0
Episode 324: steps = 6 , reward = 1.0
Episode 325: steps = 6 , reward = 1.0
Episode 326: steps = 6 , reward = 1.0
Episode 327: steps = 6 , reward = 1.0
Episode 328: steps = 6 , reward = 1.0
Episode 329: steps = 6 , reward = 1.0
Episode 330: steps = 6 , reward = 1.0
Episode 331: steps = 6 , reward = 1.0
Episode 332: steps = 6 , reward = 1.0
Episode 333: steps = 6 , reward = 1.0
Episode 334: steps = 3 , reward = 0.0
Episode 335: steps = 6 , reward = 1.0
Episode 336: steps = 6 , reward = 1.0
Episode 337: steps = 4 , reward = 0.0
Episode 338: steps = 6 , reward = 1.0
Episode 339: steps = 8 , reward = 1.0
Episode 340: steps = 6 , reward = 1.0
Episode 341: steps = 6 , reward = 1.0
Episode 342: steps = 6 , reward = 1.0
Episode 343: steps = 6 , reward = 1.0
Episode 344: steps = 6 , reward = 1.0
Episode 345: steps = 6 , reward = 1.0
Episode 346: steps = 6 , reward = 1.0
Episode 347: steps = 6 , reward = 1.0
Episode 348: steps = 6 , reward = 1.0
Episode 349: steps = 6 , reward = 1.0
Episode 350: steps = 6 , reward = 1.0
Episode 351: steps = 7 , reward = 1.0
Episode 352: steps = 6 , reward = 1.0
Episode 353: steps = 10 , reward = 1.0
Episode 354: steps = 3 , reward = 0.0
Episode 355: steps = 7 , reward = 1.0
Episode 356: steps = 7 , reward = 1.0
Episode 357: steps = 6 , reward = 1.0
Episode 358: steps = 2 , reward = 0.0
Episode 359: steps = 6 , reward = 1.0
Episode 360: steps = 6 , reward = 1.0
Episode 361: steps = 6 , reward = 1.0
Episode 362: steps = 7 , reward = 1.0
Episode 363: steps = 8 , reward = 1.0
Episode 364: steps = 6 , reward = 1.0
Episode 365: steps = 2 , reward = 0.0
Episode 366: steps = 6 , reward = 1.0
Episode 367: steps = 5 , reward = 0.0
Episode 368: steps = 6 , reward = 1.0
Episode 369: steps = 6 , reward = 1.0
Episode 370: steps = 6 , reward = 1.0
Episode 371: steps = 6 , reward = 1.0
Episode 372: steps = 6 , reward = 1.0
Episode 373: steps = 6 , reward = 1.0
Episode 374: steps = 8 , reward = 1.0
Episode 375: steps = 9 , reward = 1.0
Episode 376: steps = 6 , reward = 0.0
Episode 377: steps = 6 , reward = 1.0
Episode 378: steps = 6 , reward = 1.0
Episode 379: steps = 8 , reward = 1.0
Episode 380: steps = 6 , reward = 1.0
Episode 381: steps = 6 , reward = 1.0
Episode 382: steps = 6 , reward = 1.0
Episode 383: steps = 6 , reward = 1.0
Episode 384: steps = 6 , reward = 1.0
Episode 385: steps = 6 , reward = 1.0
Episode 386: steps = 8 , reward = 1.0
Episode 387: steps = 6 , reward = 1.0
Episode 388: steps = 6 , reward = 1.0
Episode 389: steps = 2 , reward = 0.0
Episode 390: steps = 6 , reward = 1.0
Episode 391: steps = 6 , reward = 1.0
Episode 392: steps = 6 , reward = 1.0
Episode 393: steps = 6 , reward = 1.0
Episode 394: steps = 7 , reward = 1.0
Episode 395: steps = 6 , reward = 1.0
Episode 396: steps = 6 , reward = 1.0
Episode 397: steps = 6 , reward = 1.0
Episode 398: steps = 6 , reward = 1.0
Episode 399: steps = 7 , reward = 1.0
Episode 400: steps = 6 , reward = 1.0
Episode 401: steps = 6 , reward = 1.0
Episode 402: steps = 6 , reward = 1.0
Episode 403: steps = 6 , reward = 1.0
Episode 404: steps = 8 , reward = 1.0
Episode 405: steps = 6 , reward = 1.0
Episode 406: steps = 6 , reward = 1.0
Episode 407: steps = 6 , reward = 1.0
Episode 408: steps = 6 , reward = 1.0
Episode 409: steps = 6 , reward = 1.0
Episode 410: steps = 6 , reward = 1.0
Episode 411: steps = 6 , reward = 1.0
Episode 412: steps = 6 , reward = 1.0
Episode 413: steps = 6 , reward = 1.0
Episode 414: steps = 6 , reward = 1.0
Episode 415: steps = 9 , reward = 1.0
Episode 416: steps = 6 , reward = 1.0
Episode 417: steps = 4 , reward = 0.0
Episode 418: steps = 6 , reward = 1.0
Episode 419: steps = 6 , reward = 1.0
Episode 420: steps = 7 , reward = 1.0
Episode 421: steps = 6 , reward = 1.0
Episode 422: steps = 6 , reward = 1.0
Episode 423: steps = 10 , reward = 1.0
Episode 424: steps = 6 , reward = 1.0
Episode 425: steps = 6 , reward = 1.0
Episode 426: steps = 8 , reward = 1.0
Episode 427: steps = 6 , reward = 1.0
Episode 428: steps = 9 , reward = 1.0
Episode 429: steps = 6 , reward = 1.0
Episode 430: steps = 4 , reward = 0.0
Episode 431: steps = 6 , reward = 1.0
Episode 432: steps = 6 , reward = 1.0
Episode 433: steps = 6 , reward = 1.0
Episode 434: steps = 6 , reward = 1.0
Episode 435: steps = 8 , reward = 1.0
Episode 436: steps = 6 , reward = 1.0
Episode 437: steps = 6 , reward = 1.0
Episode 438: steps = 6 , reward = 1.0
Episode 439: steps = 8 , reward = 1.0
Episode 440: steps = 2 , reward = 0.0
Episode 441: steps = 6 , reward = 1.0
Episode 442: steps = 10 , reward = 1.0
Episode 443: steps = 6 , reward = 1.0
Episode 444: steps = 6 , reward = 1.0
Episode 445: steps = 8 , reward = 1.0
Episode 446: steps = 6 , reward = 1.0
Episode 447: steps = 6 , reward = 1.0
Episode 448: steps = 5 , reward = 0.0
Episode 449: steps = 6 , reward = 1.0
Episode 450: steps = 8 , reward = 1.0
Episode 451: steps = 6 , reward = 1.0
Episode 452: steps = 8 , reward = 1.0
Episode 453: steps = 8 , reward = 1.0
Episode 454: steps = 7 , reward = 1.0
Episode 455: steps = 5 , reward = 0.0
Episode 456: steps = 6 , reward = 1.0
Episode 457: steps = 6 , reward = 1.0
Episode 458: steps = 8 , reward = 1.0
Episode 459: steps = 8 , reward = 1.0
Episode 460: steps = 10 , reward = 1.0
Episode 461: steps = 8 , reward = 1.0
Episode 462: steps = 7 , reward = 1.0
Episode 463: steps = 7 , reward = 1.0
Episode 464: steps = 6 , reward = 1.0
Episode 465: steps = 6 , reward = 1.0
Episode 466: steps = 6 , reward = 1.0
Episode 467: steps = 6 , reward = 1.0
Episode 468: steps = 6 , reward = 1.0
Episode 469: steps = 6 , reward = 1.0
Episode 470: steps = 3 , reward = 0.0
Episode 471: steps = 7 , reward = 1.0
Episode 472: steps = 6 , reward = 1.0
Episode 473: steps = 6 , reward = 1.0
Episode 474: steps = 7 , reward = 1.0
Episode 475: steps = 6 , reward = 1.0
Episode 476: steps = 8 , reward = 1.0
Episode 477: steps = 6 , reward = 1.0
Episode 478: steps = 6 , reward = 1.0
Episode 479: steps = 6 , reward = 1.0
Episode 480: steps = 6 , reward = 1.0
Episode 481: steps = 6 , reward = 1.0
Episode 482: steps = 6 , reward = 1.0
Episode 483: steps = 6 , reward = 1.0
Episode 484: steps = 5 , reward = 0.0
Episode 485: steps = 6 , reward = 1.0
Episode 486: steps = 9 , reward = 1.0
Episode 487: steps = 7 , reward = 1.0
Episode 488: steps = 6 , reward = 1.0
Episode 489: steps = 6 , reward = 1.0
Episode 490: steps = 6 , reward = 1.0
Episode 491: steps = 6 , reward = 1.0
Episode 492: steps = 9 , reward = 1.0
Episode 493: steps = 6 , reward = 1.0
Episode 494: steps = 6 , reward = 1.0
Episode 495: steps = 9 , reward = 1.0
Episode 496: steps = 6 , reward = 1.0
Episode 497: steps = 6 , reward = 1.0
Episode 498: steps = 6 , reward = 1.0
Episode 499: steps = 7 , reward = 1.0
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
test reward = 1.0

6. Analysis of the results

We can look at the Q table as it stands at the end of training:

print(agent.Q)

Output:

[[0.27140285 0.4364344  0.09145568 0.15201279]
 [0.26813138 0.         0.         0.00945424]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.26636559 0.51632351 0.         0.13684245]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.33346755 0.         0.68004322 0.31572772]
 [0.26970648 0.77477987 0.35436455 0.        ]
 [0.04662094 0.73217092 0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.39939922 0.88159607 0.11581402]
 [0.4472322  0.72976712 1.         0.40947544]
 [0.         0.         0.         0.        ]]

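The layout of the 16 cells:

SFFF
FHFH
FFFH
HFFG

Here S is the start, F is frozen ground that is safe to walk on, H is a hole (falling in ends the episode), and G is the goal (reaching it wins the game)

The index of each cell: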

0  1  2  3
4  5  6  7
8  9  10 11
12 13 14 15
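The numbering runs row by row, so a state index can be converted to a row and column with integer division. A minimal sketch, using the map shown in this article; the names GRID and state_to_cell are made up for illustration:

```python
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]  # map copied from the article
N_COLS = 4

def state_to_cell(state):
    """Return (row, col, tile) for a state index on the 4x4 map."""
    row, col = divmod(state, N_COLS)
    return row, col, GRID[row][col]

print(state_to_cell(0))
print(state_to_cell(5))
print(state_to_cell(15))
```

Each row of the Q table uses the same index, so row 5 of the table belongs to the hole in the second row, second column.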

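So when the test starts, the agent is in cell 0, where the Q values of the 4 actions are:

[0.27140285 0.4364344  0.09145568 0.15201279]

These 4 Q values correspond to: 0 left, 1 down, 2 right, 3 up

The largest value, 0.4364344, belongs to action 1, so the agent moves down one cell

It is now in cell 4:

[0.26636559 0.51632351 0.         0.13684245]

It chooses action 1 (down) and reaches cell 8:

[0.33346755 0.         0.68004322 0.31572772]

It chooses action 2 (right) and reaches cell 9:

[0.26970648 0.77477987 0.35436455 0.        ]

It chooses action 1 (down) and reaches cell 13:

[0.         0.39939922 0.88159607 0.11581402]

It chooses action 2 (right) and reaches cell 14:

[0.4472322  0.72976712 1.         0.40947544]

It chooses action 2 (right) and reaches cell 15: the goal!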
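The walk above can also be reproduced in code. The sketch below copies the Q table printed earlier and follows the largest Q value from cell 0, using a hand-written deterministic move rule in place of the real environment (valid only because is_slippery=False). The names MOVES, next_state and greedy_path are made up for illustration, and np.argmax takes the first maximum instead of breaking ties randomly as predict() does; there are no ties along this path.

```python
import numpy as np

# Q table as printed in the article
Q = np.array([
    [0.27140285, 0.4364344, 0.09145568, 0.15201279],
    [0.26813138, 0.0, 0.0, 0.00945424],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.26636559, 0.51632351, 0.0, 0.13684245],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.33346755, 0.0, 0.68004322, 0.31572772],
    [0.26970648, 0.77477987, 0.35436455, 0.0],
    [0.04662094, 0.73217092, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.39939922, 0.88159607, 0.11581402],
    [0.4472322, 0.72976712, 1.0, 0.40947544],
    [0.0, 0.0, 0.0, 0.0],
])

# (row, column) offsets for each action: 0 left, 1 down, 2 right, 3 up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def next_state(state, action, size=4):
    """Deterministic move on the grid; walking into a wall keeps the agent in place."""
    row, col = divmod(state, size)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), size - 1)
    col = min(max(col + d_col, 0), size - 1)
    return row * size + col

def greedy_path(Q, start=0, goal=15, max_steps=20):
    """Follow the largest Q value from start until the goal or the step limit.

    Holes are not modelled here; the path in this article never enters one.
    """
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        action = int(np.argmax(Q[state]))
        state = next_state(state, action)
        path.append(state)
    return path

path = greedy_path(Q)
print(path)
```

The printed path visits cells 0, 4, 8, 9, 13, 14 and 15, matching the Down, Down, Right, Down, Right, Right sequence rendered during the test.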
