Q-Learning: The Algorithm in Detail

Core Algorithm

The core idea of Q-Learning

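Q-Learning approaches the optimal policy by driving Q(s,a) toward the target r + γ·max_a' Q(s',a'), where r is the immediate reward, γ the discount factor, and s' the next state.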
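Written out in full, this relation is the Bellman optimality equation, which holds in expectation over the next state and reward. The notation below (Q^* for the optimal action-value function and the conditional expectation) is standard, but it is an addition rather than something taken from the original notes.

```latex
Q^{*}(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \;\middle|\; s, a \,\right]
```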


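# Pseudocode
initialize Q(s,a) = 0 for all s, a
while not converged:
    s = env.reset()
    done = False
    while not done:
        a = ε-greedy(Q, s)
        s', r, done = env.step(a)
        # TD update; treat max_a' Q(s',a') as 0 when s' is terminal
        Q(s,a) = Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
        s = s'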
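
The loop above can be sketched as runnable Python. This is a minimal illustration, not a reference implementation: the 5-state chain environment, the `step` and `epsilon_greedy` helpers, and the hyperparameter values are all assumptions made for the example.

```python
import random

# Hypothetical 5-state chain: start in state 0; reaching state 4 ends the
# episode with reward 1. Actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Deterministic transition; returns (next_state, reward, done)."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, else act greedily (random tie-break)."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # Q-table, zero-initialized
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Terminal states contribute no bootstrap value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state  # advance to the next state
    return q

q_table = train()
# Greedy action in each non-terminal state.
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a])
                 for s in range(GOAL)]
```

On this toy chain the learned greedy policy should move right in every non-terminal state, since that is the only way to reach the rewarding goal.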

Performance Comparison

Experimental comparison: Q-Learning vs DQN and Double DQN

Algorithm	Average reward	Convergence speed	Stability
Q-Learning	85.2	Medium	Low
DQN	92.7	Fast	High
Double DQN	94.1	Faster	Very high

Key Innovation

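Key improvement: experience replay

Problems with the traditional approach

Consecutive samples are highly correlated
Data is used inefficiently
Training is unstable

DQN's solution

Store transitions in a replay buffer
Sample at random to reduce correlation
Reuse stored data to improve utilization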
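
The three points above can be sketched as a small replay buffer. This is a minimal sketch under assumptions: the `ReplayBuffer` class name, the transition tuple layout, and the capacity and batch size are illustrative choices, not details from the original notes.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up runs of consecutive,
        # highly correlated transitions.
        batch = self.rng.sample(list(self.buffer), batch_size)
        # Transpose the list of transitions into per-field tuples.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

# Usage: fill the buffer past capacity with dummy transitions, then sample.
buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(150):
    buffer.push(t, t % 2, 0.0, t + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(8)
```

Because each stored transition can be drawn in many different mini-batches, the same experience contributes to several updates, which is where the gain in data utilization comes from.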