Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning
Published: 2019-03-06


Reading notes, Chapters 6-7.

If the mathematical notation is unfamiliar, see the earlier notes on notation first.

Temporal-Difference Learning in Brief

Temporal-difference learning combines ideas from dynamic programming and Monte Carlo methods, and is a core idea of reinforcement learning.

The term "temporal difference" is not very intuitive. Reading it as "learning from the difference at the current time" is more vivid - the method learns from the current difference (error) signal.

The Monte Carlo method simulates (or experiences) an episode and, after the episode ends, estimates each state's value from the return observed after visiting that state.

Temporal-difference learning also simulates (or experiences) an episode, but after each step (or each few steps) it uses the value of the new state to update the estimate of the preceding state.
The Monte Carlo method can be viewed as temporal-difference learning with the maximum possible number of steps.
This chapter considers only one-step temporal-difference learning; multi-step temporal-difference learning is covered in the next chapter.

Mathematical Formulation

From what we already know: if we can compute a policy's value (the state value \(v_{\pi}(s)\) or the action value \(q_{\pi}(s, a)\) of a policy \(\pi\)), we can improve the policy.
In the Monte Carlo method, computing the policy's value requires completing an episode and using the episode's return \(G_t\) to update the state's value. The Monte Carlo update is:

\[V(S_t) \gets V(S_t) + \alpha \delta_t \\\delta_t = G_t - V(S_t) \\\text{where} \\\delta_t \text{ - Monte Carlo error} \\\alpha \text{ - learning step size}\]

The idea of temporal difference is to compute a state's value from the value of the next state, giving an iterative (bootstrapping) update. The TD(0) update is:

\[V(S_t) \gets V(S_t) + \alpha \delta_t \\\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\\text{where} \\\delta_t \text{ - TD error} \\\alpha \text{ - learning step size} \\\gamma \text{ - reward discount rate}\]

Note: the book points out that the TD error is not an exact error, whereas the Monte Carlo error is exact. This is worth being aware of, but it is not elaborated here.
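To see the difference between the two updates, here is a small, self-contained Python illustration of a single Monte Carlo update versus a single TD(0) update; the numbers are made up purely for illustration.

```python
# One Monte Carlo update vs. one TD(0) update for a single state S_t.
alpha, gamma = 0.1, 0.9
V_St, V_St1 = 0.5, 0.8        # current estimates V(S_t), V(S_{t+1})
G_t = 1.7                     # full return observed after S_t (Monte Carlo)
R_t1 = 1.0                    # one-step reward R_{t+1} (TD)

mc_update = V_St + alpha * (G_t - V_St)                   # uses the whole return
td_update = V_St + alpha * (R_t1 + gamma * V_St1 - V_St)  # bootstraps from V(S_{t+1})
print(mc_update, td_update)   # approximately 0.62 and 0.622
```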

Temporal-Difference Learning Methods

Chapter 6 of the book covers the one-step temporal-difference methods; the multi-step methods are covered in Chapter 7. The methods are:

  • TD learning of the policy's state value \(v_{\pi}\) (one-step / multi-step)
  • On-policy TD learning of the policy's action value \(q_{\pi}\): Sarsa (one-step / multi-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\): Q-learning (one-step)
  • Double Q-learning (one-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\) (with importance sampling): Sarsa (multi-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\) (without importance sampling): Tree Backup Algorithm (multi-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\): \(Q(\sigma)\) (multi-step)

TD learning of the policy's state value \(v_{\pi}\)

One-step TD learning: TD(0)

  • Flow diagram
[Figure: Reinforcement Learning - TD(0) backup diagram. Taking action A in state S yields reward R and next state S'; V(S) is updated from R and V(S').]
  • Algorithm description

Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}^+\)

Repeat (for each episode):
  Initialize \(S\)
  Repeat (for each step of episode):
   \(A \gets\) action given by \(\pi\) for \(S\)
   Take action \(A\), observe \(R, S'\)
   \(V(S) \gets V(S) + \alpha [R + \gamma V(S') - V(S)]\)
   \(S \gets S'\)
  Until \(S\) is terminal
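As a concrete companion to the pseudocode, here is a minimal tabular TD(0) sketch in Python. The gym-style environment interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and the `policy` callable are assumptions for illustration, not part of the original post.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (a sketch, assuming a gym-style env)."""
    V = defaultdict(float)                      # V(s), initialized to 0 for all states
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)              # A given by pi for S
            next_state, reward, done = env.step(action)
            # V(terminal) is treated as 0, so the target is just R on the last step.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```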

Multi-step TD learning (n-step TD)

  • Flow diagram
[Figure: Reinforcement Learning - n-step TD backup diagram. Starting from S, n actions and rewards \(R_1, \dots, R_n\) lead to \(S_n\); V(S) is updated from the discounted rewards and \(V(S_n)\).]
  • Algorithm description

Input: the policy \(\pi\) to be evaluated

Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}\)
Parameters: step size \(\alpha \in (0, 1]\), a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n+1\)

Repeat (for each episode):

  Initialize and store \(S_0 \ne terminal\)
\(T \gets \infty\)
  For \(t = 0,1,2,\cdots\):
   If \(t < T\), then:
    Take an action according to \(\pi(\cdot \mid S_t)\)
    Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
    If \(S_{t+1}\) is terminal, then \(T \gets t+1\)
   \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
   If \(\tau \ge 0\):
    \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
    If \(\tau + n < T\), then: \(G \gets G + \gamma^{n}V(S_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
    \(V(S_{\tau}) \gets V(S_{\tau}) + \alpha [G - V(S_{\tau})]\)
  Until \(\tau = T - 1\)

Note that the update of \(V(S_0)\) uses the rewards \(R_1, \dots, R_n\) together with \(V(S_n)\); the update of \(V(S_1)\) uses \(R_2, \dots, R_{n+1}\) together with \(V(S_{n+1})\); and so on.
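Below is a minimal n-step TD prediction sketch under the same assumed gym-style interface. Instead of the mod-indexing in the pseudocode, it simply stores the whole trajectory; `R[0]` is a dummy entry so that `R[i]` corresponds to \(R_i\).

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, n=4, episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular n-step TD policy evaluation (a sketch, assuming a gym-style env)."""
    V = defaultdict(float)
    for _ in range(episodes):
        S, R = [env.reset()], [0.0]              # R[0] is a dummy placeholder
        T, t = float('inf'), 0
        while True:
            if t < T:                            # still interacting with the environment
                s_next, r, done = env.step(policy(S[t]))
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate is being updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                  # bootstrap only if S_{tau+n} is not terminal
                    G += gamma ** n * V[S[tau + n]]
                V[S[tau]] += alpha * (G - V[S[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```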

On-policy TD learning of the policy's action value \(q_{\pi}\): Sarsa

One-step TD method

  • Flow diagram
[Figure: Reinforcement Learning - Sarsa backup diagram. From (S, A) the agent observes R and S' and chooses A'; Q(S, A) is updated from R and Q(S', A').]
  • Algorithm description

Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(terminal, \cdot) = 0\)

Repeat (for each episode):
  Initialize \(S\)
  Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
  Repeat (for each step of episode):
   Take action \(A\), observe \(R, S'\)
   Choose \(A'\) from \(S'\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
   \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]\)
   \(S \gets S'; A \gets A'\)
  Until \(S\) is terminal
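A minimal tabular Sarsa sketch in Python follows. The gym-style `env`, the discrete `actions` list, and the `epsilon_greedy` helper are assumptions for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular one-step Sarsa, on-policy TD control (a sketch)."""
    Q = defaultdict(float)                      # Q(s, a); Q(terminal, .) stays 0
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward                 # no bootstrap from the terminal state
            else:
                next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            if not done:
                state, action = next_state, next_action
    return Q
```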

Multi-step TD method

  • Flow diagram
[Figure: Reinforcement Learning - n-step Sarsa backup diagram. From (S, A), n actions and rewards lead to \((S_n, A_n)\); Q(S, A) is updated from the discounted rewards and \(Q(S_n, A_n)\).]
  • Algorithm description

Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)

Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
  small \(\epsilon > 0\)
  a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n+1\)

Repeat (for each episode):

  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \pi(\cdot \mid S_0)\)
\(T \gets \infty\)
  For \(t = 0,1,2,\cdots\):
   If \(t < T\), then:
    Take an action \(A_t\)
    Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
    If \(S_{t+1}\) is terminal, then:
     \(T \gets t+1\)
    Else:
     Select and store an action \(A_{t+1} \sim \pi(\cdot \mid S_{t+1})\)
   \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
   If \(\tau \ge 0\):
    \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
    If \(\tau + n < T\), then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
    \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
    If \(\pi\) is being learned, then ensure that \(\pi(\cdot \mid S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
  Until \(\tau = T - 1\)
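A compact n-step Sarsa sketch, mirroring the n-step TD sketch above but with \(Q(S, A)\) in place of \(V(S)\); it reuses the `epsilon_greedy` helper defined in the one-step sketch, and the gym-style `env` and `actions` list are again assumptions.

```python
from collections import defaultdict

def n_step_sarsa(env, actions, n=4, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular n-step Sarsa (a sketch); epsilon_greedy is the helper defined above."""
    Q = defaultdict(float)
    for _ in range(episodes):
        S = [env.reset()]
        A = [epsilon_greedy(Q, S[0], actions, epsilon)]
        R = [0.0]                                # dummy R_0
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(epsilon_greedy(Q, s_next, actions, epsilon))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(S[tau + n], A[tau + n])]
                Q[(S[tau], A[tau])] += alpha * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q
```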

Off-policy TD learning of the policy's action value \(q_{\pi}\): Q-learning

Q-learning (Watkins, 1989) was a breakthrough algorithm. It performs off-policy learning using the following update:

\[Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \underset{a}{max} \ Q(S_{t+1}, a) - Q(S_t, A_t)]\]

One-step TD method

  • Algorithm description

Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(terminal, \cdot) = 0\)

Repeat (for each episode):
  Initialize \(S\)
  Repeat (for each step of episode):
   Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
   Take action \(A\), observe \(R, S'\)
   \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \underset{a}{max} \ Q(S', a) - Q(S, A)]\)
   \(S \gets S'\)
  Until \(S\) is terminal
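A minimal tabular Q-learning sketch under the same assumed gym-style interface; the behavior policy is ε-greedy, while the bootstrap target uses the greedy (max) action value.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular one-step Q-learning, off-policy TD control (a sketch)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q ...
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # ... but bootstrap from the greedy (max) action value of the next state.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```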

  • Q-learning uses a max operator, which introduces a maximization bias problem.
    See Example 6.7 in the book for details.
    Double Q-learning eliminates this problem.

Double Q-learning

One-step TD method

Initialize \(Q_1(s, a)\) and \(Q_2(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily

Initialize \(Q_1(terminal, \cdot) = Q_2(terminal, \cdot) = 0\)
Repeat (for each episode):
  Initialize \(S\)
  Repeat (for each step of episode):
   Choose \(A\) from \(S\) using policy derived from \(Q_1\) and \(Q_2\) (e.g. \(\epsilon\)-greedy)
   Take action \(A\), observe \(R, S'\)
   With 0.5 probability:
    \(Q_1(S, A) \gets Q_1(S, A) + \alpha [R + \gamma Q_2(S', \underset{a}{argmax} \ Q_1(S', a)) - Q_1(S, A)]\)
   Else:
    \(Q_2(S, A) \gets Q_2(S, A) + \alpha [R + \gamma Q_1(S', \underset{a}{argmax} \ Q_2(S', a)) - Q_2(S, A)]\)
   \(S \gets S'\)
  Until \(S\) is terminal
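A tabular Double Q-learning sketch under the same assumptions; one estimate selects the argmax action and the other evaluates it, which removes the maximization bias.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Double Q-learning (a sketch, assuming a gym-style env)."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5 update Q1 using Q2 to evaluate, else the reverse.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = reward
            else:
                a_star = max(actions, key=lambda a: A[(next_state, a)])
                target = reward + gamma * B[(next_state, a_star)]
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```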

Off-policy TD learning of the policy's action value \(q_{\pi}\) (with importance sampling): Sarsa

Taking importance sampling into account, the ratio \(\rho\) is introduced into the Sarsa algorithm, which turns it into an off-policy method.

\(\rho\) - importance sampling ratio

\[\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad \qquad (\rho_{\tau+n}^{(\tau+1)})\]

Multi-step TD method

  • Algorithm description

Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)

Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
  small \(\epsilon > 0\)
  a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n+1\)

Repeat (for each episode):

  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \mu(\cdot \mid S_0)\)
\(T \gets \infty\)
  For \(t = 0,1,2,\cdots\):
   If \(t < T\), then:
    Take an action \(A_t\)
    Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
    If \(S_{t+1}\) is terminal, then:
     \(T \gets t+1\)
    Else:
     Select and store an action \(A_{t+1} \sim \mu(\cdot \mid S_{t+1})\)
   \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
   If \(\tau \ge 0\):
    \(\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad \qquad (\rho_{\tau+n}^{(\tau+1)})\)
    \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
    If \(\tau + n < T\), then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
    \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
    If \(\pi\) is being learned, then ensure that \(\pi(\cdot \mid S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
  Until \(\tau = T - 1\)
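A sketch of off-policy n-step Sarsa. The target policy `pi_probs(state)` and behavior policy `mu_probs(state)` are assumed callables returning `{action: probability}` dicts, and `sample` draws from such a dict; all of these names are illustrative, not from the original post.

```python
import random
from collections import defaultdict

def sample(probs):
    """Draw a key from an {action: probability} dict."""
    r, acc = random.random(), 0.0
    for a, p in probs.items():
        acc += p
        if r < acc:
            return a
    return a                                     # guard against rounding error

def off_policy_n_step_sarsa(env, pi_probs, mu_probs, n=4,
                            episodes=1000, alpha=0.1, gamma=1.0):
    """n-step Sarsa driven by behavior policy mu, corrected toward pi via rho."""
    Q = defaultdict(float)
    for _ in range(episodes):
        S = [env.reset()]
        A = [sample(mu_probs(S[0]))]
        R = [0.0]                                # dummy R_0
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(sample(mu_probs(s_next)))
            tau = t - n + 1
            if tau >= 0:
                rho = 1.0                        # importance sampling ratio
                for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):
                    rho *= pi_probs(S[i])[A[i]] / mu_probs(S[i])[A[i]]
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(S[tau + n], A[tau + n])]
                Q[(S[tau], A[tau])] += alpha * rho * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q
```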

Expected Sarsa

  • Flow diagram
[Figure: Reinforcement Learning - Expected Sarsa backup diagram. Like the Sarsa diagram, but the final backup node is the expectation \(\sum_a \pi(a|S_n) Q(S_n, a)\) rather than a sampled \(Q(S_n, A_n)\).]
  • Algorithm description omitted; Expected Sarsa simply replaces the sampled \(Q(S', A')\) in the Sarsa target with the expectation \(\sum_a \pi(a|S') Q(S', a)\).
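A minimal sketch of the Expected Sarsa backup as a single update function; `pi_probs(state)` returning an `{action: probability}` dict is an assumed callable, as in the earlier sketches.

```python
def expected_sarsa_update(Q, state, action, reward, next_state, done,
                          pi_probs, alpha=0.1, gamma=1.0):
    """One Expected Sarsa backup: bootstrap from the expectation over pi(.|S')."""
    if done:
        target = reward
    else:
        target = reward + gamma * sum(p * Q[(next_state, a)]
                                      for a, p in pi_probs(next_state).items())
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```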

Off-policy TD learning of the policy's action value \(q_{\pi}\) (without importance sampling): Tree Backup Algorithm

The idea of the Tree Backup algorithm is to back up the expected action value at every step.

Taking the expected action value means evaluating every possible action \(a\) under the target policy at each step.

Multi-step TD method

  • Flow diagram
[Figure: Reinforcement Learning - Tree Backup diagram. At every intermediate state the backup uses the expectation \(\sum_a \pi(a|\cdot) Q(\cdot, a)\), ending with \(\sum_a \pi(a|S_n) Q(S_n, a)\).]
  • Algorithm description

Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)

Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
  small \(\epsilon > 0\)
  a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n+1\)

Repeat (for each episode):

  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \pi(\cdot \mid S_0)\)
\(Q_0 \gets Q(S_0, A_0)\)
\(T \gets \infty\)
  For \(t = 0,1,2,\cdots\):
   If \(t < T\), then:
    Take an action \(A_t\)
    Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
    If \(S_{t+1}\) is terminal, then:
     \(T \gets t+1\)
     \(\delta_t \gets R_{t+1} - Q_t\)
    Else:
     \(\delta_t \gets R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1},a) - Q_t\)
     Select arbitrarily and store an action as \(A_{t+1}\)
     \(Q_{t+1} \gets Q(S_{t+1},A_{t+1})\)
     \(\pi_{t+1} \gets \pi(A_{t+1} \mid S_{t+1})\)
   \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
   If \(\tau \ge 0\):
    \(E \gets 1\)
    \(G \gets Q_{\tau}\)
    For \(k=\tau, \dots, \min(\tau + n - 1, T - 1)\):
     \(G \gets\ G + E \delta_k\)
     \(E \gets\ \gamma E \pi_{k+1}\)
    \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
    If \(\pi\) is being learned, then ensure that \(\pi(a \mid S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
  Until \(\tau = T - 1\)
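A sketch of n-step Tree Backup under the same assumptions; `pi_probs(state)` gives the target policy's `{action: probability}` dict, while the behavior here simply picks actions uniformly at random (any behavior works, since no importance sampling is needed).

```python
import random
from collections import defaultdict

def n_step_tree_backup(env, actions, pi_probs, n=4,
                       episodes=1000, alpha=0.1, gamma=1.0):
    """n-step Tree Backup: off-policy TD control without importance sampling (a sketch)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        S, A, R = [env.reset()], [random.choice(actions)], [0.0]
        delta, pi_a = {}, {}                     # per-step TD errors and pi(A_k | S_k)
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                q_t = Q[(S[t], A[t])]
                if done:
                    T = t + 1
                    delta[t] = r - q_t
                else:
                    expected = sum(pi_probs(s_next)[a] * Q[(s_next, a)] for a in actions)
                    delta[t] = r + gamma * expected - q_t
                    A.append(random.choice(actions))           # arbitrary behavior action
                    pi_a[t + 1] = pi_probs(s_next)[A[t + 1]]
            tau = t - n + 1
            if tau >= 0:
                E, G = 1.0, Q[(S[tau], A[tau])]
                for k in range(tau, min(tau + n - 1, T - 1) + 1):
                    G += E * delta[k]
                    E *= gamma * pi_a.get(k + 1, 0.0)          # 0 past the terminal step
                Q[(S[tau], A[tau])] += alpha * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q
```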

Off-policy TD learning of the policy's action value \(q_{\pi}\): \(Q(\sigma)\)

\(Q(\sigma)\) unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm, taking importance sampling into account.

When \(\sigma = 1\), the step uses the importance-sampling Sarsa backup.
When \(\sigma = 0\), the step uses the Tree Backup expected-value backup.

Multi-step TD method

  • Flow diagram
[Figure: Reinforcement Learning - Q(\(\sigma\)) backup diagram. At each step the backup uses either a sampled action value \(Q(\cdot, A)\) (\(\sigma = 1\)) or the expectation \(\sum_a \pi(a|\cdot) Q(\cdot, a)\) (\(\sigma = 0\)), ending at \(S_n\).]
  • Algorithm description

Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)

Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
  small \(\epsilon > 0\)
  a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n+1\)

Repeat (for each episode):

  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \mu(\cdot \mid S_0)\)
\(Q_0 \gets Q(S_0, A_0)\)
\(T \gets \infty\)
  For \(t = 0,1,2,\cdots\):
   If \(t < T\), then:
    Take an action \(A_t\)
    Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
    If \(S_{t+1}\) is terminal, then:
     \(T \gets t+1\)
     \(\delta_t \gets R_{t+1} - Q_t\)
    Else:
     Select and store an action \(A_{t+1} \sim \mu(\cdot \mid S_{t+1})\)
     Select and store \(\sigma_{t+1}\)
     \(Q_{t+1} \gets Q(S_{t+1},A_{t+1})\)
     \(\delta_t \gets R_{t+1} + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1})\sum_a \pi(a|S_{t+1})Q(S_{t+1},a) - Q_t\)
     \(\pi_{t+1} \gets \pi(A_{t+1} \mid S_{t+1})\)
     \(\rho_{t+1} \gets \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}\)
   \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
   If \(\tau \ge 0\):
    \(\rho \gets 1\)
    \(E \gets 1\)
    \(G \gets Q_{\tau}\)
    For \(k=\tau, \dots, \min(\tau + n - 1, T - 1)\):
     \(G \gets G + E \delta_k\)
     \(E \gets \gamma E [(1 - \sigma_{k+1})\pi_{k+1} + \sigma_{k+1}]\)
     \(\rho \gets \rho(1 - \sigma_{k} + \sigma_{k}\rho_{k})\)
    \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
    If \(\pi\) is being learned, then ensure that \(\pi(a \mid S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
  Until \(\tau = T - 1\)
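The key step of \(Q(\sigma)\) is how \(\sigma\) blends the two kinds of backup. A minimal sketch of the per-step TD error, with `pi_probs(state)` an assumed `{action: probability}` callable as in the earlier sketches:

```python
def q_sigma_delta(Q, actions, reward, next_state, next_action, q_t,
                  sigma, pi_probs, gamma=1.0):
    """TD error for one Q(sigma) step: sigma=1 gives the (Sarsa-style) sampled
    target, sigma=0 gives the (Tree-Backup-style) expected target."""
    sampled = Q[(next_state, next_action)]
    expected = sum(pi_probs(next_state)[a] * Q[(next_state, a)] for a in actions)
    return reward + gamma * (sigma * sampled + (1 - sigma) * expected) - q_t
```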

Summary

A constraint of temporal-difference learning: reward information must be obtainable within the (few) learning steps.

For example, can a reward be computed for every single move in chess? With the Monte Carlo method, which simulates all the way to the end of the game, a reward signal is certainly available.

References

Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction (2nd edition).
