◇【论文_20181020 v6】广义优势估计器 (generalized advantage estimator, GAE)

https://arxiv.org/abs/1506.02438

ICLR 2016
加州伯克利电子工程与计算机科学系

High-Dimensional Continuous Control Using Generalized Advantage Estimation

文章目录

摘要
1 引言
2 预备知识
3 优势函数估计
4 解释为奖励设计reward shaping
5 价值函数估计
6 实验
- 6.1 策略优化算法
- 6.2 实验设置
- - 6.2.1 网络架构
  - 6.2.2 任务细节
- 6.3 实验结果
- - 6.3.1 Cart-pole
  - 6.3.2 3D 双足运动
  - 6.3.3 其他 3D 机器人任务
7 讨论
致谢
A 常见问题
B 证明

摘要

Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. 【xx 是… (定义)，(优点)】
策略梯度方法在强化学习中是一种极好的方法，因为它们直接优化累积奖励，并且可以直接与非线性函数近似器(如神经网络)一起使用。
The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. 【当前的 2 个挑战】
两个主要的挑战是通常需要大量的样本，以及尽管输入数据具有非平稳性，但难以获得稳定持续的改进。
We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD( $\lambda$ ). 【本文相应的 2 个应对方法】
我们通过使用价值函数来解决第一个挑战，以一些偏差为代价大幅减少策略梯度估计的方差，并使用类似于 TD( $\lambda$ ) 的优势函数的指数加权估计器。
We address the second challenge by using trust region optimization procedure for both the policy and the value function, which are represented by neural networks.
我们通过对由神经网络表示的策略和价值函数使用信任域优化程序来解决第二个挑战。

Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. 【做了哪些实验来评估拟议解决方案】
我们的方法在极具挑战性的 3D 运动任务中产生了强有力的经验结果，学习双足和四足模拟机器人的跑步步态，以及学习让双足动物从开始躺在地上到站起来的策略。
In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. 【与其他类似方法相比，优点】
与之前使用手工制作策略表示的工作相比，我们的神经网络策略直接从原始运动学映射到关节扭矩。
Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.
我们的算法是完全无模型的model-free，3D 两足动物学习任务所需的模拟经验量对应于 1-2 周的现实时间。

1 引言

↓ 【拟处理的问题的基本背景】

The typical problem formulation in reinforcement learning is to maximize the expected total reward of a policy.
强化学习中典型的问题表述是最大化策略的总奖励期望。
A key source of difficulty is the long time delay between actions and their positive or negative effect on rewards; this issue is called the credit assignment problem in the reinforcement learning literature (Minsky, 1961; Sutton & Barto, 1998), and the distal reward problem in the behavioral literature (Hull, 1943).
困难的一个关键来源是动作与其对奖励的积极或消极影响之间的长时间延迟;
这个问题在强化学习文献中被称为信度分配问题credit assignment problem (Minsky, 1961;Sutton & Barto, 1998)，以及行为文献中的远端奖励问题distal reward problem (Hull, 1943)。
Value functions offer an elegant solution to the credit assignment problem—they allow us to estimate the goodness of an action before the delayed reward arrives.
价值函数为信度分配问题提供了一个优雅的解决方案——它们允许我们在延迟奖励到来之前估计一个动作的好坏。
Reinforcement learning algorithms make use of value functions in a variety of different ways; this paper considers algorithms that optimize a parameterized policy and use value functions to help estimate how the policy should be improved.
强化学习算法以各种不同的方式利用价值函数;
本文考虑优化参数化策略 并使用价值函数来帮助估计策略应该如何改进的算法。

↓ 【拟处理的问题：偏差-方差的 trade-off权衡问题】

When using a parameterized stochastic policy, it is possible to obtain an unbiased estimate of the gradient of the expected total returns (Williams, 1992; Sutton et al., 1999; Baxter & Bartlett, 2000); these noisy gradient estimates can be used in a stochastic gradient ascent algorithm.
当使用参数化随机策略时，有可能获得总回报的期望的梯度的无偏估计;
这些带噪声的梯度估计可用于随机梯度上升算法。
Unfortunately, the variance of the gradient estimator scales unfavorably with the time horizon, since the effect of an action is confounded with the effects of past and future actions.
不幸的是，梯度估计器的方差随着时间范围的变化而变化，因为一个动作的影响与 过去和未来动作的影响混杂。
Another class of policy gradient algorithms, called actor-critic methods, use a value function rather than the empirical returns, obtaining an estimator with lower variance at the cost of introducing bias (Konda & Tsitsiklis, 2003; Hafner & Riedmiller, 2011).
另一类策略梯度算法，称为行动者-评论员actor-critic 方法，使用价值函数而不是经验回报，以引入偏差为代价获得方差较低的估计值 (Konda & Tsitsiklis, 2003;Hafner & Riedmiller, 2011)。
But while high variance necessitates using more samples, bias is more pernicious—even with an unlimited number of samples, bias can cause the algorithm to fail to converge, or to converge to a poor solution that is not even a local optimum.
虽然高方差需要使用更多的样本，但偏差更有害——即使有无限数量的样本，偏差也会导致算法无法收敛，或者收敛到一个甚至不是局部最优的糟糕解。

行动者-评论员actor-critic 方法：使用价值函数获得 低方差，这样需要的样本少些。代价是：
偏差变大 ——> 算法无法收敛，或者收敛到一个甚至不是局部最优的糟糕解。

↓ 【我们提出了…(作用是)，关键的 idea】

We propose a family of policy gradient estimators that significantly reduce variance while maintaining a tolerable level of bias.
我们提出了一组策略梯度估计器，它们在保持可接受的偏差水平的同时显著减少方差。
We call this estimation scheme, parameterized by $\gamma \in [0, 1]$ and $\lambda \in [0, 1]$ , the generalized advantage estimator (GAE).
我们将这种由 $\gamma \in [0, 1]$ 和 $\lambda \in [0, 1]$ 参数化的估计方案称为 广义优势估计器 (generalized advantage estimator, GAE)。
Related methods have been proposed in the context of online actor-critic methods (Kimura & Kobayashi, 1998; Wawrzyński, 2009).
在在线 actor-critic 方法的背景下，已经提出了相关的方法。
We provide a more general analysis, which is applicable in both the online and batch settings, and discuss an interpretation of our method as an instance of reward shaping (Ng et al., 1999), where the approximate value function is used to shape the reward.
我们提供了一个更一般的分析，这适用于在线和批处理设置，并讨论了我们的方法作为奖励设计reward shaping 的一个实例的解释 (Ng et al., 1999)，其中近似的价值函数用于设计奖励。

↓ 【在哪些任务上进行了方法评估】

We present experimental results on a number of highly challenging 3D locomotion tasks, where we show that our approach can learn complex gaits using high-dimensional, general purpose neural network function approximators for both the policy and the value function, each with over 104 parameters.
我们在许多极具挑战性的 3D 运动任务上展示了实验结果，实验结果表明我们的方法可以使用高维、通用的神经网络函数近似器来学习复杂的步态，这些神经网络函数近似器用于策略和价值函数，每个函数的参数个数都超过 104 。
The policies perform torque-level control of simulated 3D robots with up to 33 state dimensions and 10 actuators.
该策略对具有多达 33 个状态维度和 10 个执行器的模拟 3D 机器人进行扭矩级控制。

↓ 【3 点贡献】

The contributions of this paper are summarized as follows:
本文的贡献总结如下:

We provide justification and intuition for an effective variance reduction scheme for policy gradients, which we call generalized advantage estimation (GAE). 【贡献 1：减小方差的方案 GAE】
我们为策略梯度的有效的方差减少方案提供了解释和直觉，这个方案我们称之为广义优势估计(generalized advantage estimation,GAE)。
While the formula has been proposed in prior work (Kimura & Kobayashi, 1998; Wawrzyński, 2009), our analysis is novel and enables GAE to be applied with a more general set of algorithms, including the batch trust-region algorithm we use for our experiments.
该公式已在先前的工作中被提出 (Kimura & Kobayashi, 1998;Wawrzyński, 2009)，我们的分析是新颖的，使 GAE 能够应用于一组更通用的算法，包括我们在实验中使用的批处理信任区域算法。
We propose the use of a trust region optimization method for the value function, which we find is a robust and efficient way to train neural network value functions with thousands of parameters. 【贡献 2：价值函数 + 信任域优化】
我们提出了对价值函数使用信任域优化方法，我们发现这是一种稳健且有效的方法来训练具有数千个参数的神经网络价值函数。
By combining (1) and (2) above, we obtain an algorithm that empirically is effective at learning neural network policies for challenging control tasks. 【贡献 3：高维连续控制算法】
通过结合上述 (1) 和 (2)，我们获得了一种用于学习具有挑战性的控制任务的神经网络策略的经验上有效的算法。
The results extend the state of the art in using reinforcement learning for high-dimensional continuous control.
结果扩展了在高维连续控制中使用强化学习的最先进水平。
Videos are available at https://sites.google.com/site/gaepapersupp.
视频链接

2 预备知识

我们考虑策略优化问题的一种无折扣表述。
初始状态 $s_0$ 是从分布 $\rho_0$ 中采样的。
根据策略 $a_t \sim\pi (a_t | s_t)$ 采样动作，根据动态 $s_{t+1}\sim P(s_{t+1}| s_t, a_t)$ 采样状态，直至达到终止(吸收)状态terminal (absorbing) state，得到轨迹trajectory $(s_0, a_0, s_1, a_1,\cdots)$ 。
奖励reward $r_t = r(s_t, a_t, s_{t+1})$ 是在每个时间步收到的。
目标是最大化总奖励 $\sum\limits_{t=0}^\infty r_t$ 的期望，假设对所有策略都是有限的。
请注意，我们没有将折扣作为问题说明的一部分;
它将在下面显示为调整偏差-方差权衡的算法参数。
但折扣问题(最大化 $\sum\limits_{t=0}^\infty \gamma^t r_t$ )可以作为无折扣问题的一个实例来处理，其中我们将折扣因子吸收到奖励函数中，使其与时间相关。

策略梯度方法通过反复估计梯度 $g:=\nabla_\theta{\mathbb E}\Big[\sum\limits_{t=0}^\infty r_t\Big]$ 来最大化总奖励的期望。
策略梯度有几个不同的相关表达式，其形式为：

$g={\mathbb E}\Big[\sum\limits_{t=0}^\infty\Psi_t\nabla_\theta\log\pi_\theta(a_t|s_t)\Big]~~~~~~~~~~(1)$

其中 $\Psi_t$ 可以是以下的其中一个：

1. $~~~\sum\limits_{t=0}^\infty r_t~~~$ 轨迹的总奖励

2. $~~~\sum\limits_{\textcolor{blue}{t^\prime=t}}^\infty r_{t^\prime}~~~$ 动作 $a_t$ 之后的奖励

3. $~~~\sum\limits_{t^\prime=t}^\infty r_{t^\prime}\textcolor{blue}{-b(s_t)}~~~$ 之前形式的基线版本这个基线只根据状态 $s_t$ 计算一次，后续时间步不再计算

4. $~~~Q^\pi(s_t, a_t)~~~$ 状态-动作价值函数state-action value function

$~~~~~~Q^\pi(s_t,a_t):={\mathbb E}_{s_{t+1:\infty},~a_{t+1:\infty}}\Big[\sum\limits_{l=0}^\infty r_{t+l}\Big]$

5. $~~~A^\pi(s_t, a_t)~~~$ 优势函数advantage function

$~~~~~~A^\pi(s_t, a_t):=Q^\pi(s_t,a_t)\textcolor{blue}{-V^\pi(s_t)}=Q^\pi(s_t,a_t)-{\mathbb E}_{s_{t+1:\infty},~a_{\textcolor{blue}{t}:\infty}}\Big[\sum\limits_{l=0}^\infty r_{t+l}\Big]$

6. $~~~r_t+V^\pi(s_{t+1})-V^\pi(s_t)~~~$ TD残差 TD residual ( temporal-difference 时序差分)

这里， $\mathbb E$ 的下标列举了被积分的变量，其中状态和动作分别从动态模型 $P(s_{t+1} | s_t, a_t)$ 和策略 $\pi(a_t|s_t)$ 中依次采样。
冒号 $a : b$ 表示包含范围 $1,\cdots,b)$ 。
这些公式是众所周知的，而且很容易得到;
它们直接来自命题 1，稍后会说明。

选择 $\Psi_t = A^\pi(s_t, a_t)$ 产生的方差几乎是最小的，尽管在实践中，优势函数是未知的，必须进行估计。
这种说法可以通过以下对策略梯度的解释直观地证明：在策略梯度方向上的一步应该会增加优于平均水平的动作actions 的概率，并降低低于平均水平的动作actions 的概率。
根据优势函数 的定义 $A^\pi (s, a) = Q^\pi (s, a) - V^\pi(s)$ ，衡量该动作是比策略的默认行为更好还是更差。
因此，我们令 $Ψ_t$ 为优势函数 $^\pi (s_t, a_t)$ ，使得当且仅当 $A^\pi (s_t, a_t) > 0$ 时，梯度项 $Ψ_t\nabla_\theta \log π_\theta(a_t | s_t)$ 指向 $π (a_t | s_t)$ 增大的方向。
参见 Greensmith 等人(2004) 对策略梯度估计器的方差和使用基线的影响进行的更严谨的分析。

我们将引入一个参数 $\gamma$ ，它让我们通过降低与延迟效应相对应的奖励的权重来减小方差，代价是引入偏差。
该参数对应于 MDPs 折扣公式中使用的折扣因子，但我们将其视为无折扣问题中的方差减小参数;
Marbach & Tsitsiklis (2003); Kakade (2001b); Thomas (2014) 对该技术进行了理论分析;
折扣价值函数由下式给出:

$V^{\pi,\gamma}(s_t):={\mathbb E}_{s_{t+1:\infty},~a_{t:\infty}}\Big[\sum\limits_{l=0}^\infty \textcolor{blue}{\gamma^l} r_{t+l}\Big]$

$Q^{\pi,\gamma}(s_t,a_t):={\mathbb E}_{s_{t+1:\infty},~a_{t+1:\infty}}\Big[\sum\limits_{l=0}^\infty \textcolor{blue}{\gamma^l} r_{t+l}\Big]~~~~~~~~~~(4)$

$A^{\pi,\gamma}(s_t,a_t):=Q^{\pi,\gamma}(s_t,a_t)-V^{\pi,\gamma}(s_t)~~~~~~~~~~(5)$

策略梯度的折扣近似定义如下:

$g^\gamma:={\mathbb E}_{s_{0:\infty},~a_{0:\infty}}\Big[\sum\limits_{t=0}^\infty A^{\pi,\gamma}(s_t,a_t)\nabla_\theta\log\pi_\theta(a_t|s_t)\Big]~~~~~~~~~~(6)$

下一节讨论如何获得 $A^{\pi,\gamma}$ 的有偏(但不是太偏)估计量，给出式 (6) 中折扣策略梯度的有噪声估计。

在继续之前，我们将引入优势函数的仅 $\gamma$ 估计器 $\gamma$ -just estimator 的概念，这是一个估计器，当我们用它代替式 (6) 中的 $A^{\pi,\gamma}$ (未知且必须估计) 来估计 $g^\gamma$ 时，它不会引入偏差。
考虑一个优势估计 $Â_t(s_{0:\infty}, a_{0:\infty})$ ，它通常可能是整个轨迹的函数。

脚注：请注意，我们已经通过使用 $A^{\pi,\gamma}$ 来代替 $A^\pi$ 引入了偏差;
这里我们关心的是获得 $g^\gamma$ 的无偏估计， $g^\gamma$ 是无折扣 MDP 的策略梯度的有偏估计。

命题 1：假设 $\hat A_t$ 可以写成 $Â_t(s_{0:\infty}, a_{0:\infty})= Q_t(s_{t:\infty}, a_{t:\infty})-b_t(s_{t:\infty}, a_{t:\infty})$ 使得对于所有 $s_t, a_t)$ ，有 ${\mathbb E}_{s_{t+1:\infty},a_{t+1:\infty}|s_t,a_t}[Q_t(s_{t:\infty}, a_{t:\infty})] = Q^{\pi,\gamma}(s_t, a_t)$ 。那么 $\hat{A}$ 是 $\gamma$ -just (仅 $\gamma$ 的)。

证明如附录 B 所示。
很容易验证以下表达式是 $Â_t$ 的 $\gamma$ -just 优势估计量:

$\sum\limits_{l=0}^\infty\gamma^lt_{t+l}$
$A^{\pi,\gamma}(s_t,a_t)$
$Q^{\pi,\gamma}(s_t,a_t)$
$r_t+\gamma V^{\pi,\gamma}(s_{t+1})-V^{\pi,\gamma}(s_t)$

3 优势函数估计

本节将关注产生折扣优势函数 $A^{\pi,\gamma}(s_t, a_t)$ 的准确估计 $\hat A_t$ ，然后用于构造以下形式的策略梯度估计器:

$\hat g=\frac{1}{N}\sum\limits_{n=1}^N\sum\limits_{t=0}^\infty\hat A_t^n\nabla _\theta\log \pi_\theta(a_t^n|s_t^n)~~~~~~~~~~(9)$

其中 $n$ 表示回合batch 的索引。

设 $V$ 是一个近似的价值函数。
定义 $\Delta _t^V=r_t +\gamma V(s_{t+1})-V(s_t)$ ，即折扣 $\gamma$ 的 $V$ 的 TD误差(Sutton & Barto, 1998)。
注意 $\Delta _t^V$ 可以被认为是对动作 $a_t$ 的优势的估计。
事实上，如果我们有正确的价值函数 $=V^{\pi,\gamma}$ ，那么它是一个 $\gamma$ -just 优势估计器 $\gamma$ -just advantage estimator，且实际上是 $A^{\pi,\gamma}$ 的无偏估计量:

$\begin{aligned}{\mathbb E}_{s_{t+1}}[\Delta_t^{V^{\pi,\gamma}}]&={\mathbb E}_{s_{t+1}}[\textcolor{blue}{r_t+\gamma V^{\pi,\gamma}(s_{t+1})}-V^{\pi, \gamma}(s_t)]\\ &={\mathbb E}_{s_{t+1}}[\textcolor{blue}{Q^{\pi,\gamma}(s_t,a_t)}-V^{\pi, \gamma}(s_t)]\\ &=A^{\pi,\gamma}(s_t,a_t)~~~~~~~~~~(10)\end{aligned}$

然而，该估计量仅对于 $V^{\pi, \gamma}$ 是 $\gamma$ -just，否则将产生有偏的策略梯度估计。

接下来，让我们考虑取前 $k$ 个 $\Delta$ 项的和，我们用 $\hat A_t^{(k)}$ 表示。

$\hat A_t^{(1)}:=\Delta_t^V=-V(s_t)+r_t +\gamma V(s_{t+1})$

$\begin{aligned}\hat A_t^{(2)}&:=\Delta_t^V+\gamma\Delta_{t+1}^V\\ &=-V(s_t)+r_t +\gamma V(s_{t+1}) + \gamma [-V(s_{t+1})+r_{t+1} +\gamma V(s_{t+2})]\\ &=-V(s_t)+r_t+\gamma r_{t+1}+\gamma^2V(s_{t+2})\end{aligned}$

$\begin{aligned}\hat A_t^{(3)}&:=\Delta_t^V+\gamma\Delta_{t+1}^V+\gamma^2\Delta_{t+2}^V\\ &=-V(s_t)+r_t+\gamma r_{t+1}+\gamma^2V(s_{t+2})+\gamma^2[-V(s_{t+2})+r_{t+2} +\gamma V(s_{t+3})]\\ &=-V(s_t)+r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\gamma^3V(s_{t+3})\end{aligned}$
$~~~~~~~~~~\vdots$
$\hat A_t^{(k)}:=\sum\limits_{l=0}^{k-1}\gamma^l\Delta_{t+1}^V=\textcolor{blue}{-V(s_t)}+r_t+\gamma r_{t+1}+\cdots+\gamma^{\textcolor{blue}{k-1}}r_{t+\textcolor{blue}{k-1}}+\textcolor{blue}{\gamma^kV(s_{t+k})}~~~~~~~~~~(14)$

这些方程是由一个可伸缩和得出的，我们看到 $\hat A_t^{(k)}$ 涉及对回报的 $k$ 步估计，减去一个基线项 $V(s_t)$ 。
与 $\Delta_t^V=\hat A_t^{(1)}$ 的情况类似，我们可以认为 $\hat A_t^{(k)}$ 是优势函数的一个估计量，仅当 $V^{\pi, \gamma}$ 时它才是 $\gamma$ -just。
然而，请注意，当 $k→\infty$ 时，偏置通常会变小，因为 $\gamma^kV(s_{t+k})$ 项的大打折扣，而 $V(s_t)$ 项不会影响偏置。
取 $k→\infty$ ，我们得到

$\hat A_t^{(\infty)}:=\sum\limits_{l=0}^\infty\gamma^l\Delta_{t+1}^V=-V(s_t)+\sum\limits_{l=0}^\infty\gamma^l\Delta_{t+l}~~~~~~~~~~(15)$

也就是 经验回报减去价值函数基线。

广义优势估计器 $\text{GAE}(\gamma, \lambda)$ 定义为这些 $k$ 步估计器的指数加权平均:

$\begin{aligned}\hat A_t^{\text{GAE}(\gamma, \lambda)}&:=(1-\lambda)\Big(\hat A_t^{(1)}+\lambda \hat A_t^{(2)}+\lambda^2 \hat A_t^{(3)}+\cdots\Big)\\ &=(1-\lambda)(\Delta_t^V+\lambda(\Delta_t^V+\gamma \Delta_{t+1}^V)+\lambda^2(\Delta_t^V+\gamma \Delta_{t+1}^V+\gamma^2 \Delta_{t+2}^V)+\cdots)\\ &=(1-\lambda)(\Delta_t^V(1+\lambda + \lambda^2+\cdots)+\\ &~~~~~~~~~~~~~~~~~~\gamma \Delta_{t+1}^V(\lambda+\lambda^2+\lambda^3+\cdots)+\\ &~~~~~~~~~~~~~~~~~~\gamma^2\Delta_{t+2}^V(\lambda^2+\lambda^3+\lambda^4+\cdots)+\cdots)\\ &=(1-\lambda)\Big(\Delta_t^V(\frac{1}{1-\lambda}+\gamma\Delta_{t+1}^V(\frac{\lambda}{1-\lambda})+\gamma^2\Delta_{t+2}^V(\frac{\lambda^2}{1-\lambda})+\cdots\Big)\\ &=\sum\limits_{l=0}^\infty(\gamma\lambda)^l\Delta_{t+l}^V~~~~~~~~~~(16)\end{aligned}$

从式 (16) 中，我们看到优势估计器有一个非常简单的公式，涉及Bellman 残差项的折扣和。第 4 节讨论了对该公式的解释，即带有改动奖励函数的 MDP 中的回报。
我们上面使用的结构与用于定义 TD( $\lambda$ ) 的结构非常相似 (Sutton & Barto, 1998)，但是 TD( $\lambda$ ) 是价值函数的估计器，而这里我们是在估计优势函数。

这个公式有两个值得注意的特例，分别是令 $\lambda= 0$ 和 $\lambda= 1$ 。

$\text{GAE}(\gamma, 0):~~\hat A_t:=\Delta_t~~~~~~~~~~~~=r_t+\gamma V(s_{t+1})-V(s_t)~~~~~~~~~~(17)$

$\text{GAE}(\gamma, 1):~~\hat A_t:=\sum\limits_{l=0}^\infty\gamma^l\Delta_{t+l}=\sum\limits_{l=0}^\infty\gamma^lr_{t+l}-V(s_t)~~~~~~~~~~(18)~~~~$ 方差大，需要更多样本

不考虑 $V$ 的准确性， $\text{GAE}(\gamma, 1)$ 是 $\gamma$ -just，但由于项的和，它有很高的方差。
$\text{GAE}(\gamma, 0)$ 仅当 $V=V^{\pi, \gamma}$ 时是 $\gamma$ -just，否则会引起偏差，但它的方差通常要小得多。
$0<\lambda<1$ 的广义优势估计器是偏差和方差之间的折衷，由参数 $λ$ 控制。

我们描述了一个具有两个独立参数 $\gamma$ 和 $\lambda$ 的优势估计器，当使用近似价值函数时，这两个参数都有助于偏差-方差权衡。
然而，它们服务于不同的目的，并且在不同的值范围内效果最好。
$\gamma$ 最重要的是决定了价值函数 $V^{\pi,\gamma}$ 的范围，它不依赖于 $\lambda$ 。
当 $\gamma < 1$ 时，无论价值函数的精度如何，都会在策略梯度估计中引入偏差。
另一方面， $\lambda< 1$ 仅在价值函数不准确时才会引入偏差。
根据经验，我们发现 $\lambda$ 的最佳值远低于 $\gamma$ 的最佳值，这可能是因为对于一个相当准确的价值函数， $\lambda$ 引入的偏差远小于 $\gamma$ 。

利用广义优势估计器，我们可以从式 (6) 中构造折扣策略梯度 $g^\gamma$ 的有偏估计量:

$g^\gamma\approx{\mathbb E}\Big[\sum\limits_{t=0}^\infty\nabla_\theta\log \pi_\theta(a_t|s_t)\hat A_t^{\text{GAE}(\gamma,\lambda)}\Big]={\mathbb E}\Big[\sum\limits_{t=0}^\infty\nabla_\theta\log \pi_\theta(a_t|s_t)\sum\limits_{l=0}^\infty(\gamma\lambda)^l\Delta_{t+l}^V\Big]~~~~~~~~~~(19)$

当 $\lambda= 1$ 时，等式成立。

4 解释为奖励设计reward shaping

在本节中，我们将讨论如何将 $\lambda$ 解释为在 MDP上执行奖励设计转换后应用的额外折扣因子。
我们还引入了响应函数的概念，以帮助理解 $\gamma$ 和 $\lambda$ 引入的偏差。

奖励设计 (Ng et al.,1999) 是指对 MDP 的奖励函数进行如下变换:设 $\Phi:{\cal S}→{\mathbb R}$ 为状态空间上的任意标量价值函数，将变换后的奖励函数 $\widetilde r$ 定义为：

$\widetilde r(s,a,s^\prime)=r(s,a,s^\prime)+\gamma\Phi(s^\prime)-\Phi(s)~~~~~~~~~~(20)$

反过来定义转换后的 MDP。
该变换使折扣优势函数 $A^{\pi,\gamma}$ 对任何策略 $π$ 都不变。
为了理解这一点，考虑从状态 $s_t$ 开始的轨迹的奖励折扣和:

$\sum\limits_{l=0}^\infty\gamma^l\widetilde r(s_{t+l},a_t,s_{t+l+1})=\sum\limits_{l=0}^\infty\gamma^l\textcolor{blue}{r}(s_{t+l},a_{t+\textcolor{blue}{l}},s_{t+l+1})\textcolor{blue}{-\Phi(s_t)}~~~~~~~~~~(21)$

令 $\widetilde Q^{\pi,\gamma}, \widetilde V^{\pi,\gamma}, \widetilde A^{\pi,\gamma}$ 为变换后的 MDP 的价值函数和优势函数，由这些量的定义可得：

$\widetilde Q^{\pi,\gamma}(s,a)=Q^{\pi,\gamma}(s,a)-\Phi(s)~~~~~~~~~~(22)$

$\widetilde V^{\pi,\gamma}(s,a)=V^{\pi,\gamma}(s)-\Phi(s)~~~~~~~~~~(23)$

$\widetilde A^{\pi,\gamma}(s,a)=\Big(Q^{\pi,\gamma}(s,a)-\Phi(s)\Big)-\Big(V^{\pi,\gamma}(s)-\Phi(s)\Big)=A^{\pi,\gamma}(s,a)~~~~~~~~~~(24)$

注意，如果 $Φ$ 恰好是来自原始 MDP 的状态-价值函数state-value function $V^{\pi,\gamma}$ ，那么变换后的 MDP 具有有趣的性质， $\widetilde V^{\pi,\gamma}(s)$ 在每个状态下都为零。

注意 (Ng et al.,1999) 表明，当我们的目标是最大化奖励的折扣总和 $\sum\limits_{t=0}^\infty \gamma^tr(s_t, a_t, s_{t+1})$ 时，奖励设计变换使策略梯度和最优策略保持不变。
相比之下，本文关注的是最大化无折扣奖励和，其中折扣 $\gamma$ 作为方差减小参数。

在回顾了奖励设计reward shaping 的idea 之后，让我们考虑如何使用它来获得策略梯度估计。
最自然的方法是构造使用设计的奖励 $\widetilde r$ 折扣和的策略梯度估计器。
然而，式 (21) 表明，我们得到了原始 MDP 奖励 $r$ 减去基线项的折扣和。
接下来，让我们考虑使用“更陡峭”的折扣 $\gamma λ$ ，其中 $\leq\lambda\leq1$ 。
很容易看出，设计的奖励 $\widetilde r$ 等于 Bellman残差项 $\Delta ^V$ ，在第 3 节中，我们令 $\Phi=V$ 。
令 $\Phi=V$ ，我们得到

$\sum\limits_{l=0}^\infty(\gamma\lambda)^l\widetilde r(s_{t+l},a_t,s_{t+l+1})=\sum\limits_{l=0}^\infty(\gamma\lambda)^l\Delta_{t+l}^V=\hat A_t^{\text{GAE}(\gamma,\lambda)}~~~~~~~~~~(25)$

因此，通过考虑设计的奖励的 $\gamma\lambda$ 折扣和，我们正得到了第 3 节的广义优势估计器。
如前所示， $\lambda= 1$ 给出 $g^\gamma$ 的无偏估计，而 $\lambda <1$ 给出有偏估计。

为了进一步分析这种设计变换以及参数 $\gamma$ 和 $\lambda$ 的影响，引入响应函数 $\chi$ 的概念是有用的，我们将其定义为:

$\chi(l;s_t,a_t)={\mathbb E}[r_{t+1}|s_t,a_t]-{\mathbb E}[r_{t+1}|s_t]~~~~~~~~~~(26)$

注意， $A^{\pi,\gamma}(s, a) = \sum\limits_{l=0}^\infty \gamma^l\chi(l;s, a)$ ，因此响应函数跨时间步分解优势函数。
响应函数让我们量化了时序信度分配问题：动作和奖励之间的长期依赖关系对应于 $l\gg0$ 时的响应函数的非零值。

$\chi$ $~~~\chi$

接下来，让我们回顾一下折扣因子 $\gamma$ 和我们使用 $A^{\pi,\gamma}$ 而不是 $A^{\pi,1}$ 所做的近似。
式 (6) 中的折扣策略梯度估计器具有如下形式的项和

$\nabla_\theta\log \pi_\theta(a_t|s_t)\textcolor{blue}{A^{\pi,\gamma}(s_t,a_t)}=\nabla_\theta\log \pi_\theta(a_t|s_t)\textcolor{blue}{\sum\limits_{l=0}^\infty\gamma^l\chi(l;s_t,a_t)}~~~~~~~~~~(27)$

使用折扣 $\gamma < 1$ 对应于去掉 $l\gg\frac{1}{1-\gamma}$ 的项。
因此，如果 $\chi$ 随着 $l$ 的增加而迅速衰减，即如果在 $≈\frac{1}{1-\gamma}$ 时间步之后，动作对奖励的影响被“遗忘”，则由这种近似引入的误差将很小。

如果使用 $V^{\pi, \gamma}$ 获得奖励函数 $\widetilde r$ ，则当 $l > 0$ 时，我们将得到 ${\mathbb E}[\widetilde r_{t+l}| s_t, a_t] = {\mathbb E}[\widetilde r_{t+l}| s_t] = 0$ ，即响应函数仅在 $l = 0$ 时是非零的。
因此，这种设计转换将把暂时延长的响应转变为即时响应。
假设 $V^{\pi,\gamma}$ 完全减小了响应函数的时序扩散，我们可以期望一个好的近似 $V\approx V^{\pi,\gamma}$ 部分减小它。
这一观察结果表明了对式 (16) 的解释：使用 $V$ 重新设计奖励以缩小响应函数的时序范围，然后引入“更陡峭”的折扣 $\gammaλ$ 来切断长延迟引起的噪声，即当 $l\gg\frac{1}{1-\gamma \lambda}$ 忽略项 $\nabla_\theta \log π_\theta(a_t | s_t)\Delta_{t+l}^V$ 。

5 价值函数估计

↓ 【估计价值函数的蒙特卡洛Monte Carlo 或 TD(1) 方法】

可以使用各种不同的方法来估计价值函数 (例如，参见Bertsekas(2012))。
当使用非线性函数近似器表示价值函数时，最简单的方法是求解非线性回归问题:

$\underset{\phi}{\text{minimize}}\sum\limits_{n=1}^N \Vert V_\phi(s_n)-\hat V_n\Vert^2~~~~~~~~~~(28)$

其中 $\hat V_t = \sum\limits_{l=0}^\infty \gamma^lr_{t+l}$ 是奖励的折扣和，
在一批轨迹的所有时间步的索引为 $n$ 。共 N 条轨迹
这有时被称为估计价值函数的蒙特卡洛Monte Carlo 或 TD(1) 方法(Sutton & Barto, 1998)

脚注：另一个自然的选择是使用基于 TD( $\lambda$ )备份backup 的估计器来计算目标价值(Bertsekas,2012;Sutton & Barto, 1998)，反映了我们用于策略梯度估计的表达式： $\hat V_t^\lambda = V_{\phi_\text{old}} (s_n)+ \sum\limits_{l=0}^\infty(\gamma \lambda)^l\Delta_{t+l}$ 。
当我们尝试这种选择时，我们没有看到与式 (28) 中的 $λ = 1$ 的估计器在性能上的差异。

对于本工作中的实验，我们使用信任域方法在批量优化过程的每次迭代中优化价值函数。
信任区域帮助我们避免了对最近一批数据的过拟合。
为了构造信任域问题，我们首先计算 $\sigma^2=\frac{1}{N}\Vert V_\phi (s_n)-\hat V_n\Vert^2$ ，其中 $\phi_{\text{old}}$ 为优化前的参数向量。
然后求解如下约束优化问题:

$\underset{\phi}{\text{minimize}}\sum\limits_{n=1}^N \Vert V_\phi(s_n)-\hat V_n\Vert^2~~~~$ 式 (28)

$\text{subject to}~~~\frac{1}{N}\sum\limits_{n=1}^N\frac{ \Vert V_\phi(s_n)-\hat V_n\Vert^2}{2\sigma^2}\leq \epsilon~~~~~~~~~~(29)$

这个约束相当于约束前一个价值函数 和 新的价值函数之间的平均 KL 散度小于 $\epsilon$ ，其中价值函数被用来参数化一个均值为 $V_\phi(s)$ ，方差为 $\sigma^2$ 的条件高斯分布。

我们使用共轭梯度算法计算信任域问题的近似解 (Wright & Nocedal, 1999)。
具体来说，我们是在解二次方程

$\underset{\phi}{\text{minimize}}~~~g^T(\phi-\phi_\text{old})~~~~~~~~~~(29)$

$\text{object to}~~~\frac{1}{N}\sum\limits_{n=1}^N({\phi-\phi_\text{old}})^TH({\phi-\phi_\text{old}})\leq \epsilon~~~~~~~~~~(30)$

其中 $g$ 为目标函数的梯度，且 $\frac{1}{N}\sum\limits_{n}j_nj_n^T$ ，其中 $j_n = \nabla_\phi V(s_n)$ 。
注意， $H$ 是目标的 Hessian 的“高斯-牛顿”近似，当将价值函数解释为条件概率分布时，它是 Fisher 信息矩阵(最高为 $\sigma^2$ 因子)。
利用矩阵向量积 $v \to H v$ 实现共轭梯度算法，计算出 step 方向 $s\approx H^{-1}g$ 。
然后我们重新调整 $s→\alpha s$ ，使 $\frac{1}{2}(\alpha s)^TH(\alpha s) = \epsilon$ ，并令 $\phi = \phi_\text{old} + \alpha s$ 。
此过程类似于我们用于更新策略的过程，该过程将在第 6 节中进一步描述，并基于 Schulman 等人(2015) 的工作。

6 实验

我们设计了一组实验来调查以下问题:

在使用广义优势估计优化回合式总奖励时，改变 $\lambda \in[0,1]$ 和 $\gamma \in[0,1]$ 的经验效应是什么?
广义优势估计，以及策略和价值函数优化的信任域算法，是否可以用于优化大型神经网络策略以解决具有挑战性的控制问题?

6.1 策略优化算法

虽然广义优势估计可以与各种不同的策略梯度方法一起使用，但对于这些实验，我们使用信任域策略优化( trust region policy optimization, TRPO) 进行策略更新(Schulman et al.， 2015)。
每次迭代，TRPO 通过近似求解以下约束优化问题来更新策略:

$\underset{\theta}{\text{minimize}} ~L_{\theta_\text{old}}(\theta)$

$\text{subject to}~\overline D_\text{KL}^{\theta_\text{old}}(\pi_{\theta_\text{old}},\pi_\theta)\leq\epsilon$

其中 $L_{\theta_\text{old}}(\theta)=\frac{1}{N}\sum\limits_{n=1}^N\frac{\pi_\theta(a_n|s_n)}{\pi_{\theta_\text{old}}(a_n|s_n)}\hat A_n$

$~~~~~~~\overline D_\text{KL}^{\theta_\text{old}}(\pi_{\theta_\text{old}},\pi_\theta)=\frac{1}{N}\sum\limits_{n=1}^ND_\text{KL}\Big(\textcolor{blue}{\pi_{\theta_\text{old}}}(·|s_n)\Big\Vert \textcolor{blue}{\pi_\theta}(·|s_n)\Big)~~~~~~~~~~~~~(31)$

如(Schulman et al.， 2015)所述，我们通过对目标函数进行线性化并对约束进行二次化来近似地解决这个问题，从而得到在 $\theta_\text{old}\propto-F^{-1}g$ 方向上的一个 step，其中 $F$ 是平均 Fisher 信息矩阵， $g$ 是策略梯度估计。
这种策略更新产生与自然策略梯度 (Kakade, 2001a) 和自然 actor-critic (Peters & Schaal, 2008)相同的步长方向，但它使用不同的步长确定方案和计算步长的数值过程。

为完整起见，迭代更新 策略和价值函数的整个算法如下:

在这里插入图片描述

初始化策略的参数 $\theta_0$ 和价值函数的参数 $\phi_0$
${\bf for~} i=0,1,2,\cdots,~{\bf do}$
模拟当前策略 $\pi_{\theta_i}$ 直到获得 $N$ 个时间步
使用 $V=V_{\phi_i}$ ，在所有时间步 $t\in \{1,2,\cdots,N\}$ 计算 $\Delta_t^V~~~~~~~~~~~~~~~~~~\textcolor{blue}{\Delta_t^V=-V(s_t)+r_t +\gamma V(s_{t+1})}$
在所有时间步计算 $\hat A_t=\sum\limits_{l=0}^\infty(\gamma \lambda)^l\textcolor{blue}{\Delta_{t+l}^V}$
计算 $\theta_{i+1}$ 。式 (31) ↓
计算 $\phi_{i+1}$ 。式 (30) ↓
${\bf end ~for}$

在这里插入图片描述

注意，策略更新 $\theta_i→θ_{i+1}$ 是使用价值函数 $V_{\phi_i}$ (用于优势估计) 执行的，而不是使用 $V_{\phi_{i+1}}$ 。
如果我们先更新价值函数，就会引入额外的偏差。
为了了解这一点，考虑极端情况，我们过拟合价值函数，并且Bellman 残差 $r_t + \gamma V(s_{t+1}) - V(s_t)$ 在所有时间步长都变为零——策略梯度估计值将为零。

6.2 实验设置

我们在以下任务评估了我们的方法：经典的推车-杆平衡问题，以及几个具有挑战性的 3D 运动任务：
(1) 两足运动;
(2) 四足运动;
(3) 动态站起来，对于两足动物来说，它开始时是仰卧的。
模型如图 1 所示。

在这里插入图片描述

6.2.1 网络架构

我们对所有的 3D 机器人任务使用了相同的神经网络架构，这是一个有 3 个隐藏层的前馈网络，分别有100, 50 和 25 个 tanh 单元。
策略和价值函数使用了相同的神经网络架构。
最后的输出层是线性激活的。
价值函数估计器使用相同的架构，但只有一个标量输出。
对于更简单的 cart-pole 任务，我们使用了线性策略和一个包含 20 个单元的隐藏层的神经网络作为价值函数。

6.2.2 任务细节

对于 cart-pole 平衡任务，我们使用 Barto等人(1983)的物理参数，每批收集 20 个轨迹，最大长度为 1000 个时间步长。

模拟机器人任务使用 MuJoCo 物理引擎进行模拟 (Todorov et al.， 2012)。
人形机器人模型有 33 个状态维度和 10 个驱动自由度，四足机器人模型有 29 个状态维度和 8 个驱动自由度。
这些任务的初始状态由以参考配置为中心的均匀分布组成。
对于两足运动，我们每批使用 50000 个时间步长50000 timesteps per batch，对于四足运动和两足站立，我们每批使用 200000 个时间步长 200000 timesteps per batch。
如果机器人事先没有达到终止状态，每一个回合将在 2000 个时间步后终止。
时间步长为 0.01 秒。

奖励函数如下表所示。

在这里插入图片描述

在运动任务中，如果 actor 的质心低于预先设定的高度(两足动物为 0.8 米，四足动物为 0.2 米)，则该回合终止。
奖励函数中的常数偏移量鼓励更长的回合；否则，二次奖励项可能会导致尽快结束回合的策略。

6.3 实验结果

所有结果都以损失表示，cost 被定义为负奖励，并被最小化。
习得的策略的视频可以在 https://sites.google.com/site/ gaepapersupp 上找到。
在图中，“No VF ”意味着我们使用了不依赖于状态的时间相关基线，而不是对状态价值函数的估计。
时间相关的基线是通过在批次的轨迹上平均每个时间步的回报来计算的。

6.3.1 Cart-pole

结果是用不同随机种子进行的 21 次实验的平均值。
结果如图 2 所示，在参数 $γ\in [0.96, 0.99]$ 和 $λ\in[0.92,0.99]$ 的中间值处获得最佳效果。

在这里插入图片描述

图 2：左：推车杆cart-pole 任务的学习曲线，在 $\gamma = 0.99$ 时使用不同 $\lambda$ 值的广义优势估计。
通过 $\lambda$ 在 [0.92,0.98] 范围内的中间值获得最快的策略改进。
右：在不同的 $\gamma$ 和 $\lambda$ 下，策略优化 20 次迭代后的性能。白色意味着更高的奖励。在两者的中间值处得到最佳结果。

6.3.2 3D 双足运动

每次试验在 16 核机器上运行大约需要 2 小时，其中并行化了模拟部署，以及优化策略和价值函数时使用的函数、梯度和矩阵向量乘积评估。
这里，使用不同随机种子的 9 个试验的平均结果。
使用中间值 $\gamma \in [0.99, 0.995]，\lambda \in[0.96,0.99]$ 再次获得最佳性能。
1000 次迭代后的结果是快速、平滑、稳定的步态，实际上是完全稳定的。
我们可以计算在这个学习过程中使用了多少“实时real”时间：

因此，如果有一种方法可以重置机器人的状态并确保它不会损坏自己，那么这个算法可以在一个真实的机器人上运行，或者在多个并行学习的真实机器人上运行。

在这里插入图片描述

图 3:
左： 3D 双足运动的学习曲线，算法在 9 次运行中平均。
右： 3D 四足运动的学习曲线，对 5 次运行进行平均。

6.3.3 其他 3D 机器人任务

考虑的另外两种运动行为是四足运动和 3D 双足动物离开地面。
同样，我们在每个实验条件下执行了 5 次试验，使用不同的随机种子(和初始化)。
在一台 32 核的机器上，每次试验大约需要 4 个小时。
我们对这些领域进行了更有限的比较(由于运行这些实验需要大量的计算资源)，固定 $\gamma = 0.995$ ，但变化 $\lambda\in\{1,0.96\}$ ，以及没有价值函数的实验条件。
对于四足运动，使用 6.3.2 节中 $\gamma=0.96$ 的价值函数获得最佳结果。
对于 3D 站立，价值函数总是有帮助的，但对于 $\gamma= 0.96$ 和 $\lambda= 1$ ，结果大致相同。

在这里插入图片描述

图 4:
(a) 四足行走学习曲线，
(b) 3D 站立学习曲线，
(c) 3D 站立片段。

7 讨论

Policy gradient methods provide a way to reduce reinforcement learning to stochastic gradient descent, by providing unbiased gradient estimates.
策略梯度方法通过提供无偏梯度估计，提供了一种将强化学习转成随机梯度下降的方法。
However, so far their success at solving difficult control problems has been limited, largely due to their high sample complexity.
然而，到目前为止，它们在解决困难的控制问题上的成功是有限的，很大程度上是由于他们的高样本复杂度。
We have argued that the key to variance reduction is to obtain good estimates of the advantage function.
我们认为，减小方差的关键是获得优势函数的良好估计。

We have provided an intuitive but informal analysis of the problem of advantage function estimation, and justified the generalized advantage estimator, which has two parameters $\gamma, \lambda$ which adjust the bias-variance tradeoff.
我们对优势函数估计问题提供了一个直观但非正式的分析，并证明了广义优势估计器，它有两个参数 $\gamma, \lambda$ 来调节偏差-方差权衡。
We described how to combine this idea with trust region policy optimization and a trust region algorithm that optimizes a value function, both represented by neural networks.
我们描述了如何将这一思想与信任域策略优化和优化价值函数的信任域算法结合起来，这两种算法都由神经网络表示。
Combining these techniques, we are able to learn to solve difficult control tasks that have previously been out of reach for generic reinforcement learning methods.
结合这些技术，我们能够学习解决以前通用强化学习方法无法实现的困难控制任务。

↓ 【后续工作展望 1：价值函数估计误差与策略梯度估计误差之间的关系。】

One question that merits future investigation is the relationship between value function estimation error and policy gradient estimation error.
值得进一步研究的一个问题是价值函数估计误差与策略梯度估计误差之间的关系。
If this relationship were known, we could choose an error metric for value function fitting that is well-matched to the quantity of interest, which is typically the accuracy of the policy gradient estimation.
如果这个关系是已知的，我们可以为价值函数拟合选择一个误差指标，它与感兴趣的量非常匹配，这通常是策略梯度估计的准确性。
Some candidates for such an error metric might includethe Bellman error or projected Bellman error, as described in Bhatnagar et al. (2009).
如 Bhatnagar 等人(2009) 所述，这种误差指标的一些候选者可能包括 Bellman 误差或预测 Bellman 误差。

↓ 【后续工作展望 2：共享函数近似架构 ——> 特征共用，更快地学习】

Another enticing possibility is to use a shared function approximation architecture for the policy and the value function, while optimizing the policy using generalized advantage estimation.
另一种诱人的可能性是为策略和价值函数使用共享函数近似架构，同时使用广义优势估计优化策略。
While formulating this problem in a way that is suitable for numerical optimization and provides convergence guarantees remains an open question, such an approach could allow the value function and policy representations to share useful features of the input, resulting in even faster learning.
虽然以一种适合于数值优化并提供收敛保证的方式来表述这个问题仍然是一个悬而未决的问题，但这种方法可以让价值函数和策略表示 共享输入的有用特征，从而更快地学习。

↓ 【同期工作比较】

In concurrent work, researchers have been developing policy gradient methods that involve differentiation with respect to the continuous-valued action (Lillicrap et al., 2015; Heess et al., 2015). 【同期工作有哪些？】
在并行工作中，研究人员一直在开发涉及连续值动作 微分的策略梯度方法 (Lillicrap等人，2015;Heess et al.， 2015)。
While we found empirically that the one-step return ( $\lambda= 0$ ) leads to excessive bias and poor performance, these papers show that such methods can work when tuned appropriately. 【同期工作的主要发现】
我们从经验上发现一步回报 ( $\lambda= 0$ ) 会导致过大的偏差和较差的性能，但这些论文表明，这些方法在适当调整时可以工作。
However, note that those papers consider control problems with substantially lower-dimensional state and action spaces than the ones considered here. 【本工作与同期工作的区别之处】
然而，请注意，那些论文考虑的控制问题具有比本文考虑的低维状态和动作空间的问题。
A comparison between both classes of approach would be useful for future work.
两类方法之间的比较将有助于今后的工作。

致谢

我们感谢 Emo Todorov 提供了模拟器以及富有见地的讨论，
我们感谢 Greg Wayne、Yuval Tassa、Dave Silver、Carlos Florensa Campo 和 Greg Brockman 进行了富有见地的讨论。
这项研究的部分资金是由海军研究办公室通过一个年轻研究者奖和拨款编号 NO0014-11-1-0688, DARPA 通过青年教师奖，由陆军研究办公室通过 MAST 项目获得。

A 常见问题

A.1 与兼容特征的关系是什么?
与使用价值函数的策略梯度算法相关的兼容特征经常被提到，Konda 和 Tsitsiklis (2003) 在《On Actors - Critic Methods》一文中提出了这个想法。
这些作者指出，由于策略的表示能力有限，策略梯度只依赖于优势函数空间的某一个子空间。
这个子空间是由相容特征 $\nabla_{\theta_i} \log π_\theta(a_t| s_t)$ 张成的，其中 $\in\{1,2,\cdots,\dim \theta\}$ 。
这种兼容特征理论没有提供如何利用问题的时序结构来获得更好的优势函数估计 的指导，使其与本文的思想ideas 基本正交。

兼容特征的想法激发了一种计算自然策略梯度的优雅方法(Kakade, 2001;Peters & Schaal, 2008)。
给定优势函数 $\hat A_t$ 在每个时间步的经验估计，我们可以通过求解以下最小二乘问题将其投影到兼容特征的子空间:

$\underset{\bf r}{\text{minimize}}\sum\limits_{t}\Vert{\bf r}·\nabla_\theta\log \pi_\theta(a_t|s_t)-\hat A_t\Vert^2~~~~~~~~~~(32)$

如果 $\hat A_t$ 是 $\gamma$ -just，则最小二乘解是自然策略梯度(Kakade, 2001a)。
注意，任何优势函数的估计量都可以代入这个公式，包括我们在本文中推导的估计量。
对于我们的实验，我们也计算自然策略梯度步steps，但我们使用 Schulman 等人(2015) 的计算效率更高的数值过程，如第 6 节所述。

A.2 为什么不直接用 $Q$ 函数呢?
先前的 actor-critic 方法，例如 Konda 和 Tsitsiklis(2003)，使用 Q 函数来获得潜在的低方差策略梯度估计。
最近的论文包括 Heess et al. (2015);Lillicrap 等人(2015) 已经证明，神经网络 $Q$ 函数近似器可以有效地用于策略梯度方法。
然而，以本文的方式使用状态-价值函数有几个优点。
首先，状态-价值函数的输入维度较低，因此比状态-动作价值函数 更容易学习。
其次，本文的方法允许我们在高偏差估计器 ( $\lambda$ = 0) 和低偏差估计器 ( $\lambda$ = 1)之间平滑地插值。
另一方面，使用参数化 $Q$ 函数只允许我们使用高偏差估计器。
我们发现，当使用回报的一步估计时，即 $\lambda$ = 0 估计器， $Â_t = \Delta_t^V= r_t +\gamma V(s_{t+1}) - V(s_t)$ ，偏差是非常大的。
我们预计，当使用涉及参数化 $Q$ 函数的优势估计器 $Â_t =Q(s, a) - V(s)$ 时，也会遇到类似的困难。
有一个有趣的可能算法空间，使用参数化的 $Q$ 函数并尝试减少偏差，然而，对这些可能性的探索超出了本工作的范围。

使用参数化的 $Q$ 函数并尝试减少偏差

B 证明

在这里插入图片描述

命题 1 证明：首先我们可以把期望分成包含 $Q$ 和 $b$ 的项，

$\begin{aligned}&{\mathbb E}_{s_{0:\infty}, a_{0:\infty}}\Big[\nabla_\theta\log \pi_\theta(a_t|s_t)\Big(Q_t(s_{0:\infty}, a_{0:\infty})-b_t(s_{0:t},a_{0:t-1})\Big)\Big]\\ &={\mathbb E}_{s_{0:\infty}, a_{0:\infty}}[\nabla_\theta\log \pi_\theta(a_t|s_t)Q_t(s_{0:\infty}, a_{0:\infty})]\\ &~~~~~-{\mathbb E}_{s_{0:\infty}, a_{0:\infty}}[\nabla_\theta\log \pi_\theta(a_t|s_t)b_t(s_{0:t},a_{0:t-1})]~~~~~~~~~~(33)\end{aligned}$

我们将依次考虑 $Q$ 项和 $b$ 项。

$\begin{aligned}&{\mathbb E}_{s_{0:\infty}, a_{0:\infty}}[\nabla_\theta\log \pi_\theta(a_t|s_t)Q_t(s_{0:\infty}, a_{0:\infty})]\\ &={\mathbb E}_{s_{0:t}, a_{0:t}}\Big[{\mathbb E}_{s_{t+1:\infty}, a_{t+1:\infty}}[\nabla_\theta\log \pi_\theta(a_t|s_t)Q_t(s_{0:\infty}, a_{0:\infty})]\Big]\\ &={\mathbb E}_{s_{0:t}, a_{0:t}}\Big[\nabla_\theta\log \pi_\theta(a_t|s_t){\mathbb E}_{s_{t+1:\infty}, a_{t+1:\infty}}[Q_t(s_{0:\infty}, a_{0:\infty})]\Big]~~~~~~\textcolor{blue}{注意时间步，后一个求期望只考虑时间步~ t + 1 ~及之后的，因此可以把梯度项移到期望求解之外}\\ &={\mathbb E}_{s_{0:t}, a_{0:t}}\Big[\nabla_\theta\log \pi_\theta(a_t|s_t)\textcolor{blue}{A^\pi(s_t,a_t)}\Big]\end{aligned}$

$\begin{aligned}&{\mathbb E}_{s_{0:\infty}, a_{0:\infty}}[\nabla_\theta\log \pi_\theta(a_t|s_t)b_t(s_{0:t},a_{0:t-1})]\\ &={\mathbb E}_{s_{0:t}, a_{0:t-1}}\Big[{\mathbb E}_{s_{t+1:\infty}, a_{t:\infty}}[\nabla_\theta\log \pi_\theta(a_t|s_t)b_t(s_{0:t},a_{0:t-1})]\Big]\\ &={\mathbb E}_{s_{0:t}, a_{0:t-1}}\Big[\underbrace{{\mathbb E}_{s_{t+1:\infty}, a_{t:\infty}}[\nabla_\theta\log \pi_\theta(a_t|s_t)\textcolor{blue}{]}}_{0}b_t(s_{0:t},a_{0:t-1})\Big]\\ &={\mathbb E}_{s_{0:t}, a_{0:t-1}}\Big[0·b_t(s_{0:t},a_{0:t-1})\Big]\\ &=0\end{aligned}$