ICML 2023 Poster
1 Introduction
Offline-to-online (O2O) RL easily suffers from policy collapse caused by distribution shift. Existing remedies include constraining how far the policy may drift and balancing the sampling of offline and online data, but these approaches require estimating distribution divergences or density ratios. To avoid such machinery, this paper does not reshape the Q-values as previous actor-critic (AC) methods do; instead it aligns the critic with the offline policy, so that the Q-values stay bounded even for actions outside the offline policy. Online fine-tuning can then proceed just like an ordinary AC method.
The core of the method comes from SAC's policy representation, which is essentially a softmax over the Q-values and therefore ties the policy to the Q-function:
$$\pi_\theta(a|s)=\exp\Big(\frac{1}{\alpha}Q_\mu(s,a)\Big)\bigg/\sum_{a\in\mathcal{A}}\exp\Big(\frac{1}{\alpha}Q_\mu(s,a)\Big)$$
Setting $Z(s)=\alpha\log\sum_{a\in\mathcal{A}}\exp\big(Q_\mu(s,a)/\alpha\big)$, the above can be rewritten as:
$$Q_\mu(s,a)=Z(s)+\alpha\log\pi_\theta(a|s)$$
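As a quick sanity check of this identity, here is a minimal sketch for a discrete action set (the numbers, `alpha`, and `q` are purely illustrative; the paper works with continuous actions):

```python
import torch

# Hypothetical Q-values over a discrete action set A = {0, 1, 2, 3} at a single state.
alpha = 0.5
q = torch.tensor([1.0, 2.0, -0.5, 0.3])

# SAC policy parameterization: pi(a|s) = softmax(Q(s, .) / alpha).
log_pi = torch.log_softmax(q / alpha, dim=0)

# Z(s) = alpha * log sum_a exp(Q(s, a) / alpha)
z = alpha * torch.logsumexp(q / alpha, dim=0)

# The identity Q(s, a) = Z(s) + alpha * log pi(a|s) holds for every action a.
print(torch.allclose(q, z + alpha * log_pi, atol=1e-5))  # True
```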
The Q-values may be wrongly estimated because of OOD actions, whereas the policy remains trustworthy (though it still needs online fine-tuning). The identity above shows that SAC's offline policy naturally aligns the critic with the actor, which lets us initialize the Q-values for the online stage from the offline policy. Using plain SAC for online fine-tuning, the proposed method performs well across a variety of tasks.
2 Method
The proposed method consists of three stages: 1) offline, 2) actor-critic alignment, 3) online.
2.1 Offline
2.1.1 Actor update
The actor is updated by combining the SAC objective with maximum likelihood (ML):
$$\mathcal{L}_\pi^{\mathrm{SAC+ML}}(\theta,\mathbf{d})=\mathbb{E}_{(s,a)\sim\mathbf{d},\,b\sim\pi_\theta(\cdot|s)}\Big[-\log\pi_\theta(a|s)-\lambda\big(Q_\mu(s,b)-\alpha\log\pi_\theta(b|s)\big)\Big]$$
where the hyperparameter $\lambda$ balances the two terms and is set as:
$$\lambda:=\omega\Big/\mathbb{E}_{(s,a)\sim\mathbf{d}}\big|Q_\mu(s,a)\big|,\quad\text{where }Q_\mu:=\min\{Q_{\mu_1},Q_{\mu_2}\}$$
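A rough PyTorch sketch of this offline actor update, including the adaptive $\lambda$ above (a Gaussian-style policy with the usual `log_prob`/`rsample` interface and twin critics `q1`, `q2` are assumed; all names are placeholders, not the authors' implementation):

```python
import torch

def compute_lambda(q1, q2, s, a, omega=1.0):
    """lambda := omega / E_{(s,a)~d} |Q_mu(s,a)|, with Q_mu := min{Q_mu1, Q_mu2}."""
    with torch.no_grad():
        q_min = torch.min(q1(s, a), q2(s, a))
        return omega / q_min.abs().mean().clamp_min(1e-8)  # clamp is a numerical guard, not from the paper

def actor_loss_sac_ml(pi, q1, q2, alpha, s, a, omega=1.0):
    """L_pi^{SAC+ML}: behavior-cloning log-likelihood plus a lambda-weighted SAC term."""
    lam = compute_lambda(q1, q2, s, a, omega)
    dist = pi(s)
    # Maximum-likelihood (behavior-cloning) term on dataset actions a ~ d.
    ml_term = -dist.log_prob(a)
    # SAC term on reparameterized policy actions b ~ pi_theta(.|s).
    b = dist.rsample()
    q_min = torch.min(q1(s, b), q2(s, b))  # Q_mu := min{Q_mu1, Q_mu2}
    sac_term = q_min - alpha * dist.log_prob(b)
    return (ml_term - lam * sac_term).mean()
```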
2.1.2 Critic update
The critic is updated in the SAC fashion, and the temperature $\alpha$ is updated by gradient descent:
$$\begin{aligned}\mathcal{L}_Q^{\mathrm{SAC+ML}}(\mu_i,\mathbf{d})&:=\mathbb{E}_{(s,a,r,s')\sim\mathbf{d}}\Big[\big(Q_{\mu_i}(s,a)-y(r,s')\big)^2\Big]\\ \text{with }y(r,s')&:=r+\gamma\,\mathbb{E}_{a'\sim\pi_\theta(\cdot|s')}\big[Q_{\bar{\mu}}(s',a')-\alpha\log\pi_\theta(a'|s')\big],\end{aligned}$$
where $\bar{\mu}$ denotes the delayed-update target critic and $Q_{\bar{\mu}}(s,a)=\min_{i\in\{1,2\}}Q_{\bar{\mu}_i}(s,a)$.
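A sketch of this critic update, with `q1_targ`, `q2_targ` standing for the delayed target critics $Q_{\bar{\mu}_i}$ (placeholders, not the authors' code):

```python
import torch

def critic_loss_sac(q_i, q1_targ, q2_targ, pi, alpha, gamma, batch):
    """Bellman regression of one critic Q_{mu_i} against the clipped-double-Q SAC target."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        dist = pi(s_next)
        a_next = dist.sample()
        q_targ = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (q_targ - alpha * dist.log_prob(a_next))
    return ((q_i(s, a) - y) ** 2).mean()
```

The temperature $\alpha$ is tuned by minimizing: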
$$\mathcal{L}_{\mathrm{temp}}^{\mathrm{SAC+ML}}(\alpha,\mathbf{d}):=-\alpha\,\mathbb{E}_{s\sim\mathbf{d}}\,\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[\log\pi_\theta(a|s)-\bar{\mathcal{H}}\big]$$
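A sketch of the temperature update, parameterizing $\alpha$ through its logarithm to keep it positive (a common implementation trick assumed here, not stated in the notes); it follows the expression above, with `target_entropy` playing the role of $\bar{\mathcal{H}}$:

```python
import torch

def temperature_loss(log_alpha, pi, s, target_entropy):
    """L_temp := -alpha * E[ log pi(a|s) - H_bar ], optimized w.r.t. log_alpha only."""
    with torch.no_grad():
        dist = pi(s)
        log_prob = dist.log_prob(dist.sample())
    return -(log_alpha.exp() * (log_prob - target_entropy)).mean()
```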
2.2 Align
The policy optimized in the offline stage usually performs well; denote it $\pi_{\theta_0}$. The critic, by contrast, may be corrupted by OOD actions. The paper therefore introduces an alignment step that ties the critic to the offline policy, again exploiting the SAC policy parameterization that naturally couples the two. The Q-functions are set up as follows, with $\alpha=1$:
$$Q_i(s,a)=\log\pi_{\theta_0}(a|s)+Z_{\psi_i}(s)$$
$Z_{\psi_i}(s)$ is then optimized by minimizing the Bellman error:
$$\begin{aligned}\mathcal{L}_Z^{\mathrm{SAC+ML}}(\psi_i,\mathbf{d})&:=\mathbb{E}_{(s,a,r,s')\sim\mathbf{d}}\Big[\big(\log\pi_{\theta_0}(a|s)+Z_{\psi_i}(s)-y(r,s')\big)^2\Big]\\ \text{where }y(r,s')&:=r+\gamma\,\mathbb{E}_{a'\sim\pi_{\theta_0}(\cdot|s')}\big[\log\pi_{\theta_0}(a'|s')+Z_\psi(s')\big],\quad Z_\psi:=\min\{Z_{\psi_1},Z_{\psi_2}\}.\end{aligned}$$
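A sketch of the alignment step, where `z_i` is the $Z_{\psi_i}$ network being trained, `z1`/`z2` are the two $Z$ networks entering the target, and `pi0` is the frozen offline policy (all placeholders):

```python
import torch

def align_loss(z_i, z1, z2, pi0, gamma, batch):
    """Fit Z_{psi_i}(s) so that log pi_{theta_0}(a|s) + Z_{psi_i}(s) satisfies the Bellman equation."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        dist_next = pi0(s_next)
        a_next = dist_next.sample()
        z_min = torch.min(z1(s_next), z2(s_next))  # Z_psi := min{Z_psi1, Z_psi2}
        y = r + gamma * (dist_next.log_prob(a_next) + z_min)
        log_pi0 = pi0(s).log_prob(a)               # the offline policy stays frozen
    return ((log_pi0 + z_i(s) - y) ** 2).mean()
```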
Thanks to the alignment step, the Q-values learned in the offline stage are simply discarded, so erroneous offline Q estimates cannot cause a collapse in the online stage. Online fine-tuning then proceeds naturally with SAC.
As the paper's figures show, the aligned agent achieves better performance, and the second figure illustrates how well the policy and the Q-function are aligned.
2.3 Online
In the online fine-tuning stage, the Q-function is initialized as:
$$Q_{\phi_i}(s,a):=\log\pi_{\theta_0}(a|s)+R_{\phi_i}(s,a),\quad\text{where }R_{\phi_i}(s,a)\text{ is initialized with }Z_{\psi_i}(s)$$
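One way to realize this decomposition is a thin wrapper around the residual network $R_{\phi_i}$; how exactly $R_{\phi_i}$ is initialized from $Z_{\psi_i}$ (e.g. by copying the state-dependent weights) is left to the authors' implementation, so this is only a sketch:

```python
import torch
import torch.nn as nn

class AlignedQ(nn.Module):
    """Online critic Q_{phi_i}(s, a) := log pi_{theta_0}(a|s) + R_{phi_i}(s, a)."""

    def __init__(self, pi0, r_net):
        super().__init__()
        self.pi0 = pi0  # frozen offline policy pi_{theta_0}
        self.r = r_net  # R_{phi_i}, initialized to reproduce Z_{psi_i}(s)

    def forward(self, s, a):
        with torch.no_grad():  # no gradient flows into the frozen offline policy
            log_pi0 = self.pi0(s).log_prob(a)
        return log_pi0 + self.r(s, a)
```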
The critic is optimized with the SAC objective:
$$\begin{aligned}\mathcal{L}_Q(\phi_i,\mathbf{d})&:=\mathbb{E}_{\mathbf{d}}\Big[\big(\log\pi_{\theta_0}(a|s)+R_{\phi_i}(s,a)-y(r,s')\big)^2\Big]\\ \text{where }y(r,s')&:=r+\gamma\,\mathbb{E}_{a'\sim\pi_\theta(\cdot|s')}\big[\log\pi_{\theta_0}(a'|s')+R_{\bar{\phi}}(s',a')-\alpha\log\pi_\theta(a'|s')\big]\end{aligned}$$
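Structurally this is the same SAC regression as before, only on the aligned decomposition; a sketch using `AlignedQ`-style modules, with `q_targ` standing for the delayed target critic $\log\pi_{\theta_0}+R_{\bar{\phi}}$:

```python
import torch

def online_critic_loss(q_i, q_targ, pi, alpha, gamma, batch):
    """Online Bellman regression for the aligned critic Q_{phi_i} = log pi_0 + R_{phi_i}."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        dist = pi(s_next)
        a_next = dist.sample()
        # q_targ(s', a') already returns log pi_0(a'|s') + R_{phi_bar}(s', a').
        y = r + gamma * (q_targ(s_next, a_next) - alpha * dist.log_prob(a_next))
    return ((q_i(s, a) - y) ** 2).mean()
```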
The actor is still optimized as in SAC, where $R_\phi:=\min_{i\in\{1,2\}}R_{\phi_i}$ and $Q_\phi:=\log\pi_{\theta_0}+R_\phi$:
$$\begin{aligned}\mathcal{L}_\pi(\theta,\mathbf{d})&:=-\mathbb{E}_{s\sim\mathbf{d}}\,\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[Q_\phi(s,a)-\alpha\log\pi_\theta(a|s)\big]\\&=-\mathbb{E}_{s\sim\mathbf{d}}\,\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[R_\phi(s,a)-\alpha\log\pi_\theta(a|s)\big]-\underbrace{\mathbb{E}_{s\sim\mathbf{d}}\,\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[\log\pi_{\theta_0}(a|s)\big]}_{\text{penalizing deviation of }\pi_\theta\text{ from }\pi_{\theta_0}}\end{aligned}$$
The log-likelihood in the second term can be viewed as a regularization term that keeps the new policy close to the offline-optimized one.
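A sketch of this online actor update in its decomposed form, with `r1`/`r2` the residual critics $R_{\phi_i}$ and `pi0` the frozen offline policy (placeholders); the `log_pi0` term is precisely what penalizes deviation from $\pi_{\theta_0}$:

```python
import torch

def online_actor_loss(r1, r2, pi, pi0, alpha, s):
    """L_pi = -E[ R_phi(s,a) - alpha*log pi(a|s) ] - E[ log pi_{theta_0}(a|s) ]."""
    dist = pi(s)
    a = dist.rsample()                     # reparameterized a ~ pi_theta(.|s)
    r_min = torch.min(r1(s, a), r2(s, a))  # R_phi := min{R_phi1, R_phi2}
    log_pi0 = pi0(s).log_prob(a)           # gradient reaches theta only through a
    return -(r_min - alpha * dist.log_prob(a) + log_pi0).mean()
```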