IEEE TAI 2024
paper
Weighted TD3_BC
Method
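In the offline phase, the algorithm builds on TD3_BC and adds a weight function derived from the Q-function, which mitigates overestimation to some extent.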
$$
J_{\mathrm{offline}}(\boldsymbol{\theta})=\mathbb{E}_{(\boldsymbol{s},\boldsymbol{a})\sim\mathcal{B}}\left[\zeta Q_{\boldsymbol{\phi}}(\boldsymbol{s},\pi_{\boldsymbol{\theta}}(\boldsymbol{s}))\right]-\left\|\pi_{\boldsymbol{\theta}}(\boldsymbol{s})-\boldsymbol{a}\right\|^{2}
$$
where the weight $\zeta$ is related to the Q-function as follows:
$$
\zeta=\frac{\alpha}{\frac{1}{m}\sum_{(\boldsymbol{s}_{i},\boldsymbol{a}_{i})\in\overline{\mathcal{B}}}|Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i})|}
$$
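To make the weighting concrete, here is a minimal sketch of $\zeta$ and the offline objective on one mini-batch, in plain Python with no dependencies. The function names, the `eps` safeguard, and the choice to average the BC term over the batch (as TD3_BC does) are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def q_weight(q_data: Sequence[float], alpha: float, eps: float = 1e-8) -> float:
    """zeta = alpha / ((1/m) * sum_i |Q(s_i, a_i)|) over one mini-batch."""
    mean_abs_q = sum(abs(q) for q in q_data) / len(q_data)
    # eps guards against division by zero; it is an assumption, not part of the formula.
    return alpha / (mean_abs_q + eps)


def bc_penalty(pi_actions: Sequence[Vector], data_actions: Sequence[Vector]) -> float:
    """Batch mean of the squared distance ||pi(s) - a||^2."""
    sq_norms = [
        sum((p - a) ** 2 for p, a in zip(pi_a, data_a))
        for pi_a, data_a in zip(pi_actions, data_actions)
    ]
    return sum(sq_norms) / len(sq_norms)


def offline_objective(
    q_pi: Sequence[float],          # Q(s, pi(s)) for each sample
    q_data: Sequence[float],        # Q(s_i, a_i) for each sample, used only for zeta
    pi_actions: Sequence[Vector],   # pi(s) for each sample
    data_actions: Sequence[Vector], # dataset action a for each sample
    alpha: float,
) -> float:
    """J_offline = mean(zeta * Q(s, pi(s))) - mean(||pi(s) - a||^2), to be maximized."""
    zeta = q_weight(q_data, alpha)
    q_term = zeta * sum(q_pi) / len(q_pi)
    return q_term - bc_penalty(pi_actions, data_actions)
```

Because $\zeta$ divides by the average Q magnitude, the Q term stays on the scale of $\alpha$ regardless of how large the critic's values grow, so the BC term is not drowned out. In a gradient-based implementation $\zeta$ would typically be treated as a constant (no gradient flows through it), as in the TD3_BC reference code; whether this paper does the same is an assumption here.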
In the online phase, to prevent a performance drop in the policy, the BC term is retained in the policy objective, as follows:
$$
J_{\mathrm{online}}(\boldsymbol{\theta})=\mathbb{E}_{(\boldsymbol{s},\boldsymbol{a})\sim\mathcal{B}}\left[\zeta Q_{\boldsymbol{\phi}}\left(\boldsymbol{s},\pi_{\boldsymbol{\theta}}(\boldsymbol{s})\right)\right]-\lambda\left\|\pi_{\boldsymbol{\theta}}(\boldsymbol{s})-\boldsymbol{a}\right\|^{2}
$$
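The role of $\lambda$ can be seen on a toy batch. The sketch below is illustrative only: the function name, the fixed `zeta`, the batch values, and the batch-mean treatment of the BC term are all assumptions, and the note does not say how the paper chooses or schedules $\lambda$.

```python
from typing import Sequence

Vector = Sequence[float]


def online_objective(
    q_pi: Sequence[float],          # Q(s, pi(s)) for each sample
    pi_actions: Sequence[Vector],   # pi(s) for each sample
    data_actions: Sequence[Vector], # buffer action a for each sample
    zeta: float,                    # Q-based weight, treated as a constant here
    lam: float,                     # strength of the retained BC term
) -> float:
    """J_online = mean(zeta * Q(s, pi(s))) - lam * mean(||pi(s) - a||^2)."""
    q_term = zeta * sum(q_pi) / len(q_pi)
    bc_term = sum(
        sum((p - a) ** 2 for p, a in zip(pi_a, data_a))
        for pi_a, data_a in zip(pi_actions, data_actions)
    ) / len(pi_actions)
    return q_term - lam * bc_term


# Hypothetical toy batch of two samples with 2-D actions.
q_pi = [2.0, 4.0]
pi_actions = [[0.0, 1.0], [1.0, 1.0]]
data_actions = [[0.0, 0.0], [1.0, 1.0]]

full_bc = online_objective(q_pi, pi_actions, data_actions, zeta=1.0, lam=1.0)
half_bc = online_objective(q_pi, pi_actions, data_actions, zeta=1.0, lam=0.5)
no_bc = online_objective(q_pi, pi_actions, data_actions, zeta=1.0, lam=0.0)
```

With `lam=1.0` the expression coincides with the offline objective, and with `lam=0.0` it reduces to a pure Q-maximization step, so $\lambda$ interpolates between staying close to the buffer actions and following the critic.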
The value function is trained by minimizing the mean squared Bellman error:
$$
\begin{aligned}
L(\boldsymbol{\phi})&=\mathbb{E}_{(\boldsymbol{s},\boldsymbol{a})\sim\mathcal{B}}\left[\left(\bar{y}-Q_{\boldsymbol{\phi}}(\boldsymbol{s},\boldsymbol{a})\right)^{2}\right]\quad(11)\\
\bar{y}&=r+\min_{i=1,2}Q_{\bar{\boldsymbol{\phi}}_{i}}(\boldsymbol{s}^{\prime},\boldsymbol{a}^{\prime}\sim\pi_{\bar{\boldsymbol{\theta}}}).
\end{aligned}
$$
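The critic update can be sketched in a few lines of plain Python. The target takes the smaller of the two target-critic estimates (clipped double-Q, as in TD3) and the loss is the batch mean of the squared error. The `gamma` and `done` arguments are assumptions added here: the formula as written has no discount factor or terminal handling, and the defaults below reproduce it exactly, whereas practical implementations usually set a discount below 1.

```python
from typing import Sequence


def td_target(
    reward: float,
    q1_next: float,       # target critic 1 at (s', a'), with a' from the target policy
    q2_next: float,       # target critic 2 at (s', a')
    gamma: float = 1.0,   # assumption: 1.0 matches the formula as written
    done: bool = False,   # assumption: no bootstrapping past a terminal state
) -> float:
    """y = r + gamma * min(Q1', Q2')."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap


def bellman_mse(q_pred: Sequence[float], targets: Sequence[float]) -> float:
    """Batch mean of (y - Q(s, a))^2."""
    return sum((y - q) ** 2 for q, y in zip(q_pred, targets)) / len(q_pred)


# Hypothetical toy batch of two transitions.
targets = [
    td_target(reward=1.0, q1_next=5.0, q2_next=3.0),
    td_target(reward=0.0, q1_next=2.0, q2_next=4.0),
]
loss = bellman_mse(q_pred=[3.0, 4.0], targets=targets)
```

Taking the minimum over the two target critics biases the target downward, which counteracts the overestimation that a single bootstrapped critic tends to accumulate.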
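Pseudocode
Results
The baselines used for comparison are somewhat dated; it is unclear how this method fares against more recent ones such as Off2On, UPQ, and E2O.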