Block Successive Upper Bound Minimization Method（BSUM）算法

BSUM优化方法学习

先验知识
参考资料1 A Uniﬁed Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization
SUCCESSIVE UPPER-BOUND MINIMIZATION (SUM) 连续上限最小化算法
THE BLOCK SUCCESSIVE UPPER-BOUND MINIMIZATION ALGORITHM 块连续上限最小化算法
应用案例
参考资料2 A Block Successive Upper Bound Minimization Method of Multipliers for Linearly Constrained Convex Optimization
- 概述
- 应用例子
- 文献综述

先验知识

在这里插入图片描述

参考资料1 A Uniﬁed Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization

块坐标下降 (BCD) 方法广泛用于最小化多个块变量的连续函数 f。在此方法的每次迭代中，都会优化单个变量块，而其余变量保持固定。为了保证BCD方法的收敛性，每次迭代中待优化的子问题都需要准确地求解到其唯一的最优解。不幸的是，这些要求对于许多实际场景来说往往限制太多。在本文中，我们研究了一种替代的不精确 BCD 方法，该方法通过连续最小化 f 的一系列近似值来更新变量块，这些近似值要么是 f 的局部紧上限，要么是 f 的严格凸局部近似值。我们专注于表征相当广泛的此类方法的收敛特性，特别是对于目标函数不可微或非凸的情况。我们的结果统一并扩展了许多经典算法的现有收敛结果，例如BCD方法、凸函数差（DC）方法、期望最大化（EM）算法以及交替近端最小化算法。

Block Coordinate Descent, Block Successive Upper-bound Minimization, Successive Convex Ap-
proximation, Successive Inner Approximation 在这里插入图片描述

SUCCESSIVE UPPER-BOUND MINIMIZATION (SUM) 连续上限最小化算法

为了深入了解一般的不精确 BCD 方法，让我们首先考虑一种简单的连续上限最小化 (SUM) 方法，其中所有变量都分组到一个块中。尽管形式简单，SUM 算法却是 DC 编程 [33] 和 EM 算法 [4] 等许多重要算法的关键。考虑以下优化问题

考虑以下优化问题
$\begin{array}{cc}\min&f(x)\\\\\text{s.t.}&x\in\mathcal{X},\end{array}$

其中 $\mathcal{X}$ 是闭凸集。不失一般性，我们可以假设 $\text{dom} \mathcal{f} = \mathcal{X}$ 。当目标函数 $\mathcal{f}(·)$ 非凸和/或非光滑时，直接求解 (2) 可能并不容易。 SUM 算法通过优化一系列近似目标函数来规避这种困难。更具体地说，从可行点 $x^0$ 开始，算法根据以下更新规则生成序列 ${x^r\}$
$x^r\in\arg\min_{x\in\mathcal{X}}\quad u(x,x^{r-1})$

其中， $x^{r-1}$ 是由算法在第 $r - 1$ 次迭代生成的点，而 $u(x, x^{r-1})$ 是第 $r$ 次迭代中对 $f (x)$ 的近似。通常需要选择近似函数 $u(\cdot, \cdot)$ ，使得子问题（3）易于求解。此外，为了保证 SUM 算法的收敛性，需要满足 $u(\cdot, \cdot)$ 的某些规律性条件（稍后将讨论）。在其它要求中， $u(x, x^{r-1})$ 需要是 $f (x)$ 的全局上界，因此得名该算法。SUM 算法的主要步骤如图 $1$ 所示。

在这里插入图片描述

我们指出，所提出的 SUM 算法在许多方面与[21]中开发的内近似算法（IAA）相似，但有以下主要区别：
• IAA 算法近似目标函数和可行集。相反，SUM算法仅近似目标函数。
• IAA 算法仅适用于具有光滑目标的问题，而SUM 算法也能够处理非光滑目标。

值得一提的是，现有的IAA算法的收敛结果相当弱。特别是，[21，定理 1] 指出，如果整个序列收敛，那么算法应该收敛到一个驻点。接下来，我们表明，只要近似函数 $u(\cdot, \cdot)$ 满足我们在下面概述的某些温和假设1，SUM 算法就可以提供更强的收敛保证。

Assumption 1 Let the approximation function $u(\cdot,\cdot)$ satisfy the following
(A1)(A2)(A3)(A4)
$\begin{aligned} &u(y,y)=f(y),\quad\forall y\in\mathcal{X} \\ &u(x,y)\geq f(x),\quad\forall x,y\in\mathcal{X} \\ &u^{\prime}(x,y;d)\bigg|_{x=y}=f^{\prime}(y;d),\quad\forall d \mathrm{with} y+d\in\mathcal{X} \\ &u(x,y)\text{ is continuous in }(x,y) \end{aligned}$

The assumptions (Al) and (A2) imply that the approximate function $u(\cdot,x^{r-1})$ in (3) is a tight upper bound of the original function. The assumption (A3) guarantees that the first order behavior of $u(\cdot,x^{r-1})$ is the same as $f(\cdot)$ locally (note that the directional derivative $u^{\prime}(x,y;d)$ is only with respect to the variable $x) .$ Although directly checking (A3) may not be easy, the following proposition provides a sufficient condition under which (A3) holds true automatically.

假设（Al）和（A2）意味着在（3）中的近似函数 $u(\cdot, x^{r-1})$ 是原始函数的紧上界。假设（A3）保证了 $u(\cdot, x^{r-1})$ 的一阶导数为与局部的 $f(\cdot)$ 相同（注意，方向导数 $u^{'} (x, y; d)$ 仅针对变量 $x$ ）。虽然直接验证（A3）可能并不容易，但以下命题提供了一个充分条件，在此条件下（A3）自动成立。

Proposition 1 Assume $f(x)=f_0(x)+f_1(x),where$ $f_0(\cdot)$ is continuously differentiable and the directional derivative $off_{1}( \cdot )$ exists $a t$ every point $x\in \mathcal{X} .$ Consider $u( x, y) = u_{0}( x, y) + f_{1}( x)$ , where $u_{0}( x, y)$ is $a$ continuously differentiable function satisfying the following conditions

命题 1 假设 $f(x) = f_0(x) + f_1(x)$ ，其中 $f_0(\cdot)$ 是连续可微的，并且 $f_1(\cdot)$ 的方向导数在每一点 $\in \mathcal{X}$ 都存在。考虑 $u(x, y) = u_{0}(x, y) + f_{1}(x)$ ，其中 $u_{0}(x, y)$ 是一个连续可微的函数，满足以下条件：
$\begin{aligned}&u_{0}(y,y)=f_{0}(y),\quad\forall y\in\mathcal{X}\\&u_{0}(x,y)\geq f_{0}(x),\quad\forall x,y\in\mathcal{X}.\end{aligned}$

Then, (A1), (A2) and (A3) hold for $u(\cdot, \cdot)$ .

Theorem 1 : Assume that Assumption 1 is satisfied. Then every limit point of the iterates generated by
the SUM algorithm is a stationary point of the problem (2).

定理 1 假设满足假设 1。那么，由 SUM 算法生成的迭代序列的每一个极限点都是问题（2）的稳定点。

(10)

$Proof{:}$ Firstly, we observe the following series of inequalities

$f(x^{r+1})\stackrel{(\mathrm{i})}{\leq}u(x^{r+1},x^r)\stackrel{(\mathrm{ii})}{\leq}u(x^r,x^r)=f(x^r),\quad\forall\:r=0,1,2,\ldots$

where step (i) is due to (Al), step (ii) follows from the optimality of $x^{t+1}$ (cf. step 4 and 5 in Figll), and the last equality is due to (A2). A straightforward consequence of (10) is that the sequence of the objective function values are non-increasing, that is

$f(x^0)\geq f(x^1)\geq f(x^2)\geq\ldots$

Assume that there exists a subsequence ${x^{r_j}\}$ converging to a limit point $z .$ Then Assumptions (A1),
(A2) together with (11) imply that

$u(x^{r_{j+1}},x^{r_{j+1}})=f(x^{r_{j+1}})\leq f(x^{r_{j}+1})\leq u(x^{r_{j}+1},x^{r_{j}})\leq u(x,x^{r_{j}}),\quad\forall\:x\in\mathcal{X}$
Letting $j\to\infty$ , we obtain

$u(z,z)\leq u(x,z),\quad\forall\:x\in\mathcal{X},$

(11)

which implies

$\left.u'(x,z;d)\right|_{x=z}\geq0,\quad\forall\:d\in\mathbb{R}^m\:\text{with}\:z+d\in\mathcal{X}.$

Combining with (A3), we obtain

$f'(z;d)\geq0,\quad\forall\:d\in\mathbb{R}^m\:\text{with}\:z+d\in\mathcal{X},$

implying that $z$ is a stationary point of $f(\cdot).$

$\textbf{Corollary 1 Assume that the level set }\mathcal{X} ^0= \{ x\mid f( x) \leq f( x^0) \}$ is compact and Assumption $\mathbb{Z}$ holds.
Then, the sequence of iterates ${x^r\}$ generated by the SUM algorithm satisfy

$\lim\limits_{r\to\infty}\quad d(x^r,\mathcal{X}^*)=0,$

where $X^*$ is the set of stationary points of (2).

THE BLOCK SUCCESSIVE UPPER-BOUND MINIMIZATION ALGORITHM 块连续上限最小化算法

在许多实际应用中，优化变量（复数）可以分解为独立的块（复数）。当明智地利用这种块结构时，可以产生可分布式实现的低复杂度算法。在本节中，我们介绍块连续上界最小化（BSUM）算法，该算法有效地考虑了这种块结构。

让我们假设可行集 $\mathcal{X}$ 是 $n$ 个闭凸集的笛卡尔积： $\mathcal{X}=\mathcal{X}_1\times\ldots\times\mathcal{X}_n$ ，其中 $\mathcal{X}_i \subseteq \mathbb{R}^{m_i}$ ，且 $\sum_i m_i = m$ 。相应地，优化变量 $\in \mathbb{R}^m$ 可以被分解为： $x=(x_{1},x_{2},\ldots,x_{n})$ ，其中 $x_i \in \mathcal{X}_{i}$ ， $\cdots, n$ 。我们感兴趣的问题是
$\begin{array}{ll}\min&f(x)\\\\\mathrm{s.t.}&x\in\mathcal{X}.\end{array}$

与 SUM 算法不同，BSUM 算法在每次迭代中仅更新单个变量块。更准确地说，在第 $r$ 次迭代时，通过解决以下子问题来计算所选块（例如块 $i$ ）
$\begin{aligned}&\min_{x_{i}}\quad u_{i}(x_{i},x^{r-1})\\&\mathrm{s.t.}\quad x_{i}\in\mathcal{X}_{i},\end{aligned}$

其中 $u_i(\cdot, x^{r-1})$ 再次是对原始目标函数 $f(\cdot)$ 在点 $x^{r-1}$ 的近似（实际上是全局上界）。图 2 总结了 BSUM 算法的主要步骤。注意，尽管块是按照简单的循环规则更新的，但算法及其收敛结果可以轻松扩展到（更通用的）本质上的循环更新规则。这一点将在第七节进一步阐述。
where $u_i(\cdot,x^{r-1})$ is again an approximation (in fact, a global upper-bound) of the original objective $f(\cdot)$ at the point $x^{r-1}$ .Fig. $\color{red}{\boxed{2}}$ summarizes the main steps of the BSUM algorithm. Note that although the blocks are updated following a simple cyclic rule, the algorithm and its convergence results can be easily extended to the (more general) essentially cyclic update rule as well. This point will be further elaborated in Section VII

现在我们准备研究 BSUM 算法的收敛行为。为此，函数 $u_{i}( \cdot, \cdot)$ 需要满足以下正则条件。

Assumption 2
(B1)(B2)(B3)(B4)
$\begin{aligned}&u_{i}(y_{i},y)=f(y),\quad\forall y\in\mathcal{X},\forall i\\&u_{i}(x_{i},y)\geq f(y_{1},\ldots,y_{i-1},x_{i},y_{i+1},\ldots,y_{n}),\quad\forall x_{i}\in\mathcal{X}_{i},\forall y\in\mathcal{X},\forall i\\&u_{i}'(x_{i},y;d_{i})\bigg|_{x_{i}=y_{i}}=f'(y;d),\quad\forall d=(0,\ldots,d_{i},\ldots,0) \mathrm{s.t.} y_{i}+d_{i}\in\mathcal{X}_{i},\forall i\\&u_{i}(x_{i},y) \text{is continuous in }(x_{i},y),\quad\forall i\end{aligned}$
与命题1类似，我们可以确定一个充分条件来保证(B3)。

Proposition 2 Assume $f(x)=f_0(x)+f_1(x),where$ $f_0(\cdot)$ is continuously differentiable and the directional derivative of $f_1(\cdot)$ exists at every point $x\in \mathcal{X} . \textit{Consider }u_i( x_i, y) = u_{0, i}( x_i, y) + f_1( x) , where$ $u_{0, i}( x_i, y) \textit{ satisfres the following assumptions}$

命题 2 假设 $f(x)=f_0(x)+f_1(x)$ ，其中 $f_0(\cdot)$ 是连续可微的，并且 $f_1(\cdot)$ 的方向导数在每一点 $\in \mathcal{X}$ 都存在。考虑 $u_i(x_i, y) = u_{0, i}(x_i, y) + f_1(x)$ ，其中 $u_{0, i}(x_i, y)$ 满足以下假设：
$\begin{aligned}&u_{0,i}(x_{i},x)=f_0(x),\quad\forall x\in\mathcal{X},\quad\forall i\\&u_{0,i}(x_{i},y)\geq f_0(y_1,\ldots,y_{i-1},x_i,y_{i+1},\ldots,y_n), \forall x,y\in\mathcal{X}\quad\forall i.\end{aligned}$

Then, (B1), (B2), and (B3) hold.

证明：证明与命题1的证明完全相同。

BSUM算法的收敛结果由两部分组成。第一部分假设目标函数拟凸，保证了极限点的存在。这与 [2] 中 BCD 方法的经典收敛证明的精神相同。然而，如果我们知道迭代位于一个紧凑的集合中，那么就可以证明更强的结果。事实上，在定理的第二部分中，收敛是通过放宽拟凸性假设同时施加水平集的紧性假设来获得的。

Theorem 2
(a) 假设函数 $u_i(x_i, y)$ 在 $x_i$ 上是准凸的，并且假设 $\boxed{2}$ 成立。此外，假设子问题 (13) 对于任意点 $x^{r-1} \in \mathcal{X}$ 有唯一解。那么，由 BSUM 算法生成的迭代序列的每个极限点 $z$ 都是 (12) 的坐标最小值。此外，如果 $f(\cdot)$ 在 $z$ 处是正则的，那么 $z$ 是 (12) 的稳定点。

(b) 假设水平集 $\mathcal{X}^0 = \{x \mid f(x) \leq f(x^0)\}$ 是紧致的，并且假设 $\boxed{2}$ 成立。此外，假设对于任意点 $x^{r-1} \in \mathcal{X}$ ，至少 $n - 1$ 个块的子问题 (13) 有唯一解。如果 $f(\cdot)$ 在稳定点集 $X^*$ 中的每个点相对于坐标 $x_{1}, \ldots, x_{n}$ 都是正则的。那么，由 BSUM 算法生成的迭代序列收敛到稳定点集，即
$\lim\limits_{r\to\infty}\quad d(x^r,\mathcal{X}^*)=0.$

应用案例

论文
Movable Frequency Diverse Array-Assisted Covert Communication With Multiple Wardens

Next, we define the non-convex function in (21a) as
$\left.y_{k,m,n}\left(x\right)=\cos\left[2\pi\left(\begin{array}{c}\left(f_mx-f_nx_n\right)\frac{\sin\theta_{w_k}-\sin\theta_b}{c}\\+\left(f_m-f_n\right)\frac{r_b-r_{w_k}}{c}\end{array}\right.\right)\right].\\(22)$

Following the BSUM method, the objective function (22)
is approximated by the upper-bound quadratic function
$u_{k,m,n} (x)$ , which is defined by
$u_{k,m,n}\left(x\right)=A_{k,m,n}(x-B_{k,m,n})^{2}+C_{k,m,n},$

where $A_{k,m,n}\in\mathbb{R}, A_{k,m,n}>0, B_{k,m,n}\in\mathbb{R}$ and $C_{k,m,n}\in \mathbb{R}$ are the parameters of the new quadratic function. For a given point $x_m^{s-1}$ in (22), the approximate function (23) should satisfy the following constraints:
$\left.\left\{\begin{array}{l}u_{k,m,n}\left(x_m^{s-1};x_n^{s-1}\right)=y_{k,m,n}\left(x_m^{s-1};x_n^{s-1}\right)\\u_{k,m,n}^{'}\left(x_m^{s-1};x_n^{s-1}\right)=y_{k,m,n}^{'}\left(x_m^{s-1};x_n^{s-1}\right)\\u_{k,m,n}\left(B_{k,m,n};x_n^{s-1}\right)\in\{1,-1\}\\-1\leq y_{k,m,n}\left(B_{k,m,n};x_n^{s-1}\right)\leq1,\end{array}\right.\right.$

参考资料2 A Block Successive Upper Bound Minimization Method of Multipliers for Linearly Constrained Convex Optimization

概述

Consider the problem of minimizing a convex function $f (x)$ subject to linear equality constraints:

$\begin{aligned}\text{minimize}\: f(x):=g\left(x_1,\cdots,x_K\right)+\sum h_k(x_k)\end{aligned}$

(1.1)

$\begin{aligned}\text{subject to}\: E_1x_1+E_2x_2+\cdots+E_Kx_K=q,\end{aligned}$

$x_k\in X_k$ , $k = 1, 2, ..., K$

where $g(\cdot)$ is a smooth convex function; $h_k$ is a nonsmooth convex function; $x=(x_1^T,...,x_K^T)^T\in\Re^n$ is a partition of the optimization variable $x,x_k\in\Re^{n_k};X=\prod_{k=1}^KX_k$ is the feasible set for $x;q\in\Re^m$ is a vector. Let $E:=(E_1,\cdots,E_K)$ and $h(x):=\sum_k=1^Kh_k(x_k).$ Many contemporary problems in signal processing, machine learning and smart grid systems can be formulated in the form (1.1) To motivate our work, we discuss several examples of the form (1.1) below.

在这里插入图片描述

应用例子

basis pursuit (BP) problem
在这里插入图片描述

the control of a smart grid system

在这里插入图片描述

cognitive radio network
(CRN)
在这里插入图片描述

文献综述

当线性耦合约束不存在时，求解（1.1）的一个众所周知的技术是使用所谓的块坐标下降（BCD）方法，在每次迭代中，单个变量块被优化，而其余块保持固定。更具体地，在迭代r，通过以下方式以高斯-赛德尔方式更新块：

在这里插入图片描述

由于每一步都涉及解决一个小规模的简单子问题，BCD方法对于解决大规模问题非常有效;参见例如，[11 -14]和其中的参考文献。BCD方法的现有分析[15-18]要求每个子问题（1.7）的极小值的唯一性，或f的拟凸性[19]。当问题（1.7）不容易求解时，一种流行的方法是求解问题（1.7）的近似版本，产生块坐标梯度下降（BCGD）算法（或存在非光滑函数h时的块坐标近似梯度算法）[13，20-22]。BCD型算法的全局收敛速度已被广泛研究。当目标函数是强凸时，BCD算法全局线性收敛[23]。当目标函数光滑且不是强凸时，Luo和Tseng证明了BCD方法及其许多变体仍然可以线性收敛，只要在解集周围满足一定的局部误差界条件[23-26]。这条分析线最近被扩展到允许目标中的某类非光滑函数[21，27 -29]。最近有一些工作描述了BCD型算法的全局次线性收敛速度[14，22，30，31]。特别地，参考文献[30]证明了对于一类非光滑凸问题，带有Gauss-Seidel更新规则的BCD算法在O（1 r）阶下是次线性收敛的.此外，在[1]中提出了一个统一的算法框架，称为BSUM（块连续上限最小化）及其收敛性分析，其中在每一步中，目标函数的局部紧上限被连续最小化以更新可变块。

当存在线性耦合约束时，众所周知，BCD型算法可能无法找到任何（局部）最优解[32]。解决这类问题的常用算法是所谓的交替方向乘子法（ADMM）[33-36]。在ADMM方法中，不是始终保持可行性，而是使用拉格朗日乘数y将约束Ex = q对偶化，并添加二次惩罚项。所得到的增广拉格朗日函数具有以下形式：
在这里插入图片描述
其中 ρ > 0 是常数，<·,·> 表示内积运算符。 ADMM 方法通过使用 BCD 类型过程更新原始块变量 x1,… , xn 来最小化 L(x; y)。后者通常会导致具有封闭形式解决方案的简单子问题。这些原始更新之后是对偶变量 y 的梯度上升更新。

尽管 ADMM 算法早在 1976 年就由 Gabay、Mercier、Glowinski 和 Marrocco 提出[35,37]，但由于其在机器学习和计算机视觉产生的现代大规模优化问题中的应用，它直到最近才变得流行起来[33] ，38-42]。在实践中，该算法通常在计算上非常高效，并且比传统算法（例如双上升算法[43-45]或乘法器方法[46]）收敛速度更快。 ADMM的收敛性是在目标可分离且只有两个块变量的条件下成立的，即g(x1,····,xK)=g1(x1)+····+gK(xK)， K = 2 [35,37]。对于诸如压缩感知引起的大规模问题，原始每块子问题的最优解可能不容易计算[47]。在这些情况下，经典的 ADMM 可以修改为对每个子问题执行简单的近端梯度步骤 [34,40,47-50]。当只有两个块变量时，最近的一些工作 [51, 52] 表明 ADMM 方法以 O(1/ r ) 的速率收敛（对于加速版本 [53] 则为 O( 1 / r2 )）。此外，参考文献[53-55]表明，当目标函数是强凸且只有两个变量块时，ADMM 会线性收敛。最近的一项研究 [56] 表明，在 K ≥ 3 的情况下，ADMM 的全局（线性）收敛，假设： a) 对于每个 k，Ek 是满列秩； b) 双步长足够小； c) 最优解集周围存在一定的误差界限； d) 目标是可分离的。如果不满足这些条件并且当K≥3时，[57]表明ADMM通常确实可以发散。最近的一些其他工作尝试针对 K ≥ 3 的情况修改原始 ADMM [58-60]。

不幸的是，BCD 和 ADMM 都不能用来解决问题（1.1）。事实上，由于其多块结构以及目标和约束的变量耦合，该问题无法通过许多其他大数据方法来处理，包括SpaRSA [61]，FPCBB [62]，FISTA [63] ]、ALM [64]、HOGWILD [65]、FPA [66]。本文的主要贡献是提出并分析了一种新颖的乘子分块连续上限最小化方法（BSUM-M）及其随机版本，可以有效地解决问题（1.1）。 BSUM-M算法集成了BSUM和ADMM算法，每次优化原始问题一个块变量的近似增广拉格朗日，然后使用梯度上升步骤更新对偶变量。由此产生的算法是灵活的，因为我们可以选择合适的增强拉格朗日函数的近似值，从而可以方便地更新原始变量块（例如以封闭形式）。在没有线性耦合约束的情况下，随机BSUM-M算法简化为随机BCD算法。在本例中，我们表明，对于一系列没有强凸目标的问题，随机 BCD 算法实际上是线性收敛的（符合预期）。据我们所知，这是第一个显示随机 BCD 算法在没有强凸性的情况下线性收敛速度的结果