- The multivariate Gaussian is a special multivariate distribution with very wide applications; the raw feature distributions of many real-world problems can be treated as multivariate Gaussians. Taking as an idealized example a multi-class task whose data features follow multivariate Gaussian distributions, this article analyzes, at the theoretical level, the relationship between data features and the difficulty of the classification problem
- Note that the analysis does not touch the algorithmic level; it only examines the data features, so its practical value is limited. In pure machine-learning terms it is at most relevant to feature engineering, though at the engineering level it can guide detector design. The main purpose of this article is to use this scenario to explain how to compute the entropy of, and the KL divergence between, multivariate Gaussian distributions
- For a detailed treatment of entropy and KL divergence (relative entropy), see: A detailed review of information-theoretic concepts: information content, entropy, conditional entropy, mutual information, cross entropy, KL divergence, JS divergence
Table of Contents
- 0. Scenario
- 1. The Multivariate Gaussian Distribution
- 2. Entropy of a Multivariate Gaussian
- 3. Relative Entropy (KL Divergence) Between Two Multivariate Gaussians
- 4. Summary
0. Scenario
- Consider a general multi-class machine-learning scenario: let the raw input of the target-recognition algorithm be $\pmb{x}=(x,y,I,\lambda,\theta,\dots)$, modeled as a $k$-dimensional random vector $\pmb{X}=(X_1,X_2,\dots,X_k)$, and let the number of target classes be $n$
- Further assume that $\pmb{X}$ follows a multivariate Gaussian distribution, which implies that every dimension of the raw observation is Gaussian. For $n=4$, the marginal distributions of $\pmb{X}$ along dimension $f_0$ and along the pair $(f_0, f_1)$ are illustrated below, with a different color for each class
Plotting code for reference:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# Gaussian parameters for the two dimensions f0 and f1
means_2d = [(0, 0), (2, 3), (4, 2), (6, 5)]
covariances = [
    np.array([[0.35, 0.05], [0.05, 0.35]]),
    np.array([[0.75, 0.5], [0.5, 0.75]]),
    np.array([[0.2, 0.1], [0.1, 0.2]]),
    np.array([[1.25, -0.4], [-0.4, 1.25]])
]
colors = ['blue', 'green', 'red', 'purple']

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8))

# Marginal distribution along f0
means_1d = [m[0] for m in means_2d]
std_devs = [np.sqrt(c[0, 0]).item() for c in covariances]  # std dev = sqrt of the variance c[0, 0]

# Probability density curves
x = np.linspace(-2, 10, 1000)
for i, (mean, std_dev, color) in enumerate(zip(means_1d, std_devs, colors)):
    y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
    ax1.plot(x, y, color=color, label=f'Class {i}')

# Draw sampled points as vertical lines
for mean, std_dev, color in zip(means_1d, std_devs, colors):
    samples = np.random.normal(mean, std_dev, 20)
    ax1.vlines(samples, ymin=0,
               ymax=(1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((samples - mean) / std_dev) ** 2),
               color=color, linestyle='-', alpha=0.2)

# Legend and titles
ax1.legend()
ax1.set_title('Distribution of raw observation along feature $f_0$')
ax1.set_xlabel('$f_0$')
ax1.set_ylabel('Probability Density')

# 2-D marginal distribution along f0 and f1
for mean, cov, color in zip(means_2d, covariances, colors):
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = eigenvalues.argsort()[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    angle = np.degrees(np.arctan2(*eigenvectors[:, 0][::-1]))
    width, height = 3 * np.sqrt(eigenvalues)
    ell = Ellipse(xy=mean, width=width, height=height, angle=angle, color=color, alpha=0.2)
    ax2.add_artist(ell)

# Draw sampled points as a scatter plot
for i, (mean, cov, color) in enumerate(zip(means_2d, covariances, colors)):
    samples = np.random.multivariate_normal(mean, cov, 100)
    ax2.scatter(samples[:, 0], samples[:, 1], color=color, s=4, label=f'Class {i}')

# Legend and titles
ax2.legend()
ax2.set_title('Distribution of raw observation along feature $f_0$ & $f_1$')
ax2.set_xlabel('$f_0$')
ax2.set_ylabel('$f_1$')

# Show the figure
plt.tight_layout()
plt.show()
```
- Note that the figure shows the sample distribution of the raw input data, from which the recognition algorithm extracts features for classification. Observe that
- The concentration of dataset $\mathcal{D}_i$ (distribution $p_i$) describes the raw feature strength of class $i$. The more concentrated the distribution, the more consistently the detection system observes class $i$ and the less it is affected by system noise and environmental factors
- The distance between datasets $\mathcal{D}_i$ and $\mathcal{D}_j$ (distributions $p_i$ and $p_j$) characterizes how different the distributions are. The larger the distance, the greater the difference and the easier the two classes are to separate. Conversely, if the raw distributions of two classes overlap (as with $p_1,p_2$), samples falling in the overlap region cannot, in theory, be classified reliably
- To maximize recognition accuracy, work on two fronts (the following criteria can guide detector design)
- Increase intra-class feature strength (make samples of the same class as concentrated as possible), formalized as minimizing the sum of the entropies of the per-class distributions
- Increase inter-class feature difference (push different classes apart), formalized as maximizing the sum of the pairwise relative entropies (KL divergences). Note that because KL divergence is asymmetric it cannot be used directly as a metric, so a small trick is needed here
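The second criterion can be illustrated numerically. The sketch below (a minimal example with hypothetical 1-D class-conditional Gaussians and equal priors, not part of the derivation) integrates the overlap of two densities to estimate the Bayes error, i.e. the irreducible fraction of samples that no classifier can label correctly:

```python
import numpy as np

# 1-D Gaussian probability density
def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Dense grid for numerical integration
x = np.linspace(-10.0, 20.0, 200001)
dx = x[1] - x[0]

# With equal priors, the Bayes error is 0.5 * integral of min(p1, p2):
# samples in the overlap region cannot be classified reliably.
def bayes_error(mu1, s1, mu2, s2):
    p1, p2 = gauss_pdf(x, mu1, s1), gauss_pdf(x, mu2, s2)
    return 0.5 * np.sum(np.minimum(p1, p2)) * dx

err_far = bayes_error(0, 1, 8, 1)   # well-separated classes
err_near = bayes_error(0, 1, 1, 1)  # heavily overlapping classes
print(err_far, err_near)
```

The well-separated pair has near-zero irreducible error, while the overlapping pair cannot be separated much better than random guessing in the overlap region.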
1. The Multivariate Gaussian Distribution
- Let the raw input be $\pmb{x}=(x,y,I,\lambda,\theta,\dots)$, defined as a $k$-dimensional random vector $\pmb{X}=(X_1,X_2,\dots,X_k)$, let the number of target classes be $n$, and let the raw observations of class $i$ follow a multivariate Gaussian distribution:
$$p_i(\pmb{x}) = p_i(x_1,\dots,x_k) = \frac{1}{(2\pi)^{k/2}\operatorname{det}(\pmb{\Sigma}_i)^{1/2}}e^{-\frac{1}{2}(\pmb{x}-\pmb{\mu}_i)^\top \pmb{\Sigma}_i^{-1}(\pmb{x}-\pmb{\mu}_i)} \tag{1.1}$$
We say the random vector $\pmb{X}$ follows a multivariate normal distribution with mean $\pmb{\mu}_i$ and covariance matrix $\pmb{\Sigma}_i$, written $\pmb{X}\sim N(\pmb{\mu}_i,\pmb{\Sigma}_i)$, where $\pmb{\Sigma}_i$ is the covariance matrix:
$$\begin{aligned} \pmb{\Sigma}_i &= \begin{bmatrix} DX_1 &\text{Cov}(X_1,X_2) &\dots &\text{Cov}(X_1,X_k) \\ \text{Cov}(X_2,X_1) &DX_2 &\dots &\text{Cov}(X_2,X_k) \\ \vdots &\vdots &&\vdots\\ \text{Cov}(X_k,X_1) &\text{Cov}(X_k,X_2) &\dots &DX_k \end{bmatrix}\\ &= \begin{bmatrix} \sigma_1^2 &\rho_{12}\sigma_1\sigma_2 &\dots &\rho_{1k}\sigma_1\sigma_k \\ \rho_{21}\sigma_2\sigma_1 &\sigma_2^2 &\dots &\rho_{2k}\sigma_2\sigma_k \\ \vdots &\vdots &&\vdots\\ \rho_{k1}\sigma_k\sigma_1 &\rho_{k2}\sigma_k\sigma_2 &\dots &\sigma_k^2 \end{bmatrix} \end{aligned} \tag{1.2}$$
- Clearly $\pmb{\Sigma}$ is a real symmetric matrix, and in the typical case it is positive definite (whenever $\pmb{X}$ is non-degenerate, i.e. no $X_i$ is constant and no component is a linear combination of the others, all of its leading principal minors are $>0$). Such matrices have several useful properties; a few are listed here for the calculations that follow
- $\pmb{\Sigma}^{-1}$ and $\pmb{\Sigma}_1+\pmb{\Sigma}_2$ are also symmetric positive definite matrices
- Its singular value decomposition (SVD) coincides with its eigendecomposition (spectral decomposition), i.e.
$$\pmb{\Sigma} = \pmb{U}\pmb{\Lambda}\pmb{U}^{\top} \tag{1.3}$$
where $\pmb{U}$ is an orthogonal matrix and $\pmb{\Lambda}$ is a diagonal matrix with positive diagonal entries (positive definiteness), which means $\pmb{\Sigma}$ admits the square root
$$\pmb{\Sigma}^{\frac{1}{2}} = \pmb{U}\pmb{\Lambda}^{\frac{1}{2}}\pmb{U}^{\top} \tag{1.4}$$
and $\pmb{\Sigma}^{\frac{1}{2}}$ is itself a symmetric positive definite matrix
- The multivariate Gaussian has the following basic statistics (mean, covariance, second moment)
$$\begin{aligned} \mathbb{E}_{\pmb{x}}[\pmb{x}] &= \int p(\pmb{x})\,\pmb{x}\,d\pmb{x} = \pmb{\mu} \\ \mathbb{E}_{\pmb{x}}\left[(\pmb{x}-\pmb{\mu})(\pmb{x}-\pmb{\mu})^{\top}\right] &= \int p(\pmb{x})(\pmb{x}-\pmb{\mu})(\pmb{x}-\pmb{\mu})^{\top}\,d\pmb{x} = \pmb{\Sigma} \\ \mathbb{E}_{\pmb{x}}\left[\pmb{x}\pmb{x}^{\top}\right] &= \pmb{\mu}\pmb{\mu}^{\top} + \mathbb{E}_{\pmb{x}}\left[(\pmb{x}-\pmb{\mu})(\pmb{x}-\pmb{\mu})^{\top}\right] = \pmb{\mu}\pmb{\mu}^{\top} + \pmb{\Sigma} \end{aligned} \tag{1.5}$$
where the second moment is the multivariate extension of the familiar scalar identity $\mathbb{E}(X^2) = \mathbb{E}(X)^2 + D(X)$
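Properties (1.4) and (1.5) can be verified numerically. A minimal sketch with an arbitrarily chosen example covariance matrix:

```python
import numpy as np

# An example symmetric positive-definite covariance matrix (arbitrary values)
Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.5, 0.3],
                  [0.2, 0.3, 1.0]])

# Eq. (1.3)/(1.4): spectral decomposition Sigma = U Lam U^T yields the square root
eigvals, U = np.linalg.eigh(Sigma)   # eigh is specialized for symmetric matrices
Sigma_half = U @ np.diag(np.sqrt(eigvals)) @ U.T

print(np.all(eigvals > 0))                           # positive definite
print(np.allclose(Sigma_half @ Sigma_half, Sigma))   # squaring recovers Sigma
print(np.allclose(Sigma_half, Sigma_half.T))         # the square root is symmetric

# Eq. (1.5): E[x x^T] = mu mu^T + Sigma, checked by Monte Carlo sampling
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
xs = rng.multivariate_normal(mu, Sigma, size=200000)
second_moment = xs.T @ xs / len(xs)
print(np.allclose(second_moment, np.outer(mu, mu) + Sigma, atol=0.05))
```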
2. Entropy of a Multivariate Gaussian
- Let $p_i(\pmb{x})=\mathcal{N}(\pmb{\mu}_i,\pmb{\Sigma}_i)$ be the distribution of the raw observations of class $i$, and compute its entropy:
$$\begin{aligned} \mathcal{H}_{p_i} &= \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}[-\log p_i(\pmb{x})] \\ &= \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[\frac{k}{2}\log 2\pi + \frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_i) + \frac{1}{2}(\pmb{x}-\pmb{\mu}_i)^{\top}\pmb{\Sigma}_i^{-1}(\pmb{x}-\pmb{\mu}_i)\right] \\ &= \frac{k}{2}\log 2\pi + \frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_i) + \frac{1}{2}\mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[(\pmb{x}-\pmb{\mu}_i)^{\top}\pmb{\Sigma}_i^{-1}(\pmb{x}-\pmb{\mu}_i)\right] \end{aligned} \tag{2.1}$$
Expanding the last term requires the Mahalanobis (whitening) transform
: for any Gaussian random vector $\pmb{X}\sim\mathcal{N}(\pmb{\mu},\pmb{\Sigma})$, the random vector $\pmb{Y} = \pmb{\Sigma}^{-\frac{1}{2}}(\pmb{X}-\pmb{\mu})$ follows the standard Gaussian distribution, i.e. $\pmb{Y} \sim \mathcal{N}(\pmb{0},\pmb{I}_k)$. Noting that
$$\pmb{x} = \pmb{\Sigma}^{\frac{1}{2}}\pmb{y}+\pmb{\mu} \tag{2.2}$$
we have
$$\begin{aligned} \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[(\pmb{x}-\pmb{\mu}_i)^{\top}\pmb{\Sigma}_i^{-1}(\pmb{x}-\pmb{\mu}_i)\right] &= \mathbb{E}_{\pmb{y}\sim p_i(\pmb{y})}\left[\left(\pmb{\Sigma}_i^{\frac{1}{2}}\pmb{y}\right)^{\top}\pmb{\Sigma}_i^{-1}\left(\pmb{\Sigma}_i^{\frac{1}{2}}\pmb{y}\right)\right] \\ &= \mathbb{E}_{\pmb{y}\sim p_i(\pmb{y})}\left[\pmb{y}^{\top}\pmb{y}\right] \\ &= \sum_{d=1}^k \mathbb{E}[y_d^2] \\ &= k \end{aligned} \tag{2.3}$$
- Putting this together,
$$\begin{aligned} \mathcal{H}_{p_i} &= \frac{k}{2}\log 2\pi + \frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_i) + \frac{1}{2}k \\ &= \frac{k}{2}(1+\log 2\pi) + \frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_i) \end{aligned} \tag{2.4}$$
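Formula (2.4) is easy to cross-check in code. The sketch below (assuming scipy is available; the covariance is one of the example class matrices from the plotting code) compares it against scipy's closed-form differential entropy, both in nats since natural logs are used throughout:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Eq. (2.4): H = k/2 * (1 + log(2*pi)) + 1/2 * log det(Sigma), in nats
def gaussian_entropy(Sigma):
    k = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)  # numerically stable log-determinant
    assert sign > 0                          # covariance must be positive definite
    return 0.5 * k * (1 + np.log(2 * np.pi)) + 0.5 * logdet

Sigma = np.array([[0.75, 0.5],
                  [0.5, 0.75]])  # one of the example class covariances
H = gaussian_entropy(Sigma)

# Cross-check against scipy's closed-form differential entropy
print(np.isclose(H, multivariate_normal(mean=[0, 0], cov=Sigma).entropy()))
```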
3. Relative Entropy (KL Divergence) Between Two Multivariate Gaussians
- Now compute the relative entropy between the raw-observation distributions $p_i(\pmb{x})$ and $p_j(\pmb{x})$ of classes $i$ and $j$:
$$\begin{aligned} D_{KL}(p_i\|p_j) &= \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\log\frac{p_i(\pmb{x})}{p_j(\pmb{x})} \\ &= \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\log p_i(\pmb{x}) + \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}[-\log p_j(\pmb{x})] \\ &= -\mathcal{H}_{p_i} + \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}[-\log p_j(\pmb{x})] \end{aligned} \tag{3.1}$$
Expanding the second term:
$$\begin{aligned} \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}[-\log p_j(\pmb{x})] &= \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[\frac{k}{2}\log 2\pi + \frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_j) + \frac{1}{2}(\pmb{x}-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{x}-\pmb{\mu}_j)\right] \\ &= \frac{k}{2}\log 2\pi + \frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_j) + \frac{1}{2}\mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[(\pmb{x}-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{x}-\pmb{\mu}_j)\right] \end{aligned} \tag{3.2}$$
Further expanding its last term:
$$\begin{aligned} \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[(\pmb{x}-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{x}-\pmb{\mu}_j)\right] &= \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[\operatorname{Tr}\Big((\pmb{x}-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{x}-\pmb{\mu}_j)\Big)\right] \\ &= \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[\operatorname{Tr}\Big(\pmb{\Sigma}_j^{-1}(\pmb{x}-\pmb{\mu}_j)(\pmb{x}-\pmb{\mu}_j)^{\top}\Big)\right] \\ &= \operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[(\pmb{x}-\pmb{\mu}_j)(\pmb{x}-\pmb{\mu}_j)^{\top}\right]\right) \\ &= \operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[\pmb{x}\pmb{x}^{\top} - \pmb{\mu}_j\pmb{x}^{\top} - \pmb{x}\pmb{\mu}_j^{\top} + \pmb{\mu}_j\pmb{\mu}_j^{\top}\right]\right) \\ &= \operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\left(\pmb{\Sigma}_i + \pmb{\mu}_i\pmb{\mu}_i^{\top} - \pmb{\mu}_j\pmb{\mu}_i^{\top} - \pmb{\mu}_i\pmb{\mu}_j^{\top} + \pmb{\mu}_j\pmb{\mu}_j^{\top}\right)\right) \\ &= \operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i + \pmb{\Sigma}_j^{-1}(\pmb{\mu}_i-\pmb{\mu}_j)(\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\right) \\ &= \operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) + \operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}(\pmb{\mu}_i-\pmb{\mu}_j)(\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\right) \\ &= \operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) + (\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{\mu}_i-\pmb{\mu}_j) \end{aligned} \tag{3.3}$$
Note that when $i=j$ this expression equals $k$, and (3.2) reduces to the entropy (2.4)
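The closed form (3.3) can be sanity-checked by Monte Carlo. A minimal sketch with arbitrarily chosen example parameters:

```python
import numpy as np

# Monte Carlo check of eq. (3.3):
#   E_{x~p_i}[(x-mu_j)^T Sj^{-1} (x-mu_j)] = Tr(Sj^{-1} Si) + (mu_i-mu_j)^T Sj^{-1} (mu_i-mu_j)
rng = np.random.default_rng(1)
mu_i, S_i = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 0.8]])
mu_j, S_j = np.array([2.0, -1.0]), np.array([[0.5, 0.1], [0.1, 0.6]])
S_j_inv = np.linalg.inv(S_j)

xs = rng.multivariate_normal(mu_i, S_i, size=500000)
d = xs - mu_j
mc = np.mean(np.einsum('ni,ij,nj->n', d, S_j_inv, d))  # sample mean of the quadratic form

diff = mu_i - mu_j
closed = np.trace(S_j_inv @ S_i) + diff @ S_j_inv @ diff
print(np.isclose(mc, closed, rtol=0.01))
```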
- Putting this together,
$$\begin{aligned} D_{KL}(p_i\|p_j) &= -\mathcal{H}_{p_i} + \mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}[-\log p_j(\pmb{x})] \\ &= -\left(\frac{k}{2}(1+\log 2\pi)+\frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_i)\right) + \left(\frac{k}{2}\log 2\pi + \frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_j) + \frac{1}{2}\mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[(\pmb{x}-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{x}-\pmb{\mu}_j)\right]\right) \\ &= \frac{1}{2}\mathbb{E}_{\pmb{x}\sim p_i(\pmb{x})}\left[(\pmb{x}-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{x}-\pmb{\mu}_j)\right] + \frac{1}{2}\left[\log\operatorname{det}(\pmb{\Sigma}_j) - \log\operatorname{det}(\pmb{\Sigma}_i) - k\right] \\ &= \frac{1}{2}\left[\operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) + (\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{\mu}_i-\pmb{\mu}_j)\right] + \frac{1}{2}\left[\log\frac{\operatorname{det}(\pmb{\Sigma}_j)}{\operatorname{det}(\pmb{\Sigma}_i)} - k\right] \\ &= \frac{1}{2}\left[\operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) + (\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{\mu}_i-\pmb{\mu}_j) - \log\operatorname{det}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) - k\right] \end{aligned} \tag{3.4}$$
- By definition, KL divergence is asymmetric and therefore cannot be used directly as a metric. To obtain a valid measure, for any pair of distributions $p_i,p_j$ compute the KL divergence in both directions and take the sum as their distance, i.e.
$$\begin{aligned} D_{sym}(p_i,p_j) &= D_{KL}(p_i\|p_j) + D_{KL}(p_j\|p_i) \\ &= \frac{1}{2}\left[\operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) + \operatorname{Tr}\left(\pmb{\Sigma}_i^{-1}\pmb{\Sigma}_j\right)\right] + \frac{1}{2}\left[(\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\pmb{\Sigma}_j^{-1}(\pmb{\mu}_i-\pmb{\mu}_j) + (\pmb{\mu}_j-\pmb{\mu}_i)^{\top}\pmb{\Sigma}_i^{-1}(\pmb{\mu}_j-\pmb{\mu}_i)\right] \\ &\quad - \frac{1}{2}\left[\log\operatorname{det}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) + \log\operatorname{det}\left(\pmb{\Sigma}_i^{-1}\pmb{\Sigma}_j\right)\right] - k \\ &= \frac{1}{2}\left[\operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right) + \operatorname{Tr}\left(\pmb{\Sigma}_i^{-1}\pmb{\Sigma}_j\right)\right] + \frac{1}{2}(\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\left(\pmb{\Sigma}_j^{-1}+\pmb{\Sigma}_i^{-1}\right)(\pmb{\mu}_i-\pmb{\mu}_j) - k \end{aligned} \tag{3.5}$$
where the log-determinant terms cancel because $\log\operatorname{det}(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i) = -\log\operatorname{det}(\pmb{\Sigma}_i^{-1}\pmb{\Sigma}_j)$
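Formulas (3.4) and (3.5) translate directly into code. The sketch below (with parameters taken from two of the example classes) checks that the one-directional KL vanishes for identical distributions and is asymmetric in general, while the summed version is symmetric:

```python
import numpy as np

# Eq. (3.4): closed-form KL divergence between N(mu_i, S_i) and N(mu_j, S_j)
def kl_gauss(mu_i, S_i, mu_j, S_j):
    k = len(mu_i)
    S_j_inv = np.linalg.inv(S_j)
    diff = mu_i - mu_j
    _, logdet_i = np.linalg.slogdet(S_i)
    _, logdet_j = np.linalg.slogdet(S_j)
    return 0.5 * (np.trace(S_j_inv @ S_i) + diff @ S_j_inv @ diff
                  - (logdet_i - logdet_j) - k)

# Eq. (3.5): summing both directions gives a symmetric quantity
def kl_sym(mu_i, S_i, mu_j, S_j):
    return kl_gauss(mu_i, S_i, mu_j, S_j) + kl_gauss(mu_j, S_j, mu_i, S_i)

mu1, S1 = np.array([0.0, 0.0]), np.array([[0.35, 0.05], [0.05, 0.35]])
mu2, S2 = np.array([2.0, 3.0]), np.array([[0.75, 0.5], [0.5, 0.75]])

print(np.isclose(kl_gauss(mu1, S1, mu1, S1), 0.0))  # KL(p||p) = 0
print(kl_gauss(mu1, S1, mu2, S2), kl_gauss(mu2, S2, mu1, S1))  # asymmetric in general
print(np.isclose(kl_sym(mu1, S1, mu2, S2), kl_sym(mu2, S2, mu1, S1)))  # symmetric
```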
4. Summary
- Finally, substitute the results (2.4) and (3.5) back into the optimization objectives of Section 0. Introducing a trade-off coefficient $\alpha\in[0,1]$, we maximize
$$\begin{aligned} \max\mathcal{J} &= \max\ (1-\alpha)\sum_{i=1}^n -\mathcal{H}_{p_i} + \alpha\sum_{i\neq j} D_{sym}(p_i,p_j) \\ &= \max\ (\alpha-1)\sum_{i=1}^n \mathcal{H}_{p_i} + \alpha\sum_{i\neq j} D_{sym}(p_i,p_j) \\ &= \max\ (\alpha-1)\sum_{i=1}^n\left(\frac{k}{2}(1+\log 2\pi)+\frac{1}{2}\log\operatorname{det}(\pmb{\Sigma}_i)\right) \\ &\qquad + \alpha\sum_{i\neq j}\left(\frac{1}{2}\left[\operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right)+\operatorname{Tr}\left(\pmb{\Sigma}_i^{-1}\pmb{\Sigma}_j\right)\right] + \frac{1}{2}(\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\left(\pmb{\Sigma}_j^{-1}+\pmb{\Sigma}_i^{-1}\right)(\pmb{\mu}_i-\pmb{\mu}_j) - k\right) \\ &= \max\ (\alpha-1)\sum_{i=1}^n \log\operatorname{det}(\pmb{\Sigma}_i) + \alpha\sum_{i\neq j}\left(\left[\operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right)+\operatorname{Tr}\left(\pmb{\Sigma}_i^{-1}\pmb{\Sigma}_j\right)\right] + (\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\left(\pmb{\Sigma}_j^{-1}+\pmb{\Sigma}_i^{-1}\right)(\pmb{\mu}_i-\pmb{\mu}_j)\right) \end{aligned} \tag{4.1}$$
where the last step drops the constant terms and rescales away the common factor $\frac{1}{2}$, since neither affects the maximizer
When, for every class $i$, the raw-observation distribution $p_i$ has independent components, $\pmb{\Sigma}_i$ is a diagonal matrix; writing $\pmb{\Sigma}_i = \operatorname{diag}(\sigma_{i1},\sigma_{i2},\dots,\sigma_{ik})$ (the diagonal entries being the per-dimension variances), the objective becomes
$$\begin{aligned} \max\mathcal{J} = \max\ &(\alpha-1)\sum_{i=1}^n\sum_{d=1}^k \log\sigma_{id} \\ &+ \alpha\left(\sum_{i\neq j}\sum_{d=1}^k\left(\frac{\sigma_{id}}{\sigma_{jd}}+\frac{\sigma_{jd}}{\sigma_{id}}\right) + \sum_{i\neq j}(\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\left(\pmb{\Sigma}_j^{-1}+\pmb{\Sigma}_i^{-1}\right)(\pmb{\mu}_i-\pmb{\mu}_j)\right) \end{aligned} \tag{4.2}$$
- Intuition:
- $(\alpha-1)\sum_{i=1}^n \log\operatorname{det}(\pmb{\Sigma}_i)$: here $(\alpha-1)\le 0$, so this term wants the determinant of each covariance matrix $\pmb{\Sigma}_i$ to be as small as possible. The determinant of the covariance matrix reflects the overall uncertainty of the distribution: the smaller it is, the more concentrated the data along every dimension
- $\alpha\sum_{i\neq j}\left(\operatorname{Tr}\left(\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i\right)+\operatorname{Tr}\left(\pmb{\Sigma}_i^{-1}\pmb{\Sigma}_j\right)\right)$: the trace of a matrix equals the sum of its eigenvalues, and eigenvalues describe how a linear map stretches space, so the trace measures the total amount of stretching and compression. $\pmb{\Sigma}_j^{-1}\pmb{\Sigma}_i$ can be viewed as applying two linear maps in succession; when $\pmb{\Sigma}_j=\pmb{\Sigma}_i$ the space is left unchanged and this term attains its minimum, whereas the more the two maps differ in how they stretch and compress each dimension (the larger the eigenvalue ratios), the larger the term. This term therefore pushes different distributions to have different scatter characteristics
- $\alpha\sum_{i\neq j}\left((\pmb{\mu}_i-\pmb{\mu}_j)^{\top}\left(\pmb{\Sigma}_j^{-1}+\pmb{\Sigma}_i^{-1}\right)(\pmb{\mu}_i-\pmb{\mu}_j)\right)$: this term captures the difference between the mean vectors, with weights given by the inverse covariance matrices; it measures a variance-weighted distance between the class means
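To make the final objective concrete, here is a minimal sketch evaluating the first line of (4.1) for the four hypothetical example classes from Section 0's plotting code, with an arbitrarily chosen trade-off $\alpha=0.5$:

```python
import numpy as np

# Parameters of the four hypothetical example classes from Section 0's plot
means = [np.array(m, dtype=float) for m in [(0, 0), (2, 3), (4, 2), (6, 5)]]
covs = [np.array(c) for c in [[[0.35, 0.05], [0.05, 0.35]],
                              [[0.75, 0.5], [0.5, 0.75]],
                              [[0.2, 0.1], [0.1, 0.2]],
                              [[1.25, -0.4], [-0.4, 1.25]]]]

def entropy(S):  # eq. (2.4)
    k = S.shape[0]
    return 0.5 * k * (1 + np.log(2 * np.pi)) + 0.5 * np.linalg.slogdet(S)[1]

def d_sym(mu_i, S_i, mu_j, S_j):  # eq. (3.5)
    k = len(mu_i)
    Si_inv, Sj_inv = np.linalg.inv(S_i), np.linalg.inv(S_j)
    diff = mu_i - mu_j
    return (0.5 * (np.trace(Sj_inv @ S_i) + np.trace(Si_inv @ S_j))
            + 0.5 * diff @ (Si_inv + Sj_inv) @ diff - k)

def objective(alpha, means, covs):  # eq. (4.1), first line
    n = len(means)
    intra = sum(entropy(S) for S in covs)
    inter = sum(d_sym(means[i], covs[i], means[j], covs[j])
                for i in range(n) for j in range(n) if i != j)
    return (1 - alpha) * (-intra) + alpha * inter

print(objective(0.5, means, covs))
```

Larger values of `objective` correspond to classes that are individually concentrated and mutually well separated; $D_{sym}$ vanishes for identical distributions, as expected.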