Linear independent component analysis (ICA)
$$x_i(k) = \sum_{j=1}^{n} a_{ij}s_j(k) \quad \text{for all } i = 1 \ldots n,\ k = 1 \ldots K$$
- $x_i(k)$ is the $i$-th observed signal at sample point $k$ (possibly time)
- $a_{ij}$ are constant parameters describing the “mixing”
- Assuming independent, non-Gaussian latent “sources” $s_j$
- ICA is identifiable, i.e. well-defined: observing only $x_i$, we can recover both $a_{ij}$ and $s_j$.
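The mixing model and its identifiability can be tried out numerically. The following is a minimal sketch, not the method from the talk: it assumes two hypothetical sources (uniform and Laplace), a made-up mixing matrix `A`, and a hand-written symmetric FastICA with a tanh nonlinearity. The helper names `whiten` and `fast_ica` are illustrative only.

```python
import numpy as np

def whiten(x):
    """Center x (n x K) and transform it so that its covariance is the identity."""
    x = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(x))
    return (eigvec / np.sqrt(eigval)) @ eigvec.T @ x

def fast_ica(x, n_iter=200, seed=0):
    """Symmetric FastICA (tanh nonlinearity); returns estimated sources (n x K)."""
    z = whiten(x)
    n, K = z.shape
    rng = np.random.default_rng(seed)
    w = np.linalg.qr(rng.normal(size=(n, n)))[0]  # random orthogonal start
    for _ in range(n_iter):
        y = w @ z
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
        # Fixed-point update for every row w_i: E[z g(w_i z)] - E[g'(w_i z)] w_i
        w_new = g @ z.T / K - g_prime.mean(axis=1, keepdims=True) * w
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W, computed via SVD
        u, _, vt = np.linalg.svd(w_new)
        w = u @ vt
    return w @ z

rng = np.random.default_rng(0)
K = 5000
s = np.vstack([rng.uniform(-1, 1, K), rng.laplace(size=K)])  # independent, non-Gaussian
A = np.array([[1.0, 0.5], [0.3, 1.0]])                       # hypothetical mixing matrix
x = A @ s                                                    # x_i(k) = sum_j a_ij s_j(k)

y = fast_ica(x)
# Absolute correlations between true and estimated sources
match = np.abs(np.corrcoef(s, y)[:2, 2:])
print(match.round(3))
```

Each true source should correlate strongly with exactly one estimated component. The order and sign of the components are not determined, which is the usual permutation and scaling ambiguity of ICA.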
Fundamental difference between ICA and PCA
- PCA doesn’t find the original coordinates, ICA does.
- PCA and Gaussian factor analysis are not identifiable:
- Any orthogonal rotation is equivalent: $s' = Us$ has the same distribution.
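The rotation argument can be checked numerically. The sketch below assumes a 45-degree rotation `U` and compares white Gaussian sources with unit-variance uniform sources; excess kurtosis is used here only as a convenient illustrative measure of non-Gaussianity, and the helper `excess_kurtosis` is not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 200_000
theta = np.pi / 4
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal rotation

def excess_kurtosis(y):
    """Excess kurtosis of each row of y (zero for a Gaussian)."""
    y = (y - y.mean(axis=1, keepdims=True)) / y.std(axis=1, keepdims=True)
    return (y ** 4).mean(axis=1) - 3.0

s_gauss = rng.normal(size=(2, K))                            # white Gaussian
s_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, K))   # unit variance, non-Gaussian

for name, s in [("gaussian", s_gauss), ("uniform", s_unif)]:
    s_rot = U @ s
    print(name)
    print("  covariance after rotation:", np.cov(s_rot).round(2).tolist())
    print("  excess kurtosis before:   ", excess_kurtosis(s).round(2))
    print("  excess kurtosis after:    ", excess_kurtosis(s_rot).round(2))
```

Both rotated data sets stay white, so second-order statistics cannot tell the rotation apart. For the Gaussian case nothing else changes either. For the uniform case the marginals change: in theory the excess kurtosis moves from $-1.2$ to $-0.6$ under this rotation, which is the kind of information ICA exploits.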
Nonlinear ICA is an unsolved problem
- Extend ICA to the nonlinear case to get general disentanglement?
- Unfortunately, “basic” nonlinear ICA is not identifiable:
- If we define the nonlinear ICA model for random variables $x_i$ as
  $$x_i = f_i(s_1, \ldots, s_n), \quad i = 1 \ldots n$$
  we cannot recover the original sources (Darmois, 1952; Hyvärinen & Pajunen, 1999)
Darmois construction
- Darmois (1952) showed the impossibility of nonlinear ICA:
- For any $x_1, x_2$, we can always construct $y = g(x_1, x_2)$ independent of $x_1$ as
  $$g(\xi_1, \xi_2) = P(x_2 < \xi_2 \mid x_1 = \xi_1)$$
- Independence alone is too weak for identifiability:
  - We could take $x_1$ as an independent component, which is absurd
- Looking at non-Gaussianity is equally absurd:
  - Scalar transform $h(x_1)$ can give any distribution
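A small numerical illustration of the Darmois construction. This is a sketch under a toy model chosen only because its conditional CDF has a closed form: $x_1$ is uniform on $[1, 2]$ and $x_2$ given $x_1$ is exponential with mean $x_1$. Neither distribution comes from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100_000

# Toy dependent pair: x2 | x1 is exponential with mean x1.
x1 = rng.uniform(1.0, 2.0, size=K)
x2 = rng.exponential(scale=x1)

def g(xi1, xi2):
    """Conditional CDF P(x2 < xi2 | x1 = xi1) for the toy model above."""
    return 1.0 - np.exp(-xi2 / xi1)

y = g(x1, x2)

corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]  # clearly nonzero: x2 depends on x1
corr_x1_y = np.corrcoef(x1, y)[0, 1]    # near zero: y is independent of x1
print("corr(x1, x2):", round(corr_x1_x2, 3))
print("corr(x1, y): ", round(corr_x1_y, 3))
print("mean of y:   ", round(y.mean(), 3))  # y is uniform on [0, 1]
```

In this toy model $y$ is uniform on $[0, 1]$ whatever the value of $x_1$, so $(x_1, y)$ is a pair of independent components even though $x_1$ is just one of the observed variables. Correlation is only a weak check of independence, but it is enough to show the effect.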
Time-contrastive learning
- Observe $n$-dim time series $x(t)$
- Divide $x(t)$ into $T$ segments (e.g., bins with equal sizes)
- Train MLP to tell which segment a single data point comes from
- Number of classes is $T$
- Labels given by index of segment
- Multinomial logistic regression
- In hidden layer $h$, the NN should learn to represent nonstationarity (= differences between segments)
- Could this really do Nonlinear ICA?
- Assume data follows the nonlinear ICA model $x(t) = f(s(t))$ with
  - smooth, invertible nonlinear mixing $f : \mathbb{R}^n \rightarrow \mathbb{R}^n$
  - components $s_i(t)$ that are nonstationary, e.g., in variances
- Assume we apply time-contrastive learning on $x(t)$
  - using an MLP with hidden layer $h(x(t))$ with $\text{dim}(h) = \text{dim}(x)$
- Then, TCL will find $s(t)^2 = Ah(x(t))$ for some linear mixing matrix $A$. (Squaring is element-wise)
- I.e.: TCL demixes the nonlinear ICA model up to linear mixing (which can be estimated by linear ICA) and up to squaring.
- This is a constructive proof of identifiability
- Imposing independence at every segment -> more constraints -> unique solution. The added constraints are what guarantee identifiability.
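In short: an MLP is trained by self-supervised classification (which time segment does a given data point come from?). In this way the MLP learns to represent the differences between the signals in different time segments. The squared original sources $s^2$ can then be expressed as a linear combination of the MLP hidden-layer features computed from the observations $x$.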
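The training setup of time-contrastive learning can be sketched in plain NumPy. Everything concrete below is an assumption made for illustration: the segment count, the per-segment scales, the mixing function, the network size, and the use of full-batch gradient descent. It only shows the data layout and the segment-classification objective; it is far too small and short-trained to demonstrate the recovery of $s(t)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, seg_len = 2, 8, 500                  # sources, segments, points per segment

# Nonstationary sources: every segment has its own standard deviations.
scales = rng.uniform(0.2, 2.0, size=(T, n))
labels = np.repeat(np.arange(T), seg_len)  # segment index = class label
s = rng.normal(size=(T * seg_len, n)) * scales[labels]

# Hypothetical smooth invertible mixing f: linear map, then u + 0.5*tanh(u).
A = np.array([[1.0, 0.6], [-0.4, 1.0]])
lin = s @ A.T
x = lin + 0.5 * np.tanh(lin)
x = (x - x.mean(axis=0)) / x.std(axis=0)

# MLP: x -> tanh hidden layer -> feature layer h (dim n) -> softmax over T classes
hidden = 16
W1 = rng.normal(scale=0.5, size=(n, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.5, size=(hidden, n)); b2 = np.zeros(n)
W3 = rng.normal(scale=0.5, size=(n, T));      b3 = np.zeros(T)

def forward(x):
    a1 = np.tanh(x @ W1 + b1)
    h = a1 @ W2 + b2                        # feature layer, dim(h) = dim(x)
    logits = h @ W3 + b3                    # multinomial logistic regression on h
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return a1, h, p

onehot = np.eye(T)[labels]
lr, losses = 0.1, []
for step in range(300):
    a1, h, p = forward(x)
    losses.append(-np.log(p[np.arange(len(labels)), labels]).mean())
    d_logits = (p - onehot) / len(labels)   # gradient of the mean cross-entropy
    d_h = d_logits @ W3.T
    d_z1 = (d_h @ W2.T) * (1.0 - a1 ** 2)   # backprop through tanh
    W3 -= lr * (h.T @ d_logits); b3 -= lr * d_logits.sum(axis=0)
    W2 -= lr * (a1.T @ d_h);     b2 -= lr * d_h.sum(axis=0)
    W1 -= lr * (x.T @ d_z1);     b1 -= lr * d_z1.sum(axis=0)

print("loss at start:", round(losses[0], 3), "loss at end:", round(losses[-1], 3))
```

In a fuller experiment one would train a larger network to convergence with a standard optimizer and then run linear ICA on the features `h` to undo the remaining linear mixing, as described above.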
Deep Latent Variable Models
- General framework with observed data vector $x$ and latent $s$:
  $$p(x, s) = p(x \mid s)p(s), \quad p(x) = \int p(x, s)\,ds$$
  where the densities depend on a vector of parameters $\theta$ (left implicit here), e.g., the weights of a neural network
- In variational autoencoders (VAE):
  - Define prior so that $s$ is white Gaussian (thus the $s_i$ are all independent)
  - Define the conditional distribution $p(x \mid s)$ so that $x = f(s) + n$
- Looks like nonlinear ICA, but not identifiable
  - By Gaussianity, any orthogonal rotation is equivalent:
    $$s' = Ms \text{ has exactly the same distribution if } M^TM = I$$
Conditioning DLVM’s by another variable
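The problem is solved by introducing an additional variable $u$. For example, when looking for the relationship between video and audio, time $t$ can serve as the auxiliary variable. The solution then relies on the sources being conditionally independent given $u$.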
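One common way to write this assumption down, given here as a sketch in the spirit of the iVAE model mentioned in the conclusion (the factorized form and the symbols $p_i$, $f$ and $n$ are assumptions not spelled out in these notes):

$$p(s \mid u) = \prod_{i=1}^{n} p_i(s_i \mid u), \qquad x = f(s) + n$$

The sources are independent only after conditioning on $u$. Because the prior now changes with $u$, a single orthogonal rotation generally cannot leave all the conditional distributions unchanged, which is what the Gaussian rotation argument for the plain VAE relied on.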
Conclusion
- Typical deep learning needs class labels, or some targets
- If no class labels: unsupervised learning
- Independent component analysis is a principled approach
  - can be made nonlinear
- Identifiable: Can recover components that actually created the data (unlike PCA, VAE etc)
- Special assumptions needed for identifiability, one of:
  - Nonstationarity (“time-contrastive learning”)
  - Temporal dependencies (“permutation-contrastive learning”)
  - Existence of auxiliary (conditioning) variable (e.g., “iVAE”)
- Self-supervised methods are easy to implement
- Connection to DLVM’s can be made → iVAE
- Principled framework for “disentanglement”
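In summary, linear ICA is identifiable, whereas nonlinear ICA needs additional assumptions to become identifiable (i.e., for the original sources to be separable). The ideas behind nonlinear ICA can also be applied to other deep learning models.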
Reference
- https://www.youtube.com/watch?v=_cBLSNRWt8c