Paper: Variational Bayesian Last Layers (arxiv.org)
Code: GitHub - VectorInstitute/vbll: Simple (and cheap!) neural network uncertainty estimation
The English below is typed by hand, summarizing and paraphrasing the original paper. Some spelling and grammar slips may remain; corrections are welcome in the comments. This post is note-style, so read with that in mind.
1. TL;DR
1.1. Takeaways
(1)Seems broadly applicable
1.2. Paper summary figure
2. Section-by-Section Notes
2.1. Abstract
①Characteristics of the model: sampling-free, single-pass model and loss
②Advantage: plug and play — it can be dropped into standard architectures and training pipelines
2.2. Introduction
①They aim to improve uncertainty quantification in neural networks
②Contributions: proposed variational Bayesian last layers (VBLLs), showed how to parameterize and train the model, outperformed baselines, and released a software package
2.3. Bayesian Last Layer Neural Networks
①"The paper reviews Bayesian last layer models, which maintain a posterior distribution only over the last layer of the neural network." (Meaning: the earlier layers are treated as deterministic point estimates; only the last-layer weights carry a posterior, which is what keeps inference cheap.)
②Inputs $x$ with corresponding outputs $y \in \mathcal{Y}$ (classification), where $\mathcal{Y}$ is the set of one-hot labels
③Neural network: $\hat{y} = w^\top \phi(x; \eta)$, where $\phi(\cdot;\eta)$ is the feature map formed by all earlier layers;
the $w$ is actually the weight of the last layer of the neural network
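To make the split concrete, here is a minimal NumPy sketch (my own illustration of the notation above, not code from the paper or the vbll package): a deterministic feature map $\phi(x;\eta)$, with a Gaussian distribution kept only over the last-layer weights $w$.

```python
import numpy as np

# Minimal sketch of the BLL decomposition (illustrative only): every layer up
# to the last produces deterministic features phi(x; eta); only the last-layer
# weights w carry a (Gaussian) distribution.

rng = np.random.default_rng(0)

def phi(x, eta):
    """Deterministic feature map: a one-hidden-layer MLP with point weights eta."""
    W1, b1 = eta
    return np.tanh(x @ W1 + b1)               # phi(x) in R^{N_phi}

N_x, N_phi = 2, 8
eta = (rng.normal(size=(N_x, N_phi)), np.zeros(N_phi))

w_bar = rng.normal(size=N_phi)                # mean of the last-layer weights
S = 0.1 * np.eye(N_phi)                       # covariance of the last-layer weights

x = rng.normal(size=(5, N_x))
features = phi(x, eta)                        # deterministic: no sampling here
print(features @ w_bar)                       # mean predictions; uncertainty lives in (w_bar, S)
```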
2.3.1. Regression
①Traditional Bayesian last layer (BLL) regression model: $y = w^\top \phi(x; \eta) + \varepsilon$,
where $\varepsilon \sim \mathcal{N}(0, \Sigma)$ is i.i.d. Gaussian noise
②Assuming a Gaussian prior on the last-layer weights: $w \sim \mathcal{N}(\bar{w}, S)$
③Predictive distribution: $p(y \mid x, \eta) = \mathcal{N}\big(\bar{w}^\top \phi,\; \phi^\top S \phi + \Sigma\big)$,
where $\eta$ denotes the feature (network) parameters and $(\bar{w}, S)$ are the parameters of the weight distribution
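As a quick numerical check of the predictive formula above, a hedged NumPy sketch (scalar output; variable names are mine):

```python
import numpy as np

# Sketch of the BLL regression predictive for one input:
# p(y | x) = N( w_bar^T phi, phi^T S phi + Sigma ).

def predictive(phi_x, w_bar, S, Sigma):
    """Predictive mean and variance for a single feature vector phi_x."""
    mean = w_bar @ phi_x
    var = phi_x @ S @ phi_x + Sigma           # epistemic (S) + aleatoric (Sigma)
    return mean, var

phi_x = np.array([0.5, -1.0, 0.3])
w_bar = np.array([1.0, 0.2, -0.7])
S = 0.1 * np.eye(3)
Sigma = 0.25                                  # observation-noise variance

print(predictive(phi_x, w_bar, S, Sigma))
```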
2.3.2. Discriminative Classification
①The specific BLL classification model: $p(y \mid x, W, \eta) = \mathrm{softmax}(W \phi(x; \eta))$, a softmax over last-layer logits
②Unnormalized joint data-label log likelihoods: each class score $w_k^\top \phi(x)$ plays the role of $\log p(x, y{=}k)$ up to $\log Z$,
where $Z$ is a normalizing constant
2.3.3. Generative Classification
①"Placing a Normal prior on the means of these feature distributions and a (conjugate) Dirichlet prior on class probabilities, we have priors and likelihoods (top line and bottom line respectively) of the form":
where is the prior mean, denotes the covariance over
②Distribution of parameters:
③Marginalization analysis:
where
④Prediction by Bayes' rule:
where
where is a constant shared by all the classes, and it can be ignored in that the shift-invariance of the softmax
⑤Figure: visualization of the 95% predictive credible regions
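A small sketch of ④ in NumPy/SciPy (my toy numbers, assuming the Gaussian class-conditionals above): the class posterior is a softmax over per-class log-joints, and shifting all entries by a shared constant leaves it unchanged.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Generative-classification prediction via Bayes' rule:
# p(y=k | x) = softmax_k( log p(phi | y=k) + log p(y=k) ); any constant shared
# across classes cancels because the softmax is shift-invariant.

def softmax(z):
    z = z - z.max()                           # shift-invariance makes this safe
    e = np.exp(z)
    return e / e.sum()

phi_x = np.array([0.2, 1.1])
mus = [np.array([0.0, 1.0]), np.array([1.0, -1.0])]   # per-class feature means
Sigma = np.eye(2)                                     # shared feature covariance
log_prior = np.log(np.array([0.6, 0.4]))              # class probabilities

log_joint = np.array([multivariate_normal.logpdf(phi_x, m, Sigma) for m in mus]) + log_prior
print(softmax(log_joint))                     # equals softmax(log_joint + any constant)
```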
2.3.4. Inference and Training in BLL Models
①BLL models can be trained by gradient descent on the (log) marginal likelihood, $\eta^* = \arg\max_\eta \log p(Y \mid X, \eta)$ (a sketch of the exact conjugate update follows below);
however, optimizing the full marginal likelihood may bring substantial over-concentration of the approximate posterior
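For reference, exact last-layer inference in the regression case is just conjugate Bayesian linear regression on frozen features; a minimal NumPy sketch (zero prior mean assumed, my notation):

```python
import numpy as np

# Exact conjugate posterior over the last-layer weights given features Phi and
# targets y (standard Bayesian linear regression; illustrative only).

def posterior(Phi, y, S0, sigma2):
    """Gaussian posterior N(w_bar, S) over w, with prior N(0, S0) and noise var sigma2."""
    S = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma2)
    w_bar = S @ (Phi.T @ y) / sigma2
    return w_bar, S

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 3))                # frozen features for 50 datapoints
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true + 0.1 * rng.normal(size=50)

w_bar, S = posterior(Phi, y, np.eye(3), sigma2=0.01)
print(np.round(w_bar, 2))                     # recovers something close to w_true
```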
2.4. Sampling-Free Variational Inference for BLL Networks
①To approximate the marginal likelihood, they develop bounds of the form:
$\log p(Y \mid X, \eta) \geq \mathbb{E}_{q(\xi \mid \theta)}[\log p(Y \mid X, \xi, \eta)] - \mathrm{KL}\big(q(\xi \mid \theta) \,\|\, p(\xi)\big)$,
where $\xi$ is the parameter in the last layer and $q(\xi \mid \theta)$ is the approximating posterior
2.4.1. Regression
①When $q(w \mid \theta) = \mathcal{N}(\bar{w}, S)$ is the variational posterior, the bound becomes a Gaussian log-likelihood at the mean prediction $\bar{w}^\top \phi_t$, minus a variance penalty involving $\phi_t^\top S \phi_t$, minus the KL term (see the sketch below);
when the variational posterior matches the exact posterior and the distributional assumptions are satisfied, the lower bound is tight
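A hedged per-datapoint sketch of this regression objective in NumPy (scalar output; my simplification of the bound: mean-prediction log-likelihood, a variance penalty, and a Gaussian KL amortized over the $T$ datapoints):

```python
import numpy as np

# Negative of a (simplified) regression VBLL bound for one datapoint.

def gauss_kl(w_bar, S, prior_var, n):
    """KL( N(w_bar, S) || N(0, prior_var * I) ) for an n-dimensional Gaussian."""
    return 0.5 * (np.trace(S) / prior_var + w_bar @ w_bar / prior_var
                  - n + n * np.log(prior_var) - np.linalg.slogdet(S)[1])

def neg_bound(phi_t, y_t, w_bar, S, sigma2, prior_var, T):
    mean = w_bar @ phi_t
    log_lik = -0.5 * (np.log(2 * np.pi * sigma2) + (y_t - mean) ** 2 / sigma2)
    var_penalty = phi_t @ S @ phi_t / (2 * sigma2)   # the extra term vs. plain MLE
    kl = gauss_kl(w_bar, S, prior_var, len(w_bar)) / T
    return -(log_lik - var_penalty) + kl             # minimize this negative bound

phi_t = np.array([0.5, -1.0, 0.3])
print(neg_bound(phi_t, y_t=0.7, w_bar=np.zeros(3), S=0.1 * np.eye(3),
                sigma2=0.25, prior_var=1.0, T=100))
```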
2.4.2. Discriminative Classification
①When $q(W \mid \theta)$ is a Gaussian variational posterior over the rows $w_k$, the expected log-softmax is bounded below by
$\bar{w}_{y_t}^\top \phi_t - \mathrm{LSE}_k\big(\bar{w}_k^\top \phi_t + \tfrac{1}{2}\phi_t^\top S_k \phi_t\big)$ minus a KL term,
where LSE is the log-sum-exp function, $\phi_t$ are the features, $(\bar{w}_k, S_k)$ are the posterior parameters, and $q$ is the variational posterior. And the bound is the standard ELBO
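The key step can be checked numerically: for Gaussian logits, $\mathbb{E}[\mathrm{LSE}(z)] \leq \mathrm{LSE}_k(\mathbb{E}[z_k] + \tfrac{1}{2}\mathrm{Var}[z_k])$ by Jensen's inequality and the Gaussian moment generating function. A hedged NumPy/SciPy sketch of the resulting per-datapoint bound (KL term omitted; names are mine):

```python
import numpy as np
from scipy.special import logsumexp

# Sampling-free lower bound on E_q[log softmax_{y_t}(W phi_t)] when each row
# w_k has a Gaussian posterior N(w_bar_k, S_k): the logits z_k = w_k^T phi_t
# are Gaussian, and E[LSE(z)] <= LSE(mean_k + var_k / 2).

def disc_bound(phi_t, y_t, W_bar, S_list):
    means = W_bar @ phi_t                                  # E[z_k] per class
    vars_ = np.array([phi_t @ S @ phi_t for S in S_list])  # Var[z_k] per class
    return means[y_t] - logsumexp(means + 0.5 * vars_)

phi_t = np.array([0.5, -1.0, 0.3])
W_bar = np.array([[1.0, 0.0, 0.2],
                  [-0.5, 0.3, 0.1]])                       # 2 classes x 3 features
S_list = [0.1 * np.eye(3), 0.1 * np.eye(3)]
print(disc_bound(phi_t, y_t=0, W_bar=W_bar, S_list=S_list))
```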
2.4.3. Generative Classification
①When $q(\mu)$ is the variational posterior over the class means, the bound uses the exact Dirichlet posterior over class probabilities,
where $\alpha$ denotes the Dirichlet posterior concentration parameters, $\psi(\cdot)$ is the digamma function, and $\mathbb{E}[\log \rho_k] = \psi(\alpha_k) - \psi(\sum_j \alpha_j)$. All constant terms will vanish in gradient computation. The bound is an ELBO
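The digamma term is easy to reproduce: under a Dirichlet posterior $\mathrm{Dir}(\alpha)$, $\mathbb{E}[\log \rho_k] = \psi(\alpha_k) - \psi(\sum_j \alpha_j)$ replaces $\log \rho_k$ without any sampling. A tiny SciPy check (hypothetical concentrations):

```python
import numpy as np
from scipy.special import digamma

# E[log rho_k] under rho ~ Dir(alpha), used sampling-free inside the bound.
alpha = np.array([2.0, 5.0, 1.0])             # hypothetical posterior concentrations
e_log_rho = digamma(alpha) - digamma(alpha.sum())
print(e_log_rho)
```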
2.4.4. Training VBLL Models
(1)Full training
①Training goal: jointly optimize the feature weights $\eta$ and the variational head parameters $\theta$ by gradient descent on the (regularized) bound, e.g. with an isotropic Gaussian prior on the weights (a sketch follows below)
isotropic: adj., the same in every direction (here, a covariance matrix proportional to the identity)
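A hedged PyTorch sketch of full training (my own minimal setup, not the paper's or the vbll package's code: diagonal posterior covariance, scalar-output regression bound as above):

```python
import torch
import torch.nn as nn

# Full training: backbone (eta) and variational head (theta = w_bar, log_s)
# are optimized jointly by gradient descent on the single-pass bound.

backbone = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 8))
w_bar = nn.Parameter(torch.zeros(8))          # posterior mean of w
log_s = nn.Parameter(torch.zeros(8))          # diagonal posterior log-variances
sigma2, prior_var, T = 0.25, 1.0, 1000        # noise var, isotropic prior var, dataset size

opt = torch.optim.Adam(list(backbone.parameters()) + [w_bar, log_s], lr=1e-3)

def loss_fn(x, y):
    phi = backbone(x)                          # (B, 8) features, single pass
    s = log_s.exp()
    mean = phi @ w_bar
    log_lik = -0.5 * (torch.log(torch.tensor(2 * torch.pi * sigma2))
                      + (y - mean) ** 2 / sigma2)
    var_pen = (phi ** 2 * s).sum(-1) / (2 * sigma2)        # phi^T S phi with diagonal S
    kl = 0.5 * ((s + w_bar ** 2) / prior_var - 1
                + torch.log(torch.tensor(prior_var)) - log_s).sum()
    return -(log_lik - var_pen).mean() + kl / T            # negative bound

x, y = torch.randn(16, 2), torch.randn(16)
opt.zero_grad(); loss_fn(x, y).backward(); opt.step()      # one training step
```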
(2)Post-training
①Post-training differs from full training: the features are first trained with a standard loss, then frozen, and only the variational last layer is fit on top (see the sketch below)
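A short sketch of the post-training recipe (same hypothetical setup as the full-training sketch above):

```python
import torch
import torch.nn as nn

# Post-training: the backbone is pre-trained with an ordinary loss, then
# frozen; only the variational head is fit on the frozen features.

backbone = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 8))
# ... ordinary pre-training (e.g., MSE on point predictions) happens here ...
for p in backbone.parameters():
    p.requires_grad_(False)                   # freeze the feature map eta

w_bar = torch.zeros(8, requires_grad=True)    # head: posterior mean
log_s = torch.zeros(8, requires_grad=True)    # head: diagonal log-variances
head_opt = torch.optim.Adam([w_bar, log_s], lr=1e-2)
# The head is then trained with the same bound as in full training.
```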
(3)Feature uncertainty
①Combining SVI and variational feature learning: a variational posterior can also be placed over the feature weights $\eta$ and trained with stochastic variational inference (SVI)
②Collapse this expectation: the inner expectation over the last layer can still be collapsed into the same sampling-free bound, so only the features require sampling
2.4.5. Prediction with VBLL Models
①For the classification task: the predictive is obtained by integrating the softmax over the head posterior, which is cheap because the features need only a single pass (see the sketch below)
②For generative classification or regression: conjugacy gives closed-form posterior predictive distributions
conjugacy: n., the property that the posterior stays in the same family as the prior
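A hedged NumPy sketch of classification prediction (my toy numbers): the backbone runs once, and the cheap Gaussian head posterior can then be sampled, or integrated, without re-running the network.

```python
import numpy as np

# Prediction with a variational head: one forward pass for the features, then
# head-only sampling of the logits z_k ~ N(w_bar_k^T phi, phi^T S_k phi).

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

phi_x = np.array([0.5, -1.0, 0.3])                    # computed in a single pass
W_bar = np.array([[1.0, 0.0, 0.2],
                  [-0.5, 0.3, 0.1]])
S_diag = 0.1 * np.ones((2, 3))                        # diagonal row covariances

means = W_bar @ phi_x
stds = np.sqrt((phi_x ** 2 * S_diag).sum(-1))
logits = means + stds * rng.normal(size=(1000, 2))    # 1000 head samples, no backbone cost
print(softmax(logits).mean(0))                        # posterior predictive class probs
```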
2.5. Related Work and Discussion
①Reviews the development of Bayesian neural networks and related approaches to uncertainty estimation
2.6. Experiments
2.6.1. Regression
①Comparison table across different regression datasets:
2.6.2. Image Classification
①Comparison table on CIFAR-10 and CIFAR-100:
2.6.3. Sentiment Classification with LLM Features
①Comparison of G-VBLL, D-VBLL, and MLP on the IMDB sentiment classification dataset:
2.6.4. Wheel Bandit
①Wheel bandit cumulative regret (figure in the paper)
②Wheel bandit simple regret (figure in the paper)
2.7. Conclusions and Future Work
VBLL is a general-purpose module that can be dropped into standard networks
3. Background Notes
3.1. Sampling-free
"Sampling-free" usually means a method does not need to draw samples to compute its result; the required quantities are computed in closed form. In this paper it means no Monte Carlo sampling (e.g., of weights) is needed to train or evaluate the uncertainty estimates, which is both cheaper and lower-variance.
3.2. Single pass
"Single pass" usually means the data, or here the network, is traversed only once per computation.
In data processing and algorithm design, single-pass methods are used to improve performance and reduce cost: one traversal handles large amounts of data faster and with less memory. In this paper it means one forward pass of the network suffices to compute both the training loss and the predictive uncertainty, in contrast to methods that need many stochastic forward passes; the toy comparison below illustrates the idea.
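A toy NumPy contrast of the two terms (my example): a sampling-based estimator of the BLL predictive variance versus the single-pass closed form; both target the same quantity.

```python
import numpy as np

# Sampling-based vs. sampling-free estimates of the same predictive variance.

rng = np.random.default_rng(0)
phi_x = np.array([0.5, -1.0, 0.3])
w_bar = np.array([1.0, 0.2, -0.7])
S, sigma2 = 0.1 * np.eye(3), 0.25

# Sampling-based: draw many last-layer weights ("many passes").
w_samples = rng.multivariate_normal(w_bar, S, size=10_000)
mc_var = (w_samples @ phi_x).var() + sigma2

# Sampling-free, single pass: closed form from one evaluation.
closed_var = phi_x @ S @ phi_x + sigma2

print(mc_var, closed_var)                     # the two estimates agree closely
```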
4. Reference List
Harrison, J., Willes, J., & Snoek, J. (2024) 'Variational Bayesian Last Layers', ICLR. doi: https://doi.org/10.48550/arXiv.2404.11599