1. Purpose
Replace the U-Net backbone in diffusion models with a transformer to improve generation quality.
2. Method
Diffusion Transformers (DiTs)
1) Architecture
Latent Diffusion Models (LDMs)
-> a Transformer (Vision Transformer, ViT) based DDPM operating in latent space
-> an off-the-shelf convolutional VAE encodes images into (and decodes from) that latent space
2)patchify
-> converts the spatial latent input into a sequence of T tokens (T = (I/p)^2 for latent size I and patch size p)
-> linearly embeds each patch in the input to dimension d
-> adds ViT frequency-based positional embeddings (the sine-cosine version) to all tokens
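The patchify step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code: `W_embed` and `pos_embed` stand in for the learned linear projection and the fixed sine-cosine table, and all names/shapes are assumptions.

```python
import numpy as np

def patchify(x, p, W_embed, pos_embed):
    """Split a latent (C, H, W) into non-overlapping p×p patches and
    linearly embed each into a d-dim token, then add positional embeddings.
    W_embed: (p*p*C, d) projection; pos_embed: (T, d) sine-cosine table.
    Illustrative sketch; names and shapes are assumptions."""
    C, H, W = x.shape
    hp, wp = H // p, W // p                        # patch grid, T = hp * wp
    # (C, hp, p, wp, p) -> (hp, wp, p, p, C) -> (T, p*p*C)
    patches = (x.reshape(C, hp, p, wp, p)
                 .transpose(1, 3, 2, 4, 0)
                 .reshape(hp * wp, -1))
    return patches @ W_embed + pos_embed           # (T, d) token sequence
```

Halving p quadruples T, which is why Gflops rise sharply as patch size shrinks even though the parameter count barely changes.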
3) Conditioning mechanisms
-> In-context conditioning
negligible Gflops
-> Cross-attention block
most Gflops
-> Adaptive layer norm (adaLN) block
least Gflops
-> adaLN-Zero block
in addition to regressing scale (γ) and shift (β),
regress dimension-wise scaling parameters (α) that are applied immediately prior to any residual connections within the DiT block
initialize the MLP to output the zero-vector for all α, so each DiT block is the identity function at initialization
negligible Gflops
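The adaLN-Zero mechanism can be sketched with one residual branch. This is a hypothetical NumPy reduction (a single sublayer, an identity stand-in for attention/MLP, illustrative names); the real block applies it around both the attention and MLP sublayers.

```python
import numpy as np

def adaln_zero_block(h, c, W_mlp, b_mlp):
    """One DiT residual branch with adaLN-Zero modulation.
    h: (T, d) tokens; c: (d,) conditioning vector (timestep + class embed).
    An MLP on c regresses shift (beta), scale (gamma), and a dimension-wise
    gate (alpha) applied just before the residual connection.
    Illustrative sketch; names and shapes are assumptions."""
    beta, gamma, alpha = np.split(c @ W_mlp + b_mlp, 3)
    # LayerNorm without learnable affine, then adaptive scale/shift
    mu = h.mean(axis=1, keepdims=True)
    sigma = h.std(axis=1, keepdims=True) + 1e-6
    h_mod = (h - mu) / sigma * (1 + gamma) + beta
    sublayer = h_mod             # stand-in for the attention/MLP sublayer
    return h + alpha * sublayer  # alpha == 0 at init -> block is identity
```

With `W_mlp` and `b_mlp` zero-initialized, α = 0 and the block passes `h` through unchanged, which is the "residual blocks as identity at init" trick the paper credits for training stability.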
4)Transformer decoder
linearly decode each token into a p×p×2C tensor (predicted noise and predicted diagonal covariance)
rearrange the decoded tokens into their original spatial layout
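The decoder step is the inverse of patchify. A minimal NumPy sketch, with `W_dec` standing in for the final linear layer and all names/shapes assumed:

```python
import numpy as np

def decode_tokens(tokens, p, C, hp, wp, W_dec):
    """Linearly decode each of T = hp*wp tokens into a p×p×2C tensor
    (predicted noise and diagonal covariance), then rearrange back into
    the latent's spatial layout. Illustrative sketch, not the paper's code."""
    out = tokens @ W_dec                    # (T, p*p*2C)
    out = out.reshape(hp, wp, p, p, 2 * C)  # restore the patch grid
    out = out.transpose(4, 0, 2, 1, 3).reshape(2 * C, hp * p, wp * p)
    noise, cov = np.split(out, 2, axis=0)   # (C, H, W) each
    return noise, cov
```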
3. Experiments
1) Increasing model size (deeper and wider transformers) and decreasing patch size both improve sample quality
2)Gflops, rather than parameter counts, determines the quality of a DiT model
3)larger DiT models are more compute-efficient