1 Title
Zero-Shot Text-to-Image Generation (Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever)
2 Conclusion
This study describes a simple approach to text-to-image generation based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, the approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
3 Good Sentences
1. However, using pixels directly as image tokens would require an inordinate amount of memory for high-resolution images. Likelihood objectives tend to prioritize modeling short-range dependencies between pixels, so much of the modeling capacity would be spent capturing high-frequency details instead of the low-frequency structure that makes objects visually recognizable to us.
2. As the model is made deeper and wider, the true exponents of the activation gradients for later resblocks can fall below the minimum exponent of the 16-bit format. Consequently, they get rounded to zero, a phenomenon called underflow. We found that eliminating underflow allowed for stable training to convergence. (This is the improvement the study makes; a short illustration follows after this list.)
3. Training the transformer on the tokens from the dVAE encoder allows us to allocate its modeling capacity to the low-frequency information that makes images visually recognizable to us. However, it also disadvantages the model, since the heavy compression renders it unable to produce high-frequency details. (This is a limitation of the work.)
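Item 2 above concerns underflow in 16-bit training; the paper's actual remedy is per-resblock gradient scaling. As a rough illustration only (not the paper's implementation), the sketch below shows how a small gradient value underflows to zero in float16, and how a single global loss scale, the standard mixed-precision workaround, keeps it representable. The scale value 2**16 is an arbitrary assumption.

```python
import numpy as np

# float16's smallest positive subnormal is ~5.96e-8; anything smaller
# rounds to zero (underflow). Gradients for later resblocks in a deep,
# wide network can easily be this small.
grad = np.float32(1e-9)                 # a "true" gradient value
print(np.float16(grad))                 # -> 0.0 (underflowed)

# Standard workaround: scale the loss (hence all gradients) up before
# casting to float16, then unscale in float32. DALL-E goes further and
# keeps a separate gradient scale per resblock.
scale = np.float32(2 ** 16)             # illustrative loss scale
scaled = np.float16(grad * scale)       # now representable in fp16
print(scaled)                           # -> ~6.55e-5, nonzero
recovered = np.float32(scaled) / scale  # unscale in fp32
print(recovered)                        # -> ~1e-9, gradient preserved
```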
The goal of this paper is to train a transformer that autoregressively models text and image tokens as a single stream of data. Using pixels directly as image tokens, however, would require too much memory, and a likelihood objective would prioritize modeling short-range dependencies between pixels. The paper addresses these issues with a two-stage training procedure.
Stage 1: Learning the Visual Codebook
A discrete variational autoencoder (dVAE) is trained to compress each 256×256 RGB image into a 32×32 grid of image tokens, where each element can assume 8192 possible values. This reduces the transformer's context size by a factor of 192 with no large degradation in visual quality; a tokenization sketch follows below.
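To make stage 1 concrete, here is a minimal PyTorch sketch, assuming a toy convolutional encoder (the real dVAE is deeper and is trained with a gumbel-softmax relaxation): it maps a 256×256 RGB image to a 32×32 grid of logits over 8192 codebook entries, and an argmax yields the discrete image tokens. It also reproduces the 192× context-reduction arithmetic from the text.

```python
import torch
import torch.nn as nn

VOCAB = 8192  # size of the visual codebook

class ToyDVAEEncoder(nn.Module):
    """Toy stand-in for the dVAE encoder: downsamples 256x256 -> 32x32
    and emits 8192-way logits per spatial position."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),       # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),     # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(128, VOCAB, 4, stride=2, padding=1),  # 64 -> 32
        )

    def forward(self, x):
        return self.net(x)  # (B, 8192, 32, 32) logits

encoder = ToyDVAEEncoder()
image = torch.randn(1, 3, 256, 256)   # one RGB image
logits = encoder(image)
tokens = logits.argmax(dim=1)         # (1, 32, 32) discrete image tokens
print(tokens.shape)                   # torch.Size([1, 32, 32])

# Context reduction cited in the text: 256*256*3 pixel values vs 32*32 tokens
print(256 * 256 * 3 // (32 * 32))     # 192
```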
Stage 2: Learning the Prior
Up to 256 BPE-encoded text tokens are concatenated with the 32×32 = 1024 image tokens, and an autoregressive transformer is trained to model the joint distribution over the text and image tokens, as sketched below.
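A minimal sketch of the stage-2 setup, assuming a small decoder-only model (nn.TransformerEncoder with a causal mask stands in for the paper's 12-billion-parameter transformer; all sizes and the text vocab are illustrative): text and image tokens share one sequence, and training minimizes next-token cross-entropy over the joint stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192  # text BPE vocab size is an assumption
TEXT_LEN, IMAGE_LEN = 256, 32 * 32     # 256 text + 1024 image tokens
SEQ_LEN = TEXT_LEN + IMAGE_LEN         # 1280-token joint stream
D_MODEL = 512                          # illustrative width

def join_stream(text_ids, image_ids):
    # Keep text and image tokens in one shared id space by offsetting image ids.
    return torch.cat([text_ids, image_ids + TEXT_VOCAB], dim=1)

class ToyJointTransformer(nn.Module):
    """Decoder-only transformer over the concatenated token stream."""
    def __init__(self):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB
        self.embed = nn.Embedding(vocab, D_MODEL)
        self.pos = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, vocab)

    def forward(self, ids):
        B, T = ids.shape
        h = self.embed(ids) + self.pos(torch.arange(T, device=ids.device))
        # Additive causal mask (-inf above the diagonal) makes attention
        # autoregressive: each position sees only earlier tokens.
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.head(h)

model = ToyJointTransformer()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN))
ids = join_stream(text, image)                       # (2, 1280)
logits = model(ids[:, :-1])                          # predict next token
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       ids[:, 1:].reshape(-1))
print(loss.item())
```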
The overall procedure can be viewed as maximizing the evidence lower bound (ELB) on the joint likelihood of the model over images, captions, and tokens.
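For reference, the bound from the paper (its Equation 1, reproduced here from memory) is taken over images x, captions y, and image tokens z, with dVAE encoder q_φ, dVAE decoder p_θ, transformer prior p_ψ, and KL weight β:

```latex
\ln p_{\theta,\psi}(x, y) \;\geq\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}
\Big[ \ln p_\theta(x \mid y, z)
      - \beta\, D_{\mathrm{KL}}\!\big(q_\phi(y, z \mid x) \,\|\, p_\psi(y, z)\big) \Big]
```

Stage 1 maximizes the bound with respect to φ and θ (training the dVAE); stage 2 fixes them and maximizes with respect to ψ (training the transformer prior).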