SORA大模型的一点分析与理解

Overview

SORA
- 一、原始技术博客分析
- - 1、Overview
  - 2、视频生成相关工作
  - 3、Turning visual data into patches
  - 4、Video compression network
  - 5、Spacetime latent patches
  - 6、Scaling transformers for video generation
  - 7、Variable durations, resolutions, aspect ratios
  - 8、Language understanding
  - 9、Prompting with images and videos
  - 10、Image generation capabilities
  - 11、Emerging simulation capabilities
  - 12、Discussion
  - 13、最值得参考的几篇文献（标黄更为重要）
  - 14、SORA模型结构图

SORA

一、原始技术博客分析

题目: Video generation models as world simulators
机构：OPEN AI
博客地址: https://openai.com/research/video-generation-models-as-world-simulators

1、Overview

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios（可变时长，分辨率，长宽比）. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

This technical report focuses on

Our method for turning visual data of all types into a unified representation that enables large-scale training of generative models
Qualitative evaluation of Sora’s capabilities and limitations.
Model and implementation details are not included in this report.

2、视频生成相关工作

Much prior work has studied generative modeling of video data using a variety of methods, including

Recurrent networks
- Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhudinov. “Unsupervised learning of video representations using lstms.” International conference on machine learning. PMLR, 2015.↩︎
- Chiappa, Silvia, et al. “Recurrent environment simulators.” arXiv preprint arXiv:1704.02254 (2017).↩︎
- Ha, David, and Jürgen Schmidhuber. “World models.” arXiv preprint arXiv:1803.10122 (2018).↩︎代码：https://blog.otoro.net/2018/06/09/world-models-experiments/
- - 在上面的世界模型当中，旨在训练一个大的神经网络来处理强化学习任务，其将agent划分为一个大的世界模型以及一个小的控制模型。首先用一种无监督的方式来让模型学习到agent的世界，然后训练一个小的控制模型来利用这个世界模型去执行特定的任务。
    
    其中V这一块，用的是一个VAE模型，来将每一个图像帧压缩为一个更小的隐向量（latent vector） $z$ 。
    如果说V模型的目的是在压缩agent每一个时间帧看到的什么，那么M模型就是在压缩历史时刻发生了什么，然后用于预测将来。
    
    这儿实际就是在预测（future z vectors that V is expected to produce），但是又不是直接预测 $z$ 向量，而是利用RNN结合混合高斯模型（MDN）来预测 $p (z)$ 的概率密度，具体而言，MDN输出的是混合高斯分布的参数，然后进行采样，得到下一个latent vector $z$ 的预测。类似的MDN-RNN的结构在诸如Sketch-RNN等工作中被用到，用于预测草图绘画中的下一个笔画，而在世界模型这篇工作中用的预测的是隐变量 $z$ 。
    控制器模型，用一个简单的线性层来进行实现，并且独立于V和M训练，输入是 $z_t$ 和 $h_t$ ，输出的action $a_t$ 。
    最终将V，M，C放在一起就会是如下的一个结构：
    
    整个实验算法的流程图如下：实验是基于openai gym环境
Generative adversarial networks
- Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Generating videos with scene dynamics.” Advances in neural information processing systems 29 (2016).↩︎MIT
- Tulyakov, Sergey, et al. “Mocogan: Decomposing motion and content for video generation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.↩︎英伟达
- Clark, Aidan, Jeff Donahue, and Karen Simonyan. “Adversarial video generation on complex datasets.” arXiv preprint arXiv:1907.06571 (2019).↩︎谷歌
- Brooks, Tim, et al. “Generating long videos of dynamic scenes.” Advances in Neural Information Processing Systems 35 (2022): 31769-31781.↩︎（英伟达，long video gan）代码：https://github.com/NVlabs/long-video-gan
Autoregressive transformers
- Wu, Chenfei, et al. “Nüwa: Visual synthesis pre-training for neural visual world creation.” European conference on computer vision. Cham: Springer Nature Switzerland, 2022.↩︎微软北大
- Yan, Wilson, et al. “Videogpt: Video generation using vq-vae and transformers.” arXiv preprint arXiv:2104.10157 (2021).↩︎UC Berkeley，代码：https://github.com/wilson1yan/VideoGPT
Diffusion models
- Ho, Jonathan, et al. “Imagen video: High definition video generation with diffusion models.” arXiv preprint arXiv:2210.02303 (2022).↩︎谷歌
- Blattmann, Andreas, et al. “Align your latents: High-resolution video synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.↩︎英伟达
- Gupta, Agrim, et al. “Photorealistic video generation with diffusion models.” arXiv preprint arXiv:2312.06662 (2023).↩︎斯坦福，谷歌

3、Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.

Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).↩︎
Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.↩︎↩︎

The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.

Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).↩︎↩︎
Arnab, Anurag, et al. “Vivit: A video vision transformer.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.↩︎↩︎谷歌，代码：https://github.com/google-research/scenic/tree/main/scenic/projects/vivit
He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.↩︎↩︎
Dehghani, Mostafa, et al. “Patch n’Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution.” arXiv preprint arXiv:2307.06304 (2023).↩︎↩︎谷歌

We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.
在这里插入图片描述

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,19 and subsequently decomposing the representation into spacetime patches.

Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

4、Video compression network

We train a network that reduces the dimensionality of visual data.（利用VAE） This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).↩︎

5、Spacetime latent patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.

6、Scaling transformers for video generation

Sora is a diffusion model21,22,23,24,25;

Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International conference on machine learning. PMLR, 2015.↩︎
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.↩︎
Nichol, Alexander Quinn, and Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” International Conference on Machine Learning. PMLR, 2021.↩︎
Dhariwal, Prafulla, and Alexander Quinn Nichol. “Diffusion Models Beat GANs on Image Synthesis.” Advances in Neural Information Processing Systems. 2021.↩︎OPENAI
Karras, Tero, et al. “Elucidating the design space of diffusion-based generative models.” Advances in Neural Information Processing Systems 35 (2022): 26565-26577.↩︎

given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer.26

Peebles, William, and Saining Xie. “Scalable diffusion models with transformers.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.↩︎UC Berkeley，纽约大学

Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,computer vision,and image generation.27,28,29

Chen, Mark, et al. “Generative pretraining from pixels.” International conference on machine learning. PMLR, 2020.↩︎OPENAI
Ramesh, Aditya, et al. “Zero-shot text-to-image generation.” International Conference on Machine Learning. PMLR, 2021.↩︎OPENAI DALLE
Yu, Jiahui, et al. “Scaling autoregressive models for content-rich text-to-image generation.” arXiv preprint arXiv:2206.10789 2.3 (2022): 5. Parti 谷歌

在这里插入图片描述 In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

7、Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size—e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.
Sampling flexibility
Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything inbetween. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.
在这里插入图片描述 Improved framing and composition
We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.

8、Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos.

Betker, James, et al. “Improving image generation with better captions.” Computer Science. https://cdn.openai.com/papers/dall-e-3. pdf 2.3 (2023): 8

We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.

9、Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 231 and DALL·E 330 images.

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit,32 to Sora.

Meng, Chenlin, et al. “Sdedit: Guided image synthesis and editing with stochastic differential equations.” arXiv preprint arXiv:2108.01073 (2021).↩︎斯坦福，卡内基梅隆

This technique enables Sora to transform the styles and environments of input videos zero-shot.

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

10、Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.

11、Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds. Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.

12、Discussion

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page.

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.