Some Analysis and Understanding of the SORA Large Model

Overview

  • SORA
    • I. Analysis of the Original Technical Blog
      • 1、Overview
      • 2、Related Work on Video Generation
      • 3、Turning visual data into patches
      • 4、Video compression network
      • 5、Spacetime latent patches
      • 6、Scaling transformers for video generation
      • 7、Variable durations, resolutions, aspect ratios
      • 8、Language understanding
      • 9、Prompting with images and videos
      • 10、Image generation capabilities
      • 11、Emerging simulation capabilities
      • 12、Discussion
      • 13、Most Relevant References (highlighted entries are more important)
      • 14、SORA Model Architecture Diagram

SORA

I. Analysis of the Original Technical Blog

Title: Video generation models as world simulators
Institution: OpenAI
Blog address: https://openai.com/research/video-generation-models-as-world-simulators

1、Overview

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

This technical report focuses on:

  1. Our method for turning visual data of all types into a unified representation that enables large-scale training of generative models.
  2. Qualitative evaluation of Sora’s capabilities and limitations.

Model and implementation details are not included in the report.

2、Related Work on Video Generation

Much prior work has studied generative modeling of video data using a variety of methods, including

  • Recurrent networks

    • Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhudinov. “Unsupervised learning of video representations using LSTMs.” International Conference on Machine Learning. PMLR, 2015.
    • Chiappa, Silvia, et al. “Recurrent environment simulators.” arXiv preprint arXiv:1704.02254 (2017).
    • Ha, David, and Jürgen Schmidhuber. “World models.” arXiv preprint arXiv:1803.10122 (2018). Code: https://blog.otoro.net/2018/06/09/world-models-experiments/
      • In the World Models paper, the goal is to train a large neural network for reinforcement-learning tasks by splitting the agent into a large world model and a small controller model. The world model is first learned in an unsupervised way, and a small controller is then trained to use that world model to carry out a specific task. (A minimal code sketch of this V-M-C decomposition is given after this reference list.)
        The V component is a VAE that compresses each image frame into a smaller latent vector $z$.
        If the V model compresses what the agent sees at each time step, the M model compresses what has happened over past time steps and uses it to predict the future.
        Concretely, M predicts the future $z$ vectors that V is expected to produce, but instead of predicting $z$ directly it uses an RNN combined with a mixture density network (MDN) to predict the probability density $p(z)$: the MDN outputs the parameters of a Gaussian mixture, from which the next latent vector $z$ is sampled. A similar MDN-RNN structure is used in works such as Sketch-RNN to predict the next stroke of a sketch; in World Models it predicts the latent variable $z$.
        The controller C is implemented as a single linear layer and trained independently of V and M; its inputs are $z_t$ and $h_t$, and its output is the action $a_t$.
        Putting V, M and C together yields the overall agent architecture; the full experimental pipeline is built on OpenAI Gym environments (the architecture and flowchart figures are in the original post).

  • Generative adversarial networks

    • Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Generating videos with scene dynamics.” Advances in Neural Information Processing Systems 29 (2016). (MIT)
    • Tulyakov, Sergey, et al. “MoCoGAN: Decomposing motion and content for video generation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. (NVIDIA)
    • Clark, Aidan, Jeff Donahue, and Karen Simonyan. “Adversarial video generation on complex datasets.” arXiv preprint arXiv:1907.06571 (2019). (Google)
    • Brooks, Tim, et al. “Generating long videos of dynamic scenes.” Advances in Neural Information Processing Systems 35 (2022): 31769-31781. (NVIDIA, Long-Video GAN) Code: https://github.com/NVlabs/long-video-gan
  • Autoregressive transformers

    • Wu, Chenfei, et al. “NÜWA: Visual synthesis pre-training for neural visual world creation.” European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022. (Microsoft & Peking University)
    • Yan, Wilson, et al. “VideoGPT: Video generation using VQ-VAE and transformers.” arXiv preprint arXiv:2104.10157 (2021). (UC Berkeley) Code: https://github.com/wilson1yan/VideoGPT
  • Diffusion models

    • Ho, Jonathan, et al. “Imagen Video: High definition video generation with diffusion models.” arXiv preprint arXiv:2210.02303 (2022). (Google)
    • Blattmann, Andreas, et al. “Align your latents: High-resolution video synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. (NVIDIA)
    • Gupta, Agrim, et al. “Photorealistic video generation with diffusion models.” arXiv preprint arXiv:2312.06662 (2023). (Stanford & Google)
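
As referenced in the World Models entry above, here is a minimal PyTorch sketch of the V-M-C decomposition (VAE, MDN-RNN, linear controller). All dimensions, layer widths and class names are illustrative assumptions, not the original implementation.

```python
# Toy sketch of the World Models V-M-C split (Ha & Schmidhuber, 2018).
# Dimensions and module details are simplified assumptions.
import torch
import torch.nn as nn

Z_DIM, H_DIM, ACTION_DIM, N_MIX = 32, 256, 3, 5

class V(nn.Module):
    """VAE encoder: compresses one observed frame into a small latent vector z."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten())
        self.mu, self.logvar = nn.LazyLinear(Z_DIM), nn.LazyLinear(Z_DIM)
    def forward(self, frame):                        # frame: (B, 3, H, W)
        h = self.conv(frame)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized z

class M(nn.Module):
    """MDN-RNN: from the (z_t, a_t) history, outputs Gaussian-mixture params for p(z_{t+1})."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(Z_DIM + ACTION_DIM, H_DIM, batch_first=True)
        self.mdn = nn.Linear(H_DIM, N_MIX * (2 * Z_DIM + 1))      # pi, mu, sigma per component
    def forward(self, z, a, state=None):             # z: (B, T, Z_DIM), a: (B, T, ACTION_DIM)
        h, state = self.rnn(torch.cat([z, a], dim=-1), state)
        return self.mdn(h), h, state

class C(nn.Module):
    """Controller: a single linear layer acting on [z_t, h_t], trained separately from V and M."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(Z_DIM + H_DIM, ACTION_DIM)
    def forward(self, z_t, h_t):
        return torch.tanh(self.fc(torch.cat([z_t, h_t], dim=-1)))
```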

3、Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.

  • Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).
  • Brown, Tom, et al. “Language models are few-shot learners.” Advances in Neural Information Processing Systems 33 (2020): 1877-1901.

The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.

  • Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
  • Arnab, Anurag, et al. “ViViT: A video vision transformer.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. (Google) Code: https://github.com/google-research/scenic/tree/main/scenic/projects/vivit
  • He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  • Dehghani, Mostafa, et al. “Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution.” arXiv preprint arXiv:2307.06304 (2023). (Google)

We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.

  • Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

4、Video compression network

We train a network that reduces the dimensionality of visual data (using a VAE). This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

  • Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
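
Sora's compression network is not described in detail; as a rough illustration of the idea, here is a toy VAE-style video autoencoder that downsamples both temporally and spatially, plus a decoder back to pixel space. All architectural choices (strides, channel widths) are assumptions.

```python
# Illustrative VAE-style video autoencoder: compresses raw video spatially and
# temporally into a latent, with a decoder mapping latents back to pixels.
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels=8):
        super().__init__()
        # Each stride-2 Conv3d halves the temporal and spatial extents.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(128, 2 * latent_channels, kernel_size=3, padding=1),  # mean & logvar
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(64, 3, kernel_size=3, padding=1),
        )

    def encode(self, video):                      # video: (B, 3, T, H, W)
        mean, logvar = self.encoder(video).chunk(2, dim=1)
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()

    def decode(self, latent):                     # latent: (B, C, T/4, H/4, W/4)
        return self.decoder(latent)

video = torch.randn(1, 3, 16, 64, 64)             # a tiny example clip
model = VideoAutoencoder()
z = model.encode(video)                           # (1, 8, 4, 16, 16)
recon = model.decode(z)                           # back to (1, 3, 16, 64, 64)
```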

5、Spacetime latent patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
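
A minimal sketch of what extracting spacetime patches from a compressed latent video might look like; the patch sizes and token dimension are assumptions, not Sora's actual values.

```python
# Illustrative spacetime "patchify": cut a latent video (C, T, H, W) into
# non-overlapping t x p x p blocks and flatten each block into one token.
import torch

def patchify(latent, t=2, p=2):
    C, T, H, W = latent.shape
    x = latent.reshape(C, T // t, t, H // p, p, W // p, p)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)            # (T/t, H/p, W/p, C, t, p, p)
    return x.reshape(-1, C * t * p * p)           # one row per spacetime patch token

latent = torch.randn(8, 4, 16, 16)                # e.g. the latent from the previous sketch
tokens = patchify(latent)                         # (2*8*8, 8*2*2*2) = (128, 64)

# At inference, generation starts from randomly-initialized patches arranged in a
# grid of the desired size; a larger grid simply means more tokens.
grid = (4, 32, 32)                                # latent-space (T, H, W) for a larger video
noise_tokens = torch.randn((grid[0] // 2) * (grid[1] // 2) * (grid[2] // 2), 64)
```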

6、Scaling transformers for video generation

Sora is a diffusion model;

  • Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International Conference on Machine Learning. PMLR, 2015.
  • Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
  • Nichol, Alexander Quinn, and Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” International Conference on Machine Learning. PMLR, 2021.
  • Dhariwal, Prafulla, and Alexander Quinn Nichol. “Diffusion models beat GANs on image synthesis.” Advances in Neural Information Processing Systems. 2021. (OpenAI)
  • Karras, Tero, et al. “Elucidating the design space of diffusion-based generative models.” Advances in Neural Information Processing Systems 35 (2022): 26565-26577.

given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer.

  • Peebles, William, and Saining Xie. “Scalable diffusion models with transformers.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. (UC Berkeley & New York University)
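
As a rough illustration of the training objective described above (noisy patch tokens in, prediction of the clean patches out), here is a toy diffusion-transformer training step. The noise schedule, conditioning scheme and model are simplified assumptions, not the actual DiT/Sora recipe.

```python
# Simplified sketch: noise the clean spacetime patch tokens, then train a
# transformer to recover them, conditioned on text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiT(nn.Module):
    def __init__(self, token_dim=64, text_dim=64, width=256, depth=4, heads=4):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, width)
        self.text_proj = nn.Linear(text_dim, width)
        self.time_emb = nn.Linear(1, width)
        layer = nn.TransformerEncoderLayer(width, heads, 4 * width, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out_proj = nn.Linear(width, token_dim)

    def forward(self, noisy_tokens, text_tokens, t):
        x = self.in_proj(noisy_tokens) + self.time_emb(t[:, None, None])
        ctx = self.text_proj(text_tokens)
        x = self.blocks(torch.cat([ctx, x], dim=1))[:, ctx.shape[1]:]  # drop text positions
        return self.out_proj(x)                      # prediction of the clean tokens

def training_step(model, clean_tokens, text_tokens):
    t = torch.rand(clean_tokens.shape[0])            # diffusion time in [0, 1]
    noise = torch.randn_like(clean_tokens)
    alpha = (1.0 - t)[:, None, None]                 # toy linear schedule
    noisy = alpha * clean_tokens + (1 - alpha) * noise
    pred = model(noisy, text_tokens, t)
    return F.mse_loss(pred, clean_tokens)            # "predict the original clean patches"

model = TinyDiT()
loss = training_step(model, torch.randn(2, 128, 64), torch.randn(2, 16, 64))
```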

Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.

  • Chen, Mark, et al. “Generative pretraining from pixels.” International Conference on Machine Learning. PMLR, 2020. (OpenAI)
  • Ramesh, Aditya, et al. “Zero-shot text-to-image generation.” International Conference on Machine Learning. PMLR, 2021. (OpenAI, DALL·E)
  • Yu, Jiahui, et al. “Scaling autoregressive models for content-rich text-to-image generation.” arXiv preprint arXiv:2206.10789 (2022). (Parti, Google)

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

7、Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size—e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.
Sampling flexibility
Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.
Improved framing and composition
We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
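
One plausible way to batch videos of different native sizes (in the spirit of NaViT's Patch n' Pack) is to pack variable-length token sequences with padding and an attention mask rather than cropping everything to a fixed shape. The sketch below is an assumption about how this could look, not Sora's actual data pipeline.

```python
# Sketch of packing: videos at different native resolutions yield token sequences
# of different lengths, which are batched with padding plus a validity mask.
import torch

def pack_sequences(token_lists, pad_value=0.0):
    max_len = max(t.shape[0] for t in token_lists)
    dim = token_lists[0].shape[1]
    batch = torch.full((len(token_lists), max_len, dim), pad_value)
    mask = torch.zeros(len(token_lists), max_len, dtype=torch.bool)
    for i, toks in enumerate(token_lists):
        batch[i, : toks.shape[0]] = toks
        mask[i, : toks.shape[0]] = True               # True = real token, False = padding
    return batch, mask

# e.g. a widescreen clip and a vertical clip produce different token counts
wide = torch.randn(480, 64)
tall = torch.randn(270, 64)
batch, mask = pack_sequences([wide, tall])            # (2, 480, 64), (2, 480)
```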

8、Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos.

  • Betker, James, et al. “Improving image generation with better captions.” Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf (2023).

We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.
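
The data flow described above can be summarized in a few lines of Python; `captioner`, `expand_prompt` and `video_model` are hypothetical placeholders, not real APIs.

```python
# Data-flow sketch of the re-captioning idea: a descriptive captioner labels every
# training video, and at inference a language model expands a short user prompt
# into a detailed caption before it is sent to the video model.
def build_training_pairs(videos, captioner):
    # Replace short/noisy original captions with highly descriptive ones.
    return [(video, captioner(video)) for video in videos]

def generate(user_prompt, expand_prompt, video_model):
    detailed_caption = expand_prompt(user_prompt)   # e.g. "a dog" -> a long scene description
    return video_model(detailed_caption)
```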

9、Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 and DALL·E 3 images.

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit, to Sora.

  • Meng, Chenlin, et al. “SDEdit: Guided image synthesis and editing with stochastic differential equations.” arXiv preprint arXiv:2108.01073 (2021). (Stanford & Carnegie Mellon)

This technique enables Sora to transform the styles and environments of input videos zero-shot.
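
SDEdit itself is simple to sketch: partially noise the source video's latents to a chosen strength, then run the reverse diffusion process under a new text prompt, so overall structure is preserved while style and environment change. The code below is a toy illustration with a hypothetical `denoise_step`, not OpenAI's implementation.

```python
# SDEdit-style editing sketch on latent patch tokens.
import torch

def sdedit(source_tokens, text_tokens, denoise_step, strength=0.6, steps=50):
    t0 = strength                                     # how far toward pure noise to go
    noise = torch.randn_like(source_tokens)
    x = (1 - t0) * source_tokens + t0 * noise         # toy forward-noising to time t0
    for t in torch.linspace(t0, 0.0, steps):
        x = denoise_step(x, text_tokens, t)           # reverse process from t0 back to 0
    return x
```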

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

10、Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
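
A small sketch of the single-frame trick described above: build a noise grid whose temporal extent is one latent frame and generate exactly as for video. The downsampling factor, patch size and token dimension are assumptions.

```python
# Image generation as a one-frame video: the noise grid has temporal extent 1.
import torch

def noise_grid_for_image(height, width, spatial_downsample=8, patch=2, token_dim=64):
    h, w = height // spatial_downsample, width // spatial_downsample   # latent spatial size
    n_tokens = 1 * (h // patch) * (w // patch)                         # temporal extent = 1
    return torch.randn(n_tokens, token_dim)

tokens = noise_grid_for_image(2048, 2048)            # (16384, 64) tokens for a 2048x2048 image
```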

11、Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds. Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.

12、Discussion

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page.

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.

13、Most Relevant References (highlighted entries are more important)

World Models (RNN-based + RL), Google Brain, 2018.05

Generating Long Videos of Dynamic Scenes (GAN-based), NVIDIA, 2022.06

VideoGPT: Video Generation using VQ-VAE and Transformers (autoregressive transformers), UC Berkeley, 2021.04

(green star) ViViT: A Video Vision Transformer, Google, 2021. Code: vivit

(green star) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, Google DeepMind, 2023.07 (together with Pix2Struct)

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti), Google, 2022.06

(green star) Scalable Diffusion Models with Transformers (DiT, transformer with diffusion), Facebook, UC Berkeley & New York University, 2022.12. Code: DiT

Improving Image Generation with Better Captions (DALL·E 3), OpenAI, 2023

Not mentioned in the technical blog, but also quite important:

Masked Autoencoders As Spatiotemporal Learners, Meta, NeurIPS 2022

VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning, Meta, 2022.01

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA), Meta, 2023.01

Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA), Meta, 2024.02

14、SORA Model Architecture Diagram
