Inf-DiT: Upsampling Any-Resolution Image, Vidu, MVDiff, Trio-ViT

First published on the WeChat official account 机器感知

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Diffusion models have shown remarkable performance in image generation in recent years. However, because memory grows quadratically when generating ultra-high-resolution images (e.g. 4096×4096), the resolution of generated images is often limited to 1024×1024. In this work, we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during inference and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096×4096 images. The project URL is .......
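
To make the quadratic growth concrete, here is a rough sketch that counts attention-score entries only. The patch size, block size and number of context blocks are illustrative assumptions, not values from the paper, and the block scheme is a simplified stand-in for the unidirectional block attention described above.

```python
def full_attention_entries(side_px: int, patch_px: int) -> int:
    """Score-matrix entries when every token attends to every token."""
    tokens = (side_px // patch_px) ** 2
    return tokens * tokens


def block_attention_entries(side_px: int, patch_px: int,
                            block_px: int, context_blocks: int) -> int:
    """Score entries when each block attends only to itself and a fixed
    number of previously processed blocks (hypothetical simplification)."""
    tokens_per_block = (block_px // patch_px) ** 2
    num_blocks = (side_px // block_px) ** 2
    keys_per_query = tokens_per_block * (1 + context_blocks)
    return num_blocks * tokens_per_block * keys_per_query


# Assumed settings: 16-pixel patches, 512-pixel blocks, 3 context blocks.
full_1k = full_attention_entries(1024, 16)
full_4k = full_attention_entries(4096, 16)
block_4k = block_attention_entries(4096, 16, 512, 3)

# 4x the side length gives 16x the tokens and 256x the score entries.
growth = full_4k // full_1k
# With a fixed context per block, the cost grows linearly in the block count.
saving = full_4k // block_4k
```

Under these assumed settings, `growth` is 256 and `saving` is 16. The numbers only illustrate the scaling argument; they say nothing about the memory figures reported in the abstract.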

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.......

Space-time Reinforcement Network for Video Object Segmentation

Recently, video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite these methods having superior performance, they suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames. 2) Pixel-level matching will lead to undesired mismatching caused by the noises or distractors. To address the aforementioned issues, we first propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query one. Next, we learn a prototype for each video object and prototype-level matching can be implemented between the query and memory. The experiment demonstrated that our network outperforms the state-of-the-art method on the DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive result 85.0% on YouTube VOS 2018. In addition, our network exhibits a high inference sp......

Structured Click Control in Transformer-based Interactive Segmentation

Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, it's hard for existing algorithms to obtain precise and robust responses after multiple clicks. In this case, the segmentation results tend to have little change or are even worse than before. To improve the robustness of the response, we propose a structured click intent model based on graph neural networks, which adaptively obtains graph nodes via the global similarity of user-clicked Transformer tokens. Then the graph nodes will be aggregated to obtain structured interaction features. Finally, the dual cross-attention will be used to inject structured interaction features into vision Transformer features, thereby enhancing the control of clicks over segmentation results. Extensive experiments demonstrate that the proposed algorithm can serve as a general structure for improving Transformer-based interactive segmentation performance. The code and data will be released at......

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario data collected from the internet, which captures the intricacies of user intentions for promoting the practical application of image editing in the real world. (3) High-precision multi-turn editing data annotated by humans, which involves multiple rounds of edits for simulating iterative editing processes. The combination of these diverse data sources makes SEED-Data-Edit a comprehensive and versatile dataset for training language-guided image editing models. We fine-tune a pretrained Multimodal Large Language Model (MLLM) that unifies comprehension and generation with SEED-Data-Edit. T......

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseli......
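
As a minimal sketch of the general idea, the snippet below adds a low-rank (LoRA-style) update to a projection matrix and scales it by a value that would come from the conditioning signal. The excerpt does not say how the paper derives that value from the time or class embedding, so `scale` is passed in as a plain number here, and all names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_conditioned_projection(x, w, a, b, scale):
    """Compute y = W x + scale * B (A x).

    W is the frozen base projection, A (rank x d_in) and B (d_out x rank)
    form the low-rank update, and `scale` stands in for a value derived
    from the conditioning input (an assumption of this sketch).
    """
    base = matvec(w, x)
    low_rank = matvec(b, matvec(a, x))
    return [y + scale * delta for y, delta in zip(base, low_rank)]


# Toy rank-1 example with made-up numbers.
w = [[1, 0], [0, 1]]   # base projection (identity)
a = [[1, 1]]           # 1 x 2 down-projection
b = [[1], [2]]         # 2 x 1 up-projection
x = [3, 4]

unconditioned = lora_conditioned_projection(x, w, a, b, scale=0.0)
conditioned = lora_conditioned_projection(x, w, a, b, scale=0.5)
```

With `scale=0.0` the layer reduces to the base projection, which is why such an update can be dropped into an existing attention layer without changing its initial behaviour.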

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model qual......
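
The entropy observation can be checked on a toy example. The two "channels" below are made-up binary sequences that mostly agree with each other; the sketch only illustrates that the joint entropy of dependent channels is smaller than the sum of their marginal entropies, and it is not an implementation of CQ.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical Shannon entropy in bits of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical quantized activations of two correlated channels.
channel_a = [0, 0, 1, 1, 0, 1, 0, 1]
channel_b = [0, 0, 1, 1, 0, 1, 1, 0]

marginal_sum = entropy(channel_a) + entropy(channel_b)  # 2.0 bits
joint = entropy(list(zip(channel_a, channel_b)))        # about 1.81 bits
```

Coding the pair jointly therefore needs fewer bits than coding each channel separately, which is the kind of saving that coupling channels aims to exploit.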

MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

Generating consistent multiple views for 3D reconstruction tasks is still a challenge for existing image-to-3D diffusion models. Generally, incorporating 3D representations into diffusion models decreases the model's speed as well as its generalizability and quality. This paper proposes a general framework to generate consistent multi-view images from a single image by leveraging a scene representation transformer and a view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one image input, our model is able to generate 3D meshes surpassing baseline methods in evaluation metrics, including PSNR, SSIM and LPIPS.......

Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly Softmax, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with standard ViTs, we focus our attention towards the quantization and acceleration for efficient ViTs, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingl......
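
Why removing Softmax enables low-complexity attention can be shown with a small sketch. Without the row-wise Softmax, (Q Kᵀ) V can be regrouped as Q (Kᵀ V), which avoids building the token-by-token score matrix. The excerpt does not give Trio-ViT's actual attention formulation, so the version below uses no feature map and no normalization; both are simplifying assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def quadratic_order(q, k, v):
    """(Q K^T) V: builds an n x n score matrix, cost grows with n^2."""
    return matmul(matmul(q, transpose(k)), v)


def linear_order(q, k, v):
    """Q (K^T V): builds only a d x d matrix, cost grows linearly with n."""
    return matmul(q, matmul(transpose(k), v))


# Toy inputs: n = 4 tokens, d = 2 channels, made-up integer values.
q = [[1, 2], [0, 1], [3, 1], [2, 2]]
k = [[1, 0], [2, 1], [0, 3], [1, 1]]
v = [[1, 2], [3, 4], [5, 6], [7, 8]]

same_result = quadratic_order(q, k, v) == linear_order(q, k, v)
```

Both orders give the same output by associativity of matrix multiplication. A Softmax between the two products would break that regrouping, which is why Softmax-based attention cannot use this shortcut.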
