Multi-task Video Enhancement for Dental Interventions (paper notes)

Multi-task Video Enhancement for Dental Interventions

MICCAI 2022

Abstract

A microcamera firmly attached to a dental handpiece allows dentists to continuously monitor the progress of conservative dental procedures. Video enhancement in video-assisted dental interventions alleviates low light, noise, blur, and camera shake, all of which reduce visual comfort. To this end, we introduce a novel deep network for multi-task video enhancement that enables macro-visualization of dental scenes. In particular, the network jointly leverages video restoration and temporal alignment in a multi-scale manner to enhance videos effectively. Our experiments on videos of natural teeth in phantom scenes show that the proposed network achieves state-of-the-art results across tasks with near real-time processing. We release Vident-lab at https://doi.org/10.34808/1jby-ay90, the first dental video dataset with multi-task labels, to facilitate further research in related video processing applications.

Related Work

UberNet [9] and cross-stitch networks [16] are encoder-focused architectures that propagate task outputs across scales in the encoder.

PAD-Net [27] and PAP-Net [29] are decoder-focused networks that use multi-modal distillation to fuse the outputs of task heads for the final dense predictions, but only at a single scale.

MTI-Net [24], which is most similar to our architecture, extends the decoder fusion by propagating task-specific features bottom-up across multiple scales through the encoder.

Instead of propagating task features through scale-specific distillation modules toward the encoder, our network simultaneously propagates task outputs to the encoder and to the task heads in the decoder. Furthermore, these networks make dense task predictions on static images, whereas we extend our network to videos.

Contribution

i) a novel application of a microcamera in computer-aided dental intervention for continuous tooth macro-visualization during drilling (so this one is actually a hardware contribution... oh well)

(ii)    a new, asymmetrically annotated dataset of natural teeth in phantom scenes with pairs of frames of compromised and good quality using a beam splitter,

(iii)  a novel deep network for video processing that propagates task outputs to encoder and decoder across multiple scales to model task interactions, and (iv) demonstration that an instantiated model effectively addresses multi-task video enhancement in our application by matching and surpassing state-of-the-art results of single task networks in near real-time.

Method

Exploit interactions between different tasks to improve video enhancement.

Video enhancement tasks are interrelated. For example:
-- Aligning video frames helps deblurring.
-- Denoising and deblurring reveal image features that help motion estimation.
This interdependence can be exploited by designing a multi-task model.

MOST-Net is a multi-output, multi-scale, multi-task network architecture. Its goal is to model task interactions through multi-scale features between the encoder and the decoder. The network produces outputs for multiple tasks (denoted by T) at multiple scales (denoted by s); for example, the restored frame, the segmentation mask, and the homography are each predicted at every scale.

Propagation:

  • Within-scale propagation: task outputs are propagated within the current scale.
  • Cross-scale propagation: task outputs from the lower scale are upsampled and propagated to the decoder layers and task branches of the higher scale (a minimal sketch follows).
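As a rough illustration of this cross-scale propagation (my own sketch, not the authors' code): a task output predicted at the lower scale is upsampled and concatenated with the higher-scale decoder features before the task head runs again.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskBranch(nn.Module):
    """Hypothetical higher-scale task head that also consumes the upsampled lower-scale output."""
    def __init__(self, feat_ch, out_ch):
        super().__init__()
        # +out_ch because the upsampled lower-scale prediction is concatenated with the features
        self.head = nn.Conv2d(feat_ch + out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, decoder_feat, lower_scale_out):
        # cross-scale propagation: upsample the coarse prediction (e.g. mask logits) by 2x
        up = F.interpolate(lower_scale_out, scale_factor=2, mode="bilinear", align_corners=False)
        # the coarse prediction conditions the finer prediction
        return self.head(torch.cat([decoder_feat, up], dim=1))

# toy usage: a 1-channel mask predicted at scale s+1 refines the mask at scale s
branch = TaskBranch(feat_ch=32, out_ch=1)
feat_s = torch.randn(1, 32, 64, 64)   # decoder features at the higher scale
mask_s1 = torch.randn(1, 1, 32, 32)   # mask logits predicted at the lower scale
mask_s = branch(feat_s, mask_s1)      # refined mask logits at the higher scale
```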

Constraints:

u_i denotes some operator, for instance, the upsampling operator for segmentation or the scaling operator for homography estimation.
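Written out as a formula (my reconstruction of the constraint from the description above; the exact notation in the paper may differ):

```latex
% cross-scale consistency between task outputs (i indexes tasks, s indexes scales, s = 1 is finest)
\forall i \in \{1,\dots,T\},\ \forall s \in \{1,\dots,S-1\}:\quad
O_{i,t}^{\,s} \;=\; u_i\!\left(O_{i,t}^{\,s+1}\right)
% u_i: e.g. bilinear upsampling for the segmentation mask,
%      or a corner-offset/scale adjustment for the homography
```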

Problem Statement

The model must jointly solve video restoration, teeth segmentation, and motion estimation, and it is trained and optimized under an assumed degradation model of the input images.

T = 3 and O1: video restoration, O2: segmentation, O3: homography estimation.

The video stream generates observations B_{t−P}, …, B_t, where t is the time index and P > 0 is a scalar referring to the number of past frames.

The problem is to (1) estimate a clean frame, (2) a binary teeth segmentation mask, and (3) approximate the inter-frame motion by a homography matrix; at scale s = 1 the joint output of the three tasks is denoted by the triplet (R_t, M_t, H_t).



Let x correspond to pixel location. Given per-pixel blur kernels k_{x,t} of size K, the degraded image (this models the degradation of the input video, e.g. blur and noise) at s = 1 is generated as:
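The equation itself did not survive the copy-paste. Based on the surrounding text (per-pixel blur kernels plus the noise maps described in the dataset section), it should be roughly of the following form; treat the notation as my reconstruction rather than the paper's:

```latex
% degraded frame at s = 1: per-pixel blur of the clean frame plus additive noise
B_t(x) \;=\; \left(k_{x,t} \ast R_t\right)(x) \;+\; n_t(x)
% k_{x,t}: K x K blur kernel at pixel x and time t (applied to the K x K patch around x)
% R_t: clean (restored) frame, n_t: noise map
```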

We assume multiple independently moving objects are present in the considered scenes, while our task is to estimate only the motion related to the object of interest (i.e. the teeth), which lies in the region indicated by the non-zero values of mask M:

∀t ∀x means "for all t and all x".
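The constraint that should follow is also missing from my notes. As I understand it, it says that wherever the teeth mask is non-zero, the inter-frame motion is explained by the homography (my reconstruction, not the paper's exact formula):

```latex
% the homography H_t explains the motion of every pixel inside the teeth mask
\forall t,\ \forall x:\quad M_t(x) = 1 \;\Rightarrow\; \tilde{x}_t \,\sim\, H_t\,\tilde{x}_{t-1}
% \tilde{x}: pixel location in homogeneous coordinates, ~ denotes equality up to scale
```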

Training

Defining the loss functions and the optimization objective for the multi-task, multi-scale model.

Dataset

Loss Function

 

The total loss sums over N (number of samples), T (number of tasks), and S (number of scales), i.e. N × T × S terms in total.

Loss function types

Each of the three tasks has its own loss term, balanced by the weights λ1, λ2, λ3 (Eq. 4). The model learns its parameters Θ by minimizing the total loss, jointly optimizing the output predictions of all tasks at all scales. The optimization therefore accounts for the interactions between tasks and the synergy across scales (the core idea of multi-task, multi-scale learning).
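Putting this into a formula (my sketch; I only know the task weights λ1, λ2, λ3 from Eq. 4, the concrete per-task loss types L_i are not in my notes):

```latex
% total objective: sum over samples n, tasks i and scales s, with per-task weights
\min_{\Theta}\;\sum_{n=1}^{N}\sum_{i=1}^{T}\sum_{s=1}^{S}
\lambda_i\, \mathcal{L}_i\!\left(O_{i,n}^{\,s},\, Y_{i,n}^{\,s}\right)
% O: predicted task output, Y: ground-truth label at scale s, \lambda_i: task-balancing weight
```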

I still don't feel I fully understand the multi-task learning part; I'll go read some other papers on it.

Structure

MOST-Net enables refinement of lower scale segmentations by upsampling and inputting them at the task-specific branches of higher scales.

Encoders

MOST-Net extracts features from two input frames Bt−1 and Bt independently at three scales. In other words, the model processes the input at multiple scales simultaneously.

U-shaped downsampling: features are extracted via 3 × 3 convolutions with strides of 1, 2, 2 for s = 1, 2, 3, followed by ReLU activations and 5 residual blocks [4] at each scale. The residual connections are augmented with an additional branch of convolutions in the Fast Fourier domain.
The output channel dimension at scale s is 2^(s+4).
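A rough sketch of one such residual block with the extra Fourier-domain branch (my own guess at the structure, in the spirit of FFT-augmented residual blocks; the authors' block may differ in details):

```python
import torch
import torch.nn as nn

class FFTResBlock(nn.Module):
    """Residual block with an extra convolution branch in the Fourier domain (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        # after rfft2 the real and imaginary parts are stacked -> 2*ch channels
        self.freq = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 2 * ch, 1),
        )

    def forward(self, x):
        # frequency branch: rfft2 -> 1x1 convs on stacked real/imag parts -> inverse rfft2
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = self.freq(torch.cat([spec.real, spec.imag], dim=1))
        real, imag = spec.chunk(2, dim=1)
        freq_out = torch.fft.irfft2(torch.complex(real, imag), s=x.shape[-2:], norm="ortho")
        return x + self.spatial(x) + freq_out

blk = FFTResBlock(ch=32)                 # 2^(s+4) = 32 channels at s = 1
out = blk(torch.randn(1, 32, 128, 128))  # output has the same shape as the input
```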

At each scale, the features of frames Bt−1 and Bt are concatenated, and a channel attention mechanism [30] follows to fuse them into a single feature map.

MOST-Net uses the homography outputs from lower scales to warp the encoder features from the previous time step.
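For the warping step, something along these lines would work (a sketch only; I use kornia's warp_perspective for convenience, the paper does not say which warping implementation is used). The homography estimated at the lower scale is first rescaled to the current resolution:

```python
import torch
import kornia.geometry.transform as KG  # assumption: kornia is available

def upscale_homography(H, scale=2.0):
    """Adjust a 3x3 homography estimated at a lower scale to a 2x larger image (sketch)."""
    S = torch.diag(torch.tensor([scale, scale, 1.0], device=H.device))
    return S @ H @ torch.inverse(S)

def warp_prev_features(feat_prev, H_low):
    """Warp encoder features of frame t-1 with the (upscaled) lower-scale homography."""
    B, C, h, w = feat_prev.shape
    H_cur = upscale_homography(H_low)                   # bring H to the current scale
    return KG.warp_perspective(feat_prev, H_cur.expand(B, 3, 3), dsize=(h, w))

feat_prev = torch.randn(1, 32, 64, 64)   # encoder features from frame t-1 at the current scale
H_low = torch.eye(3).unsqueeze(0)        # homography predicted at the lower scale (batch of 1)
aligned = warp_prev_features(feat_prev, H_low)
```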

Decoders

The encoder features are passed to the expanding blocks scale-wise via the skip connections.

At the lowest scale (s = 3), the encoder features are passed directly to a stack of two residual blocks with 128 output channels. Transposed convolutions with stride 2 are then used twice to recover the resolution.

At higher scales (s < 3), the encoder features are first concatenated with the upsampled decoder features and convolved with 3 × 3 kernels to halve the number of channels (why halve the channels?). Subsequently, they are propagated through two residual blocks with 64 and 32 output channels. The residual block outputs constitute scale-specific shared backbones. Lightweight task-specific branches follow to estimate the dense outputs: one 3 × 3 convolution estimates the segmentation mask Mt, and two 3 × 3 convolutions, separated by ReLU, yield the restored frame Rt at each scale.

At each scale, homography estimation modules estimate 4 corner offsets, related one-to-one to homographies via the Direct Linear Transformation (DLT) as in [5,12]. The motion gated attention modules multiply the backbone features with the segmentation masks to filter out context irrelevant to the motion of the teeth. The channel dimensionality is then halved by a 3 × 3 convolution, while a second convolution extracts features from the restored output. The concatenation of the two streams forms the features fed to the homography estimation module.
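My reading of the motion gated attention module, as a sketch (the channel counts and the sigmoid gating are my assumptions):

```python
import torch
import torch.nn as nn

class MotionGatedAttention(nn.Module):
    """Gate backbone features with the predicted teeth mask, then fuse with restored-frame features (sketch)."""
    def __init__(self, feat_ch):
        super().__init__()
        self.halve = nn.Conv2d(feat_ch, feat_ch // 2, 3, padding=1)     # halves the gated-feature channels
        self.from_restored = nn.Conv2d(3, feat_ch // 2, 3, padding=1)   # features from the restored RGB frame

    def forward(self, feat, mask_logits, restored):
        gated = feat * torch.sigmoid(mask_logits)   # suppress context unrelated to tooth motion
        a = self.halve(gated)
        b = self.from_restored(restored)
        return torch.cat([a, b], dim=1)             # features passed on to the homography module

mga = MotionGatedAttention(feat_ch=32)
g = mga(torch.randn(1, 32, 64, 64), torch.randn(1, 1, 64, 64), torch.randn(1, 3, 64, 64))
```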

Homography Estimation Module: at each scale, the motion-gated features are used to predict the offsets with shallow downstream networks. Predicted offsets at lower scales are transformed back into homographies and cascaded bottom-up [12] to refine those at higher scales.

Similarly to [5], we use blocks of 3 × 3 convolutions coupled with ReLU, batch normalization and max-pooling to reduce the spatial size of the features. Before the regression layer, a dropout of 0.2 is applied. For s = 1, the convolution output channels are 64, 128, 256, 256 and 256. For s = 2, 3 the network depth is cropped from the second and third layers onwards, respectively.
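A minimal sketch of such a regression head for s = 1 (4 corner offsets = 8 regressed values; the layer widths follow the text, the pooling-plus-linear regression layer is my filling-in):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # 3x3 conv + batch norm + ReLU + max-pooling, as described for the offset regressor
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class OffsetRegressor(nn.Module):
    """Predicts 4 corner offsets (8 values) defining a homography via the DLT (sketch)."""
    def __init__(self, in_ch):
        super().__init__()
        chans = [64, 128, 256, 256, 256]           # reported output channels for s = 1
        layers, c = [], in_ch
        for cout in chans:
            layers.append(conv_block(c, cout))
            c = cout
        self.features = nn.Sequential(*layers)
        self.dropout = nn.Dropout(0.2)
        self.regress = nn.Linear(chans[-1], 8)     # 4 offsets x (dx, dy)

    def forward(self, x):
        f = self.features(x).mean(dim=(2, 3))      # global average pooling before the regression layer
        return self.regress(self.dropout(f))

head = OffsetRegressor(in_ch=32)
offsets = head(torch.randn(1, 32, 128, 128))       # -> tensor of shape (1, 8)
```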

Task-Specific Branches

(I added this part myself with GPT's help; I hadn't done multi-task learning before, so this is just to aid my own understanding.)

Each task (colorization, motion estimation, segmentation) is handled by a separate branch of the network. These branches can be seen in the figure as the paths where F1, F2, F3 (the features at different scales) pass through different processing stages (e.g., motion gated attention, channel attention, homography estimation) to produce task-specific outputs, such as the colorized frame Rt, the mask Mt, and the homography Ht.

The network is optimized for multiple tasks by using shared features across different task-specific branches, while each branch focuses on a particular task's output (colorization, segmentation, motion estimation). The losses corresponding to each task are computed separately and combined in the final objective function, which allows the model to learn multiple tasks simultaneously while sharing common feature representations.

Experiment

Dataset

Vident-lab: a dataset for multi-task video processing of phantom dental scenes - Open Research Data - Bridge of Knowledge

  • Frame-to-Frame (F2F) Training:

    • The model is trained using static video fragments recorded with a camera (C1). The goal is to apply a trained image denoiser to clean the noisy frames, obtaining denoised frames and their noise maps.
  • Denoising Process:

    • The noisy frames are first denoised with the trained model. The denoised frames are then temporally interpolated and averaged over a 17-frame window to synthesize a realistic motion-blur effect.
  • Adding Noise:

    • After the blur is synthesized (the denoised frames are temporally interpolated [19] 8 times and averaged over a temporal window of 17 frames to synthesize realistic blur), the noise maps are added to the blurry frames to form the input video frames (B). The noise maps represent the original noise that would have been present in the actual noisy frames (a sketch of this synthesis pipeline follows the list).
  • Colorization: registration of frames between two different modalities C1 and C2

    • To generate the output (ground-truth) video frames (R), the frames from camera C1 are colorized using colors captured by a second camera (C2).
    • Specifically, the C1 frames are colorized based on the C2 data to form the colorized ground-truth frames. This avoids the difficulty of aligning frames between the two cameras to obtain exact pixel-to-pixel correspondences.
  • Color Mapping Network:

    • A color mapping (CM) network is learned to predict the parameters of 3D functions that map the dental scene colors of camera C2 onto the frames of camera C1. This achieves precise color mapping and ensures accurate spatial correspondence between frames B and R.
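Gathering the B-frame synthesis steps into one sketch (purely illustrative; `denoiser` and `interpolate_x8` are hypothetical stand-ins for the trained image denoiser and the frame-interpolation method [19], which I do not reproduce here):

```python
import numpy as np

def synthesize_input_frames(noisy_frames, denoiser, interpolate_x8, window=17):
    """Build degraded inputs B: denoise, interpolate 8x, average over a 17-frame window, re-add noise."""
    denoised = [denoiser(f) for f in noisy_frames]
    noise_maps = [f.astype(np.float32) - d.astype(np.float32) for f, d in zip(noisy_frames, denoised)]
    dense = interpolate_x8(denoised)                     # 8x temporal upsampling of the denoised clip
    blurry = []
    for i in range(len(denoised)):
        c = 8 * i                                        # index of the original frame within the dense clip
        lo, hi = max(0, c - window // 2), c + window // 2 + 1
        blurry.append(np.mean(dense[lo:hi], axis=0))     # temporal averaging -> synthetic motion blur
    # add the original per-frame noise back on top of the blurred frames
    return [np.clip(b + n, 0, 255).astype(np.uint8) for b, n in zip(blurry, noise_maps)]
```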

Segmentation masks and homographies

HRNet48 [22], pretrained on ImageNet, is fine-tuned on our annotations to automatically segment the teeth in the remaining frames of all three sets. We compute optical flow between consecutive clean frames with RAFT [23]. The motion fields are cropped with the teeth masks Mt to discard other moving objects, such as the dental bur or the suction tube, since we are interested in stabilizing the videos with respect to the teeth. Subsequently, a partial affine homography H is fitted to the segmented motion field by RANSAC.
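A sketch of how such a partial affine transform could be fitted to the masked flow with RANSAC (I use OpenCV here for illustration; the authors do not state which solver they used):

```python
import cv2
import numpy as np

def fit_partial_affine(flow, mask):
    """Fit a partial affine transform (rotation + scale + translation) to the flow inside the teeth mask."""
    ys, xs = np.nonzero(mask)                            # pixels belonging to the teeth
    src = np.stack([xs, ys], axis=1).astype(np.float32)
    dst = src + flow[ys, xs]                             # flow has shape (H, W, 2): per-pixel (dx, dy)
    A, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC, ransacReprojThreshold=1.0)
    return np.vstack([A, [0, 0, 1]]).astype(np.float32)  # lift the 2x3 affine to a 3x3 homography

# toy usage with a zero flow field (yields the identity homography)
flow = np.zeros((120, 160, 2), np.float32)
mask = np.zeros((120, 160), np.uint8)
mask[30:90, 40:120] = 1
H = fit_partial_affine(flow, mask)
```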

Setup

We train, validate, and test all methods on our dataset (Tab. 1). In all MOST-Net training runs, we set λ1, λ2, λ3 to 2 × 10^−4, 5 × 10^−5, and 1 for balancing the tasks in Eq. 4.

Training data is augmented by horizontal and vertical flips with 0.5 probability, random channel perturbations, and color jittering, following [31].

Batch size 16, Adam optimizer, learning rate 1e−4, decayed to 1e−6 with cosine annealing.

PyTorch 1.10 (FP32). Inference speed is reported in frames per second (FPS) on an NVIDIA RTX 5000 GPU.
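The optimizer setup as I understand it (a sketch with dummy stand-ins so it runs; replace them with MOST-Net and the Vident-lab dataloader):

```python
import torch
import torch.nn as nn

# stand-ins so the snippet is self-contained; swap in MOST-Net and the real dataloader
model = nn.Conv2d(3, 3, 3, padding=1)
train_loader = [torch.randn(16, 3, 64, 64) for _ in range(4)]   # batch size 16
num_epochs = 3

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)

for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model(batch).abs().mean()   # placeholder for the weighted multi-task loss (Eq. 4)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # cosine decay of the learning rate from 1e-4 towards 1e-6
```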

Results
