[RDMA] Zero Touch RoCE (ZTR): deploying RoCE without configuring PFC and ECN

Contents

What is Zero Touch RoCE (ZTR)

Hardware and software requirements

Usage

Implementation mechanism

How ZTR-RTTCC works

ZTR-RTTCC performance

Official documentation


What is Zero Touch RoCE (ZTR)

Zero Touch RoCE (ZTR) is an NVIDIA technology that allows RDMA over Converged Ethernet (RoCE) to be deployed without any special switch configuration. ZTR simplifies the deployment and management of RoCE networks, improving data center efficiency and flexibility.

"Zero Touch" means no manual intervention: a RoCE network can be deployed and run without complex network configuration or tuning.

The main characteristics of ZTR:

  1. No switch configuration: traditional RoCE deployments require Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) to be configured on the switches. ZTR removes these requirements, greatly simplifying deployment.

  2. Seamless integration: ZTR lets RoCE traffic run alongside non-RoCE traffic in an ordinary TCP/IP environment, with no major changes to the existing network architecture. Data centers can therefore introduce RoCE gradually without disrupting existing services.

  3. High performance: ZTR relies on the NVIDIA-developed Round-Trip Time Congestion Control (RTTCC) algorithm to actively monitor and adapt to network congestion. RTTCC implements dynamic congestion control with a hardware-based feedback loop, which delivers significantly better performance than software-based congestion control algorithms.

  4. Scalability: with RTTCC, ZTR scales to thousands of servers without relying on packet loss to signal congestion, meeting the needs of large data centers and cloud platforms.

Hardware and software requirements

  • NVIDIA/Mellanox ConnectX-6 or later SmartNICs
  • Software configuration only: ZTR is enabled through NIC firmware settings (see below); no switch-side changes are needed

Usage

(Adapted from: https://blog.csdn.net/essencelite/article/details/137212816)

Enable programmable congestion control:

Configure the network interface card to use the ZTR-RTTCC congestion control algorithm:

mlxconfig -d /dev/mst/mt4125_pciconf0 -y s ROCE_CC_LEGACY_DCQCN=0

Setting ROCE_CC_LEGACY_DCQCN to 0 disables the legacy DCQCN algorithm, enabling ZTR-RTTCC.

Reset the device or reboot the host:

After changing the configuration, reset the network device or reboot the host for the change to take effect. For example:

mlxfwreset -d /dev/mst/mt4125_pciconf0 -l 3 -y r

Using ZTR-RTTCC:

Once the steps above are complete, ZTR-RTTCC is used automatically when connections are established through RDMA-CM (RoCE CM).

Forcing ZTR-RTTCC:

If necessary, ZTR-RTTCC can be forced on even when RDMA-CM has not synchronized its state, by writing the access register directly with mlxreg:

mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x4.0:4=15" -y

This configures the register directly, bypassing RDMA-CM negotiation.
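To confirm that the persistent setting actually took effect after the reset, the query side of mlxconfig can be scripted. The sketch below is illustrative, not official tooling: it assumes mlxconfig is on PATH, that the device path matches your system, and that query output contains lines of the form `PARAM_NAME   value` (the exact output format can vary between firmware and tool versions).

```python
import re
import subprocess

def parse_mlxconfig_query(text):
    """Parse `mlxconfig q` style output into {param: raw_value}.
    Assumes one 'NAME   value' pair per line; this format is an assumption."""
    params = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Z0-9_]+)\s+(\S+)\s*$", line)
        if m:
            params[m.group(1)] = m.group(2)
    return params

def legacy_dcqcn_disabled(params):
    """ZTR-RTTCC needs ROCE_CC_LEGACY_DCQCN=0 (often reported as 'False(0)')."""
    return "0" in params.get("ROCE_CC_LEGACY_DCQCN", "")

def check_device(device="/dev/mst/mt4125_pciconf0"):
    """Query the NIC and report whether the legacy DCQCN algorithm is off.
    Requires the mlxconfig tool and a configured MST device."""
    out = subprocess.run(["mlxconfig", "-d", device, "q"],
                         capture_output=True, text=True, check=True).stdout
    return legacy_dcqcn_disabled(parse_mlxconfig_query(out))
```

On a card where the steps above succeeded, `check_device()` should return True after the reset.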

Implementation mechanism

The ZTR-RTTCC congestion control algorithm actively monitors network round-trip time (RTT), detecting and adapting to the onset of congestion before packets are dropped.

The ZTR-RTTCC algorithm is described in the next section.

How ZTR-RTTCC works

Reference: Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control | NVIDIA Technical Blog

ZTR-RTTCC extends DCQCN in RoCE networks with a hardware RTT-based congestion control algorithm. It implements dynamic congestion control with a hardware-based feedback loop, which delivers significantly better performance than software-based congestion control algorithms.

The main characteristics of the ZTR-RTTCC algorithm:

  • Implemented on the Data Path Accelerator (DPA)
  • RTT-based congestion control
  • The current default congestion control algorithm for RoCE
  • Outperforms Data Center Quantized Congestion Notification (DCQCN) on high-performance computing (HPC) and AI workloads
  • Matches DCQCN's good performance on storage workloads

Timing packets (green network packets in the preceding figure) are periodically sent from the initiator to the target. The timing packets are immediately returned, enabling measurement of round-trip latency. RTTCC measures the time interval between when the packet was sent and when the initiator received it. The difference (Time Received – Time Sent) measures round-trip latency which indicates path congestion. Uncongested flows continue to transmit packets to utilize the available network path bandwidth best. Flows showing increasing latency imply path congestion, for which RTTCC throttles traffic to avoid buffer overflow and packet drops.

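The measurement described above can be modeled in a few lines. The sketch below is a toy software model, not NVIDIA's implementation (the real feedback loop runs in NIC hardware): it timestamps timing packets, tracks the lowest RTT seen as an uncongested baseline, and flags a flow as congested once RTT rises well above that baseline. The threshold factor is an illustrative assumption.

```python
class RttMonitor:
    """Toy model of RTT-based congestion detection (not the NIC's hardware loop)."""

    def __init__(self, threshold=1.5):
        self.baseline = None        # lowest RTT observed: estimate of the uncongested path
        self.threshold = threshold  # RTT growth factor treated as congestion (assumed)

    def record(self, time_sent, time_received):
        """RTT of one timing packet: Time Received - Time Sent."""
        rtt = time_received - time_sent
        if self.baseline is None or rtt < self.baseline:
            self.baseline = rtt
        return rtt

    def congested(self, rtt):
        """Increasing latency implies queue buildup along the path."""
        return rtt > self.baseline * self.threshold


mon = RttMonitor()
quiet = mon.record(time_sent=0.0, time_received=0.010)   # 10 ms round trip
loaded = mon.record(time_sent=1.0, time_received=1.025)  # 25 ms: queues building
# mon.congested(quiet) is False; mon.congested(loaded) is True
```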

Network traffic can be adjusted either up or down in real-time as congestion decreases or increases. The ability to actively monitor and react to congestion is critical to enabling ZTR to manage congestion proactively. This proactive rate control also results in reduced packet re-transmission and improved RoCE performance. With ZTR-RTTCC, data center nodes do not wait to be notified of packet loss; instead, they actively identify congestion prior to packet loss and react accordingly, notifying initiators to adjust transmission rates.

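As a hedged sketch of this up/down adjustment, the controller below uses an AIMD shape (additive increase, multiplicative decrease), a common pattern for such rate control; the actual ZTR-RTTCC update rule is not published, so the policy and constants here are illustrative assumptions only.

```python
def update_rate(rate_gbps, congested, link_rate=100.0, step=1.0, backoff=0.5):
    """One control-loop step (illustrative AIMD policy, not the real algorithm):
    throttle multiplicatively on congestion to avoid buffer overflow and drops,
    otherwise probe additively back toward line rate."""
    if congested:
        return rate_gbps * backoff
    return min(rate_gbps + step, link_rate)

rate = 100.0                                # Gb/s, at line rate
rate = update_rate(rate, congested=True)    # backed off to 50.0 Gb/s
rate = update_rate(rate, congested=False)   # probing up again: 51.0 Gb/s
```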

As noted earlier, one of the key benefits of ZTR is the ability to provide RoCE functionality while operating simultaneously with non-RoCE communications in ordinary TCP/IP traffic. ZTR provides seamless deployment of RoCE network capabilities. With the addition of RTTCC actively monitoring congestion, ZTR provides data center-wide operation without switch configuration. Read on to see how it performs.


ZTR-RTTCC performance

As shown in Figure 2, ZTR-RTTCC performs comparably to RoCE with PFC and ECN enabled. These tests were performed under worst-case many-to-one (incast) scenarios to simulate throughput under congested conditions.

The results indicate that ZTR-RTTCC not only scales to thousands of nodes but also performs comparably to the fastest RoCE solutions currently available.

  • At small scale (256 connections and below), ZTR-RTTCC throughput is within 99% of RoCE with ECN congestion control enabled (conventional RoCE).
  • With over 16,000 connections, ZTR-RTTCC throughput is 98% of conventional RoCE throughput.

ZTR with RTTCC delivers near-equivalent performance to conventional RoCE without requiring any switch configuration.

Official documentation

Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control | NVIDIA Technical Blog

ZTR-RTT Congestion Control Algorithm Overview v1.0 - NVIDIA Docs

Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control

Zero Touch RoCE enables a smooth data highway

NVIDIA Zero Touch RoCE (ZTR) enables data centers to seamlessly deploy RDMA over Converged Ethernet (RoCE) without requiring any special switch configuration. Until recently, ZTR was optimal for only small to medium-sized data centers. Meanwhile, large-scale deployments have traditionally relied on Explicit Congestion Notification (ECN) to enable RoCE network transport, which requires switch configuration.

The new NVIDIA congestion control algorithm—Round-Trip Time Congestion Control (RTTCC)—allows ZTR to scale to thousands of servers without compromising performance. Using ZTR and RTTCC allows data center operators to enjoy ease-of-deployment and operations together with the superb performance of Remote Direct Memory Access (RDMA) at a massive scale, without any switch configuration. 

This post describes the previously recommended RoCE congestion control in large and small-scale RoCE deployments. It then introduces a new congestion control algorithm that allows configuration-free, large-scale implementations of ZTR, which perform like ECN-enabled RoCE. 

RoCE deployments with Data Center Quantized Congestion Notification

In a typical TCP-based environment, distributed memory requests require many steps and CPU cycles, negatively impacting application performance.  RDMA eliminates all CPU involvement in memory data transfers between servers significantly accelerating both access to stored data and application performance. 

RoCE provides RDMA in Ethernet environments—the primary network fabric in data centers. Ethernet requires an advanced congestion control mechanism to support RDMA network transports. Data Center Quantized Congestion Notification (DCQCN) is a congestion control algorithm that enables responding to congestion notifications and dynamically adjusting traffic transmit rates. 

The implementation of DCQCN requires enabling Explicit Congestion Notification (ECN), which entails configuring network switches. ECN configures switches to set the Congestion Experienced (CE) bit to indicate the imminent onset of congestion. 

Zero touch RoCE—with reactive congestion control 

The NVIDIA-developed ZTR technology allows RoCE deployments, which don’t require configuring the switch infrastructure. Built according to the InfiniBand Trade Association (IBTA) RDMA standard and fully compliant with the RoCE specifications, ZTR enables seamless deployment of RoCE. ZTR also boasts performance equivalent to traditional switch-enabled RoCE and is significantly better than traditional TCP-based memory access. Moreover, with ZTR, RoCE network transport services operate side-by-side with non-RoCE communications in ordinary TCP/IP environments.

As noted in the NVIDIA Zero-Touch RoCE Technology Enables Cloud Economics for Microsoft Azure Stack HCI post, Microsoft has validated ZTR for their Azure Stack HCI platform, which typically scales to a few dozen nodes. In such environments, ZTR relies on implicit packet loss notification, which is sufficient for small-scale deployments. Adding a new Round Trip Timer (RTT)-based congestion control algorithm, ZTR becomes even more robust and scalable without relying on packet loss to notify the server of network congestion.

Introducing round-trip time congestion control

The new NVIDIA congestion control algorithm, RTTCC, actively monitors network RTT to proactively detect and adapt to the onset of congestion before dropping packets. RTTCC enables dynamic congestion control using a hardware-based feedback loop that provides dramatically superior performance compared to software-based congestion control algorithms. RTTCC also supports faster transmission rates and can deploy ZTR at a larger scale. ZTR with RTTCC is now available as a beta feature, with GA planned for the second half of 2022.

How ZTR-RTTCC works

ZTR-RTTCC extends DCQCN in RoCE networks with a hardware RTT-based congestion control algorithm.

Server A (the initiator) sends both payload and timing packets to server B. Timing packets are immediately returned to the initiator, enabling it to measure the round-trip latency.

Figure 1. Round trip timing between servers

Timing packets (green network packets in the preceding figure) are periodically sent from the initiator to the target. The timing packets are immediately returned, enabling measurement of round-trip latency. RTTCC measures the time interval between when the packet was sent and when the initiator received it. The difference (Time Received – Time Sent) measures round-trip latency which indicates path congestion. Uncongested flows continue to transmit packets to utilize the available network path bandwidth best. Flows showing increasing latency imply path congestion, for which RTTCC throttles traffic to avoid buffer overflow and packet drops.

Network traffic can be adjusted either up or down in real-time as congestion decreases or increases. The ability to actively monitor and react to congestion is critical to enabling ZTR to manage congestion proactively. This proactive rate control also results in reduced packet re-transmission and improved RoCE performance. With ZTR-RTTCC, data center nodes do not wait to be notified of packet loss; instead, they actively identify congestion prior to packet loss and react accordingly, notifying initiators to adjust transmission rates.

As noted earlier, one of the key benefits of ZTR is the ability to provide RoCE functionality while operating simultaneously with non-RoCE communications in ordinary TCP/IP traffic. ZTR provides seamless deployment of RoCE network capabilities. With the addition of RTTCC actively monitoring congestion, ZTR provides data center-wide operation without switch configuration. Read on to see how it performs.

ZTR with RTTCC performance

As shown in Figure 2, ZTR with RTTCC provides application performance comparable to RoCE when ECN and PFC are configured across the network fabric. These tests were performed under worst case many-to-one (in cast) scenarios to simulate the throughput under congested conditions. 

The results indicate that not only does ZTR with RTTCC scale to thousands of nodes, but it also performs comparably to the fastest RoCE solution currently available.

  • At small scale (256 connections and below), ZTR with RTTCC performs within 99% of RoCE with ECN congestion control enabled (conventional RoCE).
  • With over 16,000 connections, ZTR with RTTCC throughput is 98% of conventional RoCE throughput.

ZTR with RTTCC provides near-equivalent performance to conventional RoCE without requiring any switch configuration.

[Figure: comparison of network throughput (Gb/s) for ZTR w/ RTTCC vs. RoCE w/ DCQCN (conventional RoCE)]

Figure 2. Application bandwidth with increasing connections

Configuring ZTR

To configure ZTR with the new RTTCC algorithm, download and install the latest firmware and tools for your NVIDIA network interface card and perform the following steps.

Enable programmable congestion control using mlxconfig (persistent configuration):

mlxconfig -d /dev/mst/mt4125_pciconf0 -y s ROCE_CC_LEGACY_DCQCN=0

Reset the device using mlxfwreset or reboot the host:

mlxfwreset -d /dev/mst/mt4125_pciconf0 -l 3 -y r

When you complete these steps, ZTR-RTTCC is used when RDMA-CM is used with Enhanced Connection Establishment (ECE, supported with MLNX_OFED version 5.1). 

If there’s an error, you can force ZTR-RTTCC usage regardless of RDMA-CM synchronization status:

mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x4.0:4=15" -y

Summary

NVIDIA RTTCC, the new congestion control algorithm for ZTR, delivers superb RoCE performance at data center scale, without any special configuration of the switch infrastructure. This enhancement allows data centers to enable RoCE seamlessly in both existing and new data center infrastructure and benefit from immediate application performance improvements. 

We encourage you to test ZTR with RTTCC for your application use cases by downloading the latest NVIDIA software.

