[RDMA] Zero Touch RoCE (ZTR): deploying RoCE without configuring PFC and ECN

Contents

What is Zero Touch RoCE (ZTR)

Hardware and software requirements

Usage

Implementation mechanism

How ZTR-RTTCC works

ZTR-RTTCC performance

Official documentation


What is Zero Touch RoCE (ZTR)

Zero Touch RoCE (ZTR) is an NVIDIA technology that allows RDMA over Converged Ethernet (RoCE) to be deployed without any special switch configuration. ZTR simplifies the deployment and management of RoCE networks, improving data center efficiency and flexibility.

"Zero Touch" means no manual intervention: a RoCE network can be deployed and run without complex network configuration or tuning.

The main characteristics of ZTR:

  1. No switch configuration: traditional RoCE deployments require Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) to be configured on the switches. ZTR removes these requirements, greatly simplifying deployment.

  2. Seamless integration: ZTR lets RoCE traffic run alongside non-RoCE traffic in an ordinary TCP/IP environment, with no major changes to the existing network architecture. Data centers can therefore introduce RoCE gradually without disrupting existing services.

  3. High performance: ZTR relies on the NVIDIA-developed Round-Trip Time Congestion Control (RTTCC) algorithm to actively monitor and adapt to network congestion. RTTCC implements dynamic congestion control with a hardware-based feedback loop, which delivers significantly better performance than software-based congestion control algorithms.

  4. Scalability: with RTTCC, ZTR scales to thousands of servers without relying on packet loss to signal congestion, meeting the needs of large data centers and cloud platforms.

Hardware and software requirements

  • NVIDIA/Mellanox ConnectX-6 or later SmartNICs
  • Software configuration only: ZTR is enabled through NIC firmware settings (see below); no switch-side changes are needed

Usage

(Adapted from: https://blog.csdn.net/essencelite/article/details/137212816)

Enable programmable congestion control:

Configure the network interface card to use the ZTR-RTTCC congestion control algorithm:

mlxconfig -d /dev/mst/mt4125_pciconf0 -y s ROCE_CC_LEGACY_DCQCN=0

Setting ROCE_CC_LEGACY_DCQCN to 0 disables the legacy DCQCN algorithm, enabling ZTR-RTTCC.

Reset the device or reboot the host:

After changing the configuration, reset the network device or reboot the host for the change to take effect. For example:

mlxfwreset -d /dev/mst/mt4125_pciconf0 -l 3 -y r

Using ZTR-RTTCC:

Once the steps above are complete, ZTR-RTTCC is used automatically when connections are established through RDMA-CM (RoCE CM).

Forcing ZTR-RTTCC:

If necessary, ZTR-RTTCC can be forced on even when RDMA-CM has not synchronized its state, by writing the access register directly with mlxreg:

mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x4.0:4=15" -y

This configures the register directly, bypassing RDMA-CM negotiation.
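To confirm that the persistent setting actually took effect after the reset, the query side of mlxconfig can be scripted. The sketch below is illustrative, not official tooling: it assumes mlxconfig is on PATH, that the device path matches your system, and that query output contains lines of the form `PARAM_NAME   value` (the exact output format can vary between firmware and tool versions).

```python
import re
import subprocess

def parse_mlxconfig_query(text):
    """Parse `mlxconfig q` style output into {param: raw_value}.
    Assumes one 'NAME   value' pair per line; this format is an assumption."""
    params = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Z0-9_]+)\s+(\S+)\s*$", line)
        if m:
            params[m.group(1)] = m.group(2)
    return params

def legacy_dcqcn_disabled(params):
    """ZTR-RTTCC needs ROCE_CC_LEGACY_DCQCN=0 (often reported as 'False(0)')."""
    return "0" in params.get("ROCE_CC_LEGACY_DCQCN", "")

def check_device(device="/dev/mst/mt4125_pciconf0"):
    """Query the NIC and report whether the legacy DCQCN algorithm is off.
    Requires the mlxconfig tool and a configured MST device."""
    out = subprocess.run(["mlxconfig", "-d", device, "q"],
                         capture_output=True, text=True, check=True).stdout
    return legacy_dcqcn_disabled(parse_mlxconfig_query(out))
```

On a card where the steps above succeeded, `check_device()` should return True after the reset.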

Implementation mechanism

The ZTR-RTTCC congestion control algorithm actively monitors network round-trip time (RTT), detecting and adapting to the onset of congestion before packets are dropped.

The ZTR-RTTCC algorithm is described in the next section.

How ZTR-RTTCC works

Reference: Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control | NVIDIA Technical Blog

ZTR-RTTCC extends DCQCN in RoCE networks with a hardware RTT-based congestion control algorithm. It implements dynamic congestion control with a hardware-based feedback loop, which delivers significantly better performance than software-based congestion control algorithms.

The main characteristics of the ZTR-RTTCC algorithm:

  • Implemented on the Data Path Accelerator (DPA)
  • RTT-based congestion control
  • The current default congestion control algorithm for RoCE
  • Outperforms Data Center Quantized Congestion Notification (DCQCN) on high-performance computing (HPC) and AI workloads
  • Matches DCQCN's good performance on storage workloads

Timing packets (green network packets in the preceding figure) are periodically sent from the initiator to the target. The timing packets are immediately returned, enabling measurement of round-trip latency. RTTCC measures the time interval between when the packet was sent and when the initiator received it. The difference (Time Received – Time Sent) measures round-trip latency which indicates path congestion. Uncongested flows continue to transmit packets to utilize the available network path bandwidth best. Flows showing increasing latency imply path congestion, for which RTTCC throttles traffic to avoid buffer overflow and packet drops.

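The measurement described above can be modeled in a few lines. The sketch below is a toy software model, not NVIDIA's implementation (the real feedback loop runs in NIC hardware): it timestamps timing packets, tracks the lowest RTT seen as an uncongested baseline, and flags a flow as congested once RTT rises well above that baseline. The threshold factor is an illustrative assumption.

```python
class RttMonitor:
    """Toy model of RTT-based congestion detection (not the NIC's hardware loop)."""

    def __init__(self, threshold=1.5):
        self.baseline = None        # lowest RTT observed: estimate of the uncongested path
        self.threshold = threshold  # RTT growth factor treated as congestion (assumed)

    def record(self, time_sent, time_received):
        """RTT of one timing packet: Time Received - Time Sent."""
        rtt = time_received - time_sent
        if self.baseline is None or rtt < self.baseline:
            self.baseline = rtt
        return rtt

    def congested(self, rtt):
        """Increasing latency implies queue buildup along the path."""
        return rtt > self.baseline * self.threshold


mon = RttMonitor()
quiet = mon.record(time_sent=0.0, time_received=0.010)   # 10 ms round trip
loaded = mon.record(time_sent=1.0, time_received=1.025)  # 25 ms: queues building
# mon.congested(quiet) is False; mon.congested(loaded) is True
```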

Network traffic can be adjusted either up or down in real-time as congestion decreases or increases. The ability to actively monitor and react to congestion is critical to enabling ZTR to manage congestion proactively. This proactive rate control also results in reduced packet re-transmission and improved RoCE performance. With ZTR-RTTCC, data center nodes do not wait to be notified of packet loss; instead, they actively identify congestion prior to packet loss and react accordingly, notifying initiators to adjust transmission rates.

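As a hedged sketch of this up/down adjustment, the controller below uses an AIMD shape (additive increase, multiplicative decrease), a common pattern for such rate control; the actual ZTR-RTTCC update rule is not published, so the policy and constants here are illustrative assumptions only.

```python
def update_rate(rate_gbps, congested, link_rate=100.0, step=1.0, backoff=0.5):
    """One control-loop step (illustrative AIMD policy, not the real algorithm):
    throttle multiplicatively on congestion to avoid buffer overflow and drops,
    otherwise probe additively back toward line rate."""
    if congested:
        return rate_gbps * backoff
    return min(rate_gbps + step, link_rate)

rate = 100.0                                # Gb/s, at line rate
rate = update_rate(rate, congested=True)    # backed off to 50.0 Gb/s
rate = update_rate(rate, congested=False)   # probing up again: 51.0 Gb/s
```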

As noted earlier, one of the key benefits of ZTR is the ability to provide RoCE functionality while operating simultaneously with non-RoCE communications in ordinary TCP/IP traffic. ZTR provides seamless deployment of RoCE network capabilities. With the addition of RTTCC actively monitoring congestion, ZTR provides data center-wide operation without switch configuration. Read on to see how it performs.


ZTR-RTTCC performance

As shown in Figure 2, ZTR-RTTCC performs comparably to RoCE with PFC and ECN enabled. These tests were performed under worst-case many-to-one (incast) scenarios to simulate throughput under congested conditions.

The results indicate that ZTR-RTTCC not only scales to thousands of nodes but also performs comparably to the fastest RoCE solutions currently available.

  • At small scale (256 connections and below), ZTR-RTTCC throughput is within 99% of RoCE with ECN congestion control enabled (conventional RoCE).
  • With over 16,000 connections, ZTR-RTTCC throughput is 98% of conventional RoCE throughput.

ZTR with RTTCC delivers near-equivalent performance to conventional RoCE without requiring any switch configuration.

Official documentation

Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control | NVIDIA Technical Blog

ZTR-RTT Congestion Control Algorithm Overview v1.0 - NVIDIA Docs

Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control

Zero Touch RoCE enables a smooth data highway

NVIDIA Zero Touch RoCE (ZTR) enables data centers to seamlessly deploy RDMA over Converged Ethernet (RoCE) without requiring any special switch configuration. Until recently, ZTR was optimal for only small to medium-sized data centers. Meanwhile, large-scale deployments have traditionally relied on Explicit Congestion Notification (ECN) to enable RoCE network transport, which requires switch configuration.

The new NVIDIA congestion control algorithm—Round-Trip Time Congestion Control (RTTCC)—allows ZTR to scale to thousands of servers without compromising performance. Using ZTR and RTTCC allows data center operators to enjoy ease-of-deployment and operations together with the superb performance of Remote Direct Memory Access (RDMA) at a massive scale, without any switch configuration. 

This post describes the previously recommended RoCE congestion control in large and small-scale RoCE deployments. It then introduces a new congestion control algorithm that allows configuration-free, large-scale implementations of ZTR, which perform like ECN-enabled RoCE. 

RoCE deployments with Data Center Quantized Congestion Notification

In a typical TCP-based environment, distributed memory requests require many steps and CPU cycles, negatively impacting application performance.  RDMA eliminates all CPU involvement in memory data transfers between servers significantly accelerating both access to stored data and application performance. 

RoCE provides RDMA in Ethernet environments—the primary network fabric in data centers. Ethernet requires an advanced congestion control mechanism to support RDMA network transports. Data Center Quantized Congestion Notification (DCQCN) is a congestion control algorithm that enables responding to congestion notifications and dynamically adjusting traffic transmit rates. 

The implementation of DCQCN requires enabling Explicit Congestion Notification (ECN), which entails configuring network switches. ECN configures switches to set the Congestion Experienced (CE) bit to indicate the imminent onset of congestion. 

Zero touch RoCE—with reactive congestion control 

The NVIDIA-developed ZTR technology allows RoCE deployments, which don’t require configuring the switch infrastructure. Built according to the InfiniBand Trade Association (IBTA) RDMA standard and fully compliant with the RoCE specifications, ZTR enables seamless deployment of RoCE. ZTR also boasts performance equivalent to traditional switch-enabled RoCE and is significantly better than traditional TCP-based memory access. Moreover, with ZTR, RoCE network transport services operate side-by-side with non-RoCE communications in ordinary TCP/IP environments.

As noted in the NVIDIA Zero-Touch RoCE Technology Enables Cloud Economics for Microsoft Azure Stack HCI post, Microsoft has validated ZTR for their Azure Stack HCI platform, which typically scales to a few dozen nodes. In such environments, ZTR relies on implicit packet loss notification, which is sufficient for small-scale deployments. Adding a new Round Trip Timer (RTT)-based congestion control algorithm, ZTR becomes even more robust and scalable without relying on packet loss to notify the server of network congestion.

Introducing round-trip time congestion control

The new NVIDIA congestion control algorithm, RTTCC, actively monitors network RTT to proactively detect and adapt to the onset of congestion before dropping packets. RTTCC enables dynamic congestion control using a hardware-based feedback loop that provides dramatically superior performance compared to software-based congestion control algorithms. RTTCC also supports faster transmission rates and can deploy ZTR at a larger scale. ZTR with RTTCC is now available as a beta feature, with GA planned for the second half of 2022.

How ZTR-RTTCC works

ZTR-RTTCC extends DCQCN in RoCE networks with a hardware RTT-based congestion control algorithm.

Server A (the initiator) sends both payload and timing packets to server B. Timing packets are immediately returned to the initiator, enabling it to measure the round-trip latency.

Figure 1. Round trip timing between servers

Timing packets (green network packets in the preceding figure) are periodically sent from the initiator to the target. The timing packets are immediately returned, enabling measurement of round-trip latency. RTTCC measures the time interval between when the packet was sent and when the initiator received it. The difference (Time Received – Time Sent) measures round-trip latency which indicates path congestion. Uncongested flows continue to transmit packets to utilize the available network path bandwidth best. Flows showing increasing latency imply path congestion, for which RTTCC throttles traffic to avoid buffer overflow and packet drops.

Network traffic can be adjusted either up or down in real-time as congestion decreases or increases. The ability to actively monitor and react to congestion is critical to enabling ZTR to manage congestion proactively. This proactive rate control also results in reduced packet re-transmission and improved RoCE performance. With ZTR-RTTCC, data center nodes do not wait to be notified of packet loss; instead, they actively identify congestion prior to packet loss and react accordingly, notifying initiators to adjust transmission rates.

As noted earlier, one of the key benefits of ZTR is the ability to provide RoCE functionality while operating simultaneously with non-RoCE communications in ordinary TCP/IP traffic. ZTR provides seamless deployment of RoCE network capabilities. With the addition of RTTCC actively monitoring congestion, ZTR provides data center-wide operation without switch configuration. Read on to see how it performs.

ZTR with RTTCC performance

As shown in Figure 2, ZTR with RTTCC provides application performance comparable to RoCE when ECN and PFC are configured across the network fabric. These tests were performed under worst case many-to-one (in cast) scenarios to simulate the throughput under congested conditions. 

The results indicate that not only does ZTR with RTTCC scale to thousands of nodes, but it also performs comparably to the fastest RoCE solution currently available.

  • At small scale (256 connections and below), ZTR with RTTCC performs within 99% of RoCE with ECN congestion control enabled (conventional RoCE).
  • With over 16,000 connections, ZTR with RTTCC throughput is 98% of conventional RoCE throughput.

ZTR with RTTCC provides near-equivalent performance to conventional RoCE without requiring any switch configuration.

[Figure: comparison of network throughput (Gb/s) for ZTR w/ RTTCC vs. RoCE w/ DCQCN (conventional RoCE)]

Figure 2. Application bandwidth with increasing connections

Configuring ZTR

To configure ZTR with the new RTTCC algorithm, download and install the latest firmware and tools for your NVIDIA network interface card and perform the following steps.

Enable programmable congestion control using mlxconfig (persistent configuration):

mlxconfig -d /dev/mst/mt4125_pciconf0 -y s ROCE_CC_LEGACY_DCQCN=0

Reset the device using mlxfwreset or reboot the host:

mlxfwreset -d /dev/mst/mt4125_pciconf0 -l 3 -y r

When you complete these steps, ZTR-RTTCC is used when RDMA-CM is used with Enhanced Connection Establishment (ECE, supported with MLNX_OFED version 5.1). 

If there’s an error, you can force ZTR-RTTCC usage regardless of RDMA-CM synchronization status:

mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x4.0:4=15" -y

Summary

NVIDIA RTTCC, the new congestion control algorithm for ZTR, delivers superb RoCE performance at data center scale, without any special configuration of the switch infrastructure. This enhancement allows data centers to enable RoCE seamlessly in both existing and new data center infrastructure and benefit from immediate application performance improvements. 

We encourage you to test ZTR with RTTCC for your application use cases by downloading the latest NVIDIA software.

