- Title: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Year: 2022
- Venue: OSDI
- Institution: UCB
Traditional DNN parallelization strategies: existing distributed training systems either require users to manually craft a parallelization plan, or automatically generate plans only from a limited space of model-parallelism configurations.
- Data parallelism: replicate the model on every device and partition the training data across devices
- Operator parallelism: partition an individual operator along non-batch axes across multiple devices
- Pipeline parallelism: assign different operators or stages to different devices and pipeline their execution
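To make the three strategies concrete, here is a minimal NumPy sketch; the shapes, the 2-way splits, and the two-stage cut are made up for illustration, and a real system partitions operator graphs across devices rather than in-process arrays.

```python
import numpy as np

# Toy workload: y = x @ w1 @ w2, batch of 8, hidden size 4 (illustrative shapes).
x = np.random.rand(8, 4)
w1 = np.random.rand(4, 4)
w2 = np.random.rand(4, 4)

# Data parallelism: replicate w1/w2 on every device, split the batch.
x_shards = np.split(x, 2, axis=0)          # one shard per "device"
y_dp = np.concatenate([xs @ w1 @ w2 for xs in x_shards], axis=0)

# Operator (intra-op) parallelism: split w1 along a non-batch axis (columns);
# each device computes a partial result, and the results are concatenated
# (an all-gather in a real system).
w1_shards = np.split(w1, 2, axis=1)
h = np.concatenate([x @ ws for ws in w1_shards], axis=1)
y_op = h @ w2

# Pipeline (inter-op) parallelism: assign the two matmuls to two "stages";
# stage 1 only starts once stage 0 has produced its output (data dependency).
def stage0(inp): return inp @ w1   # runs on device group 0
def stage1(inp): return inp @ w2   # runs on device group 1
y_pp = stage1(stage0(x))

assert np.allclose(y_dp, y_op) and np.allclose(y_op, y_pp)
```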
Main contribution of this paper: it divides distributed training parallelism into inter-operator parallelism and intra-operator parallelism.
- We construct a two-level parallel execution plan space where plans are specified hierarchically using inter- and intra-operator parallelisms.
- We design tractable optimization algorithms to derive near-optimal execution plans at each level.
- We implement Alpa, a compiler system for distributed DL on GPU clusters. Alpa features: (1) a set of compilation passes that generate execution plans using the hierarchical optimization algorithms, (2) a new runtime architecture that orchestrates the inter-op parallelism between device meshes, and (3) a number of system optimizations that improve compilation and address cross-mesh communication.
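As a usage-level illustration of how these pieces surface to the user, below is a hedged sketch based on my recollection of Alpa's public tutorials; the exact names (`alpa.init`, `alpa.parallelize`, `alpa.PipeshardParallel`, `num_micro_batches`) should be checked against the Alpa documentation for the version in use.

```python
# Hedged sketch of Alpa's user-facing flow (names follow Alpa's tutorials as I
# recall them; verify against the docs of your Alpa version).
import alpa
import jax
import jax.numpy as jnp

alpa.init(cluster="ray")  # attach to a Ray cluster of GPU hosts

# PipeshardParallel asks the compiler for both levels of the plan:
# inter-op (pipeline stages across device meshes) and intra-op (sharding
# within each mesh), here with 16 pipeline micro-batches.
method = alpa.PipeshardParallel(num_micro_batches=16)

@alpa.parallelize(method=method)
def train_step(params, x, y):
    def loss_fn(p):
        return jnp.mean((x @ p["w1"] @ p["w2"] - y) ** 2)
    grads = jax.grad(loss_fn)(params)
    # plain SGD update; a real script would use an optimizer library
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
```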
- intra-operator parallelism: higher hardware utilization, but requires communication to split and merge tensors within every training iteration
- inter-operator parallelism: communication is only needed between adjacent stages, but data dependencies can leave devices idle
In the figure below, red arrows denote send/recv over slow connections and green arrows denote all-gather over fast connections.
(a) the scatter-gather optimization in Megatron-LM for equal mesh shapes
(b) send/recv for unequal mesh shapes
(c) local all-gather for unequal mesh shapes
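The benefit of the local all-gather variant in (c) can be seen with a byte-counting sketch: instead of shipping the full tensor to every destination device over the slow inter-mesh link, each destination device receives one disjoint slice over the slow link, and the destination mesh reassembles the tensor over its fast intra-mesh links. The tensor shape and mesh sizes below are made up for illustration.

```python
import numpy as np

# Tensor produced by the sender stage; the receiving mesh wants it replicated
# on 4 devices.
t = np.random.rand(8, 4)
dst_devices = 4

# Naive send/recv: the full tensor crosses the slow inter-mesh link once per
# destination device.
naive_slow_link_bytes = t.nbytes * dst_devices

# Local all-gather: send one disjoint slice per destination device over the
# slow link, then all-gather the slices inside the destination mesh over the
# fast intra-mesh links.
slices = np.split(t, dst_devices, axis=0)
allgather_slow_link_bytes = sum(s.nbytes for s in slices)  # == t.nbytes
received = [np.concatenate(slices, axis=0) for _ in range(dst_devices)]  # fast links

assert all(np.array_equal(r, t) for r in received)
print(naive_slow_link_bytes, allgather_slow_link_bytes)  # 4x less slow-link traffic
```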
- Title: On Optimizing the Communication of Model Parallelism
- Year: 2022
- Venue: MLSys
- Institution: UCB
Neither intra-op parallelism nor inter-op parallelism alone suffices to train large models. In practice, they must be combined to support large models like GPT-3.
This combined strategy is implemented in many model-parallel systems by first partitioning the computational graph using inter-op parallelism, then sharding each stage using intra-op parallelism.
Specifically, the graph is first partitioned into multiple stages. Each stage is assigned to a group of devices, referred to as a device mesh, sliced from the cluster.
Operators and tensors of a stage are parallelized over that stage's assigned mesh following a chosen intra-op parallelism plan; collective communication happens only across devices within each mesh.
At the boundary of any two adjacent stages, communication is required to exchange tensors between their meshes.
Unlike inter-op parallelism, the tensor might have been sharded with different layouts on the source and destination meshes, in which case communication involves not only transferring the tensor, but also performing tensor layout conversion between the source and destination groups of devices.
We call this communication pattern cross-mesh resharding, which is the focus of this paper.
A general cross-mesh resharding problem can be decomposed into multiple unit communication tasks, each responsible for sending one data slice. We formulate the original problem as a two-level optimization problem:
- optimizing a single unit communication task, for which broadcast is used to achieve the best performance;
- load balancing and scheduling of the many unit communication tasks within a cross-mesh resharding.
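A minimal sketch of this decomposition, assuming a simplified layout description (dicts mapping slices to their owning source device and destination devices to the slices they need); the names and data structures here are illustrative, not the paper's actual implementation. Each slice needed by some destination device becomes one unit task, broadcast from the source device that owns it.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A data slice is described by index ranges of the logical tensor, one (start,
# end) pair per axis (illustrative representation).
Slice = Tuple[Tuple[int, int], ...]

@dataclass
class UnitTask:
    """One unit communication task: broadcast one data slice from the source
    device that owns it to every destination device that needs it."""
    data_slice: Slice
    src_device: int            # owner on the source mesh
    dst_devices: List[int]     # receivers on the destination mesh

def decompose(src_layout: dict, dst_layout: dict) -> List[UnitTask]:
    """Intersect source and destination layouts: every slice that some
    destination device needs and some source device owns becomes a task."""
    tasks = []
    for s, src_dev in src_layout.items():
        receivers = [d for d, needed in dst_layout.items() if s in needed]
        if receivers:
            tasks.append(UnitTask(s, src_dev, receivers))
    return tasks

# Row-sharded tensor on a 2-device source mesh, replicated on a 4-device
# destination mesh: each row slice is broadcast from its owner to all 4 receivers.
rows = (((0, 4),), ((4, 8),))
src_layout = {rows[0]: 0, rows[1]: 1}
dst_layout = {d: set(rows) for d in range(4)}
print(decompose(src_layout, dst_layout))
```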
Cluster setting: fully connected topology between nodes, with independent send/receive bandwidth (full duplex).
Each cross-mesh resharding consists of multiple unit communication tasks. These tasks may overlap on sender and receiver devices and affect each other's performance. Therefore, to optimize the total completion time of a cross-mesh resharding, we treat the problem as a load-balancing and scheduling problem, which needs to:
(1) balance the load by evenly distributing the communication workload across sender devices and inter-host communication links, to avoid congestion and stragglers;
(2) schedule the order of the different tasks assigned to a particular device, to minimize waiting caused by unavailable senders/receivers.
- The most straightforward algorithm: assign each task to the first (i.e., lowest-indexed) device of the sender's host, and schedule all tasks in an arbitrary global order.
- Load-balance-only algorithm: assign each task to the currently least-loaded device of the sender's host.
- Depth-first search with pruning: for each host, assign an execution order to all send/recv tasks on that host; this search has exponential complexity.
- Randomized greedy search: randomly order all tasks to be scheduled, iterate through them and pick the tasks that do not overlap with previously picked ones; repeat this process many times and choose the largest candidate set for this round.
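Below is a hedged sketch of two of these heuristics on a simplified cost model: a task is just a (sender host, receiver host, bytes) triple, and the helper names and the model are illustrative rather than the paper's implementation. It shows least-loaded sender assignment for load balancing, and randomized greedy selection of non-overlapping tasks into communication rounds.

```python
import random
from collections import defaultdict

def assign_least_loaded(tasks, devices_per_host):
    """Load-balance-only: put each task on the currently least-loaded sender
    device of its sender host (instead of always the lowest-indexed one)."""
    load = defaultdict(int)                      # device -> bytes assigned
    assignment = {}
    for i, (src_host, dst_host, nbytes) in enumerate(tasks):
        dev = min(devices_per_host[src_host], key=lambda d: load[d])
        load[dev] += nbytes
        assignment[i] = dev
    return assignment

def randomized_greedy_rounds(tasks, trials=100, seed=0):
    """Randomized greedy ordering: repeatedly shuffle the remaining tasks and
    greedily pick a set whose sender/receiver hosts do not overlap; the best
    (largest) set found across trials becomes one communication round."""
    rng = random.Random(seed)
    remaining, rounds = list(range(len(tasks))), []
    while remaining:
        best = []
        for _ in range(trials):
            order = remaining[:]
            rng.shuffle(order)
            busy, chosen = set(), []
            for i in order:
                src, dst, _ = tasks[i]
                if src not in busy and dst not in busy:
                    chosen.append(i)
                    busy.update((src, dst))
            if len(chosen) > len(best):
                best = chosen
        rounds.append(best)
        remaining = [i for i in remaining if i not in best]
    return rounds

# Toy instance: 4 tasks between 2 sender hosts (0, 1) and 2 receiver hosts (2, 3).
tasks = [(0, 2, 8), (0, 3, 8), (1, 2, 4), (1, 3, 4)]
devices_per_host = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(assign_least_loaded(tasks, devices_per_host))
print(randomized_greedy_rounds(tasks))
```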