NVIDIA NCCL Source Code Study (13) - IB SHARP

Background

In previous posts we looked at the two allreduce algorithms based on ring and tree. With ring allreduce, a chunk of data has to pass through every rank in the reduce-scatter phase and then through every rank again in the allgather phase; with tree allreduce, a chunk of data has to travel up to the root in the reduce phase and back down in the broadcast phase. Either way the overall data path is fairly long.

SHARP

To address this, Mellanox proposed SHARP, which offloads the reduction computation to the IB switch: each node only needs to send its data once, the switch performs the reduction on that data, and each node then receives once to obtain the complete result.
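To make the benefit concrete, here is a rough back-of-the-envelope comparison (a sketch of my own, not taken from the SHARP documentation): for an allreduce of S bytes across N ranks, ring allreduce moves roughly 2*S*(N-1)/N bytes through every rank's link, while with SHARP each rank ideally sends S bytes to the switch once and receives S bytes back once.

#include <stdio.h>

/* Back-of-the-envelope per-rank traffic (one direction) for an allreduce of
 * `size` bytes across `nranks` ranks. Illustrative only. */
static double ringAllreduceBytesPerRank(double size, int nranks) {
  /* reduce-scatter + allgather: each phase moves (nranks-1)/nranks of the data */
  return 2.0 * size * (nranks - 1) / nranks;
}

static double sharpAllreduceBytesPerRank(double size) {
  return size;  /* send once to the switch; the reduced result comes back once */
}

int main(void) {
  double size = 256.0 * 1024 * 1024;  /* 256 MB */
  printf("ring : %.0f MB per rank\n", ringAllreduceBytesPerRank(size, 8) / (1024 * 1024));
  printf("sharp: %.0f MB per rank\n", sharpAllreduceBytesPerRank(size) / (1024 * 1024));
  return 0;
}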
Figure 1 shows a fat-tree physical network topology, where the green nodes are switches and the yellow nodes are hosts.

[Figure 1]



Figure 2 shows the SHARP tree that the hosts obtain after running SHARP initialization on top of the physical topology of Figure 1. The SHARP tree is a logical concept; the physical topology does not have to be a fat tree.
[Figure 2]

A few things worth noting:

  • A single switch can participate in at most 64 SHARP trees.
  • A SHARP tree can create a large number of groups, where a group is a subset built from the existing hosts.
  • Hundreds of collective communication operations can run concurrently.

Overall flow

This post is still based on NCCL 2.7.8. As with tree allreduce, the intra-node part is still a chain. Taking two machines as an example, data is sent along the arrow direction shown in Figure 3; after the switch computes the complete result, the data flows back in the opposite direction.
[Figure 3]

Initialization

Running IB SHARP depends on libnccl-net.so, whose source is available at https://github.com/Mellanox/nccl-rdma-sharp-plugins; this post is based on v2.0.1.
After NCCL finishes initializing the bootstrap network, it starts initializing the data network:

ncclResult_t initNetPlugin(ncclNet_t** net, ncclCollNet_t** collnet) {
  void* netPluginLib = dlopen("libnccl-net.so", RTLD_NOW | RTLD_LOCAL);
  ...
  ncclNet_t* extNet = (ncclNet_t*) dlsym(netPluginLib, STR(NCCL_PLUGIN_SYMBOL));
  if (extNet == NULL) {
    INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find " STR(NCCL_PLUGIN_SYMBOL) " symbol.");
  } else if (initNet(extNet) == ncclSuccess) {
    *net = extNet;
    ncclCollNet_t* extCollNet = (ncclCollNet_t*) dlsym(netPluginLib, STR(NCCL_COLLNET_PLUGIN_SYMBOL));
    if (extCollNet == NULL) {
      INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find " STR(NCCL_COLLNET_PLUGIN_SYMBOL) " symbol.");
    } else if (initCollNet(extCollNet) == ncclSuccess) {
      *collnet = extCollNet;
    }    
    return ncclSuccess;
  }
  if (netPluginLib != NULL) dlclose(netPluginLib);
  return ncclSuccess;
}

First libnccl-net.so is dlopened. NCCL_PLUGIN_SYMBOL is ncclNetPlugin_v3 and NCCL_COLLNET_PLUGIN_SYMBOL is ncclCollNetPlugin_v3, so the code looks up the symbol ncclNetPlugin_v3 and assigns it to extNet, and looks up ncclCollNetPlugin_v3 and assigns it to extCollNet. The relationship between extNet and extCollNet is somewhat like the relationship between NCCL's bootstrap network and its data network: extCollNet handles the actual data communication, while metadata and control information are exchanged through extNet.
Then extNet is initialized. extNet supports multiple backends such as ib and ucx; we use ib, so everything that follows actually goes through ibPlugin, whose logic is very similar to NCCL's ncclNetIb. Its initialization, ncclIbInit, simply enumerates all the NICs on the current machine and saves them into a global array.
Then initCollNet is executed, i.e. ncclSharpInit, which also just runs the ibPlugin initialization; since that has already been done, nothing happens here.
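Note that what dlsym returns for ncclNetPlugin_v3 / ncclCollNetPlugin_v3 is not a function but a global table of function pointers exported by the plugin; the collNet* calls that appear later (collNetListen, collNetConnect, collNetRegMr, collNetIallreduce, collNetTest, ...) all dispatch through this table, and ncclNetPlugin_v3 is the analogous table for the point-to-point path (init, listen, connect, accept, isend, irecv, ...). A simplified, paraphrased sketch of the collnet table is shown below; the member names are abridged from memory, and the exact v3 layout lives in the plugin's nccl_net.h.

/* Paraphrased sketch of the collnet plugin interface; the real typedef is
 * ncclCollNet_t, and ncclCollNetPlugin_v3 is one global instance of it whose
 * members point at ncclSharpInit, ncclSharpListen, ncclSharpConnect,
 * ncclSharpRegMr, ncclSharpIallreduce, and so on. */
typedef int sketchResult_t;  /* stand-in for ncclResult_t */

typedef struct {
  const char* name;  /* "SHARP" */
  sketchResult_t (*init)(void* logFunction);
  sketchResult_t (*devices)(int* ndev);
  sketchResult_t (*listen)(int dev, void* handle, void** listenComm);
  sketchResult_t (*connect)(void* handles[], int nranks, int rank,
                            void* listenComm, void** collComm);
  sketchResult_t (*regMr)(void* collComm, void* data, int size, int type,
                          void** mhandle);
  sketchResult_t (*iallreduce)(void* collComm, void* sendData, void* recvData,
                               int count, int dataType, int redOp,
                               void* sendMhandle, void* recvMhandle,
                               void** request);
  sketchResult_t (*test)(void* request, int* done, int* size);
  /* plus getProperties, reduceSupport, deregMr, flush, closeColl, closeListen */
} sketchCollNet_t;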

Graph search

  struct ncclTopoGraph collNetGraph;
  collNetGraph.id = 2; 
  collNetGraph.pattern = NCCL_TOPO_PATTERN_TREE;
  collNetGraph.collNet = 1; 
  collNetGraph.crossNic = ncclParamCrossNic();
  collNetGraph.minChannels = collNetGraph.maxChannels = ringGraph.nChannels;
  NCCLCHECK(ncclTopoCompute(comm->topo, &collNetGraph));

The pattern is NCCL_TOPO_PATTERN_TREE, so the intra-node ordering is still ring-based; the difference from NCCL_TOPO_PATTERN_SPLIT_TREE is that with NCCL_TOPO_PATTERN_TREE the 0th GPU handles both the NIC send and the NIC receive. Suppose the channel found by the search is:

NET/0 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 NET/0

Channel connection

Once the graph search completes, channel connection begins, i.e. each rank records its neighboring ranks:

ncclResult_t ncclTopoPreset(struct ncclComm* comm,
    struct ncclTopoGraph* treeGraph, struct ncclTopoGraph* ringGraph, struct ncclTopoGraph* collNetGraph,
    struct ncclTopoRanks* topoRanks) {
  int rank = comm->rank;
  int localRanks = comm->localRanks;
  int nChannels = comm->nChannels;

  for (int c=0; c<nChannels; c++) {
    struct ncclChannel* channel = comm->channels+c;
    int* collNetIntra = collNetGraph->intra+c*localRanks;

    for (int i=0; i<localRanks; i++) {
      if (collNetIntra[i] == rank) {
        int prev = (i-1+localRanks)%localRanks, next = (i+1)%localRanks;
        channel->collTreeDn.up      = collNetIntra[prev];
        channel->collTreeDn.down[0] = collNetIntra[next];
        channel->collTreeUp.down[0] = channel->collTreeDn.down[0];
        channel->collTreeUp.up      = channel->collTreeDn.up;
      }
    }
  }
  // Duplicate channels rings/trees
  struct ncclChannel* channel0 = comm->channels;
  struct ncclChannel* channel1 = channel0+nChannels;
  memcpy(channel1, channel0, nChannels*sizeof(struct ncclChannel));
  return ncclSuccess;
}

Suppose the search found 10 channels in total; the first 5 channels are then used for the SHARP uplink phase (the arrow direction in Figure 3) and the last 5 channels for the downlink phase (the reverse direction). Each channel therefore records two sets of connections, collTreeUp and collTreeDn: the first 5 uplink channels only use collTreeUp and the last 5 downlink channels only use collTreeDn. Since collTreeUp and collTreeDn are in fact identical at this point, only collTreeUp is discussed below.
After the intra-node channel connections are made, the result looks like Figure 4, where the arrows point to up.
[Figure 4]

Next the inter-node connections are handled. Essentially this means selecting the rank within each node that is responsible for network send/receive; taking send as the example, this rank is called sendIndex. Its intra-node connection has to be cut, as does the connection that leads into sendIndex.
ncclResult_t ncclTopoConnectCollNet(struct ncclComm* comm, struct ncclTopoGraph* collNetGraph, int rank) {
  int nranks = comm->nRanks;
  int depth = nranks/comm->nNodes;
  int sendIndex = collNetGraph->pattern == NCCL_TOPO_PATTERN_TREE ? 0 : 1;  // send GPU index depends on topo pattern
  int sendEndIndex = (sendIndex+comm->localRanks-1)%comm->localRanks;
  for (int c=0; c<comm->nChannels/2; c++) {
    struct ncclChannel* channel = comm->channels+c;
    // Set root of collTree to id nranks
    if (rank == collNetGraph->intra[sendIndex+c*comm->localRanks]) { // is master
      channel->collTreeUp.up = channel->collTreeDn.up = nranks;
    }   
    if (rank == collNetGraph->intra[sendEndIndex+c*comm->localRanks]) { // is bottom of intra-node chain
      channel->collTreeUp.down[0] = channel->collTreeDn.down[0] = -1; 
    }   
    channel->collTreeUp.depth = channel->collTreeDn.depth = depth;
    INFO(NCCL_GRAPH, "CollNet Channel %d rank %d up %d down %d", c, rank, channel->collTreeUp.up, channel->collTreeUp.down[0]);
  }
  int recvIndex = 0;  // recv GPU index is always 0
  int recvEndIndex = (recvIndex+comm->localRanks-1)%comm->localRanks;
  for (int c=0; c<comm->nChannels/2; c++) {
    struct ncclChannel* channel = comm->channels+comm->nChannels/2+c;
    // Set root of collTree to id nranks
    if (rank == collNetGraph->intra[recvIndex+c*comm->localRanks]) { // is master
      channel->collTreeUp.up = channel->collTreeDn.up = nranks;
    }   
    if (rank == collNetGraph->intra[recvEndIndex+c*comm->localRanks]) { // is bottom of intra-node chain
      channel->collTreeUp.down[0] = channel->collTreeDn.down[0] = -1; 
    }   
    channel->collTreeUp.depth = channel->collTreeDn.depth = depth;
    INFO(NCCL_GRAPH, "CollNet Channel %d rank %d up %d down %d", comm->nChannels/2+c, rank, channel->collTreeDn.up, channel->collTreeDn.down[0]);
  }
  return ncclSuccess;
}

The sendIndex chosen here is the first rank in the chain and sendEndIndex is the last one; because the last rank would otherwise connect back to sendIndex, both of those links have to be cut. The code then walks over the uplink channels, setting the up of sendIndex to nranks and the down of sendEndIndex to -1 (for example, with localRanks = 8 and sendIndex = 0, sendEndIndex = (0 + 8 - 1) % 8 = 7); the downlink channels are handled the same way using recvIndex/recvEndIndex. The channels now look like Figure 5:
[Figure 5]

Establishing the communication connections

Now the intra-node and inter-node communication connections are established.

if (comm->nNodes > 1 && 
      ncclParamCollNetEnable() == 1 && 
      collNetSupport() && collNetGraph.nChannels) {
    int logicChannels = comm->nChannels/2;
    int collNetSetupFail = 0; 
    const int recvIndex = 0;  // recv GPU index is always 0
    const int sendIndex = collNetGraph.pattern == NCCL_TOPO_PATTERN_TREE ? 0 : 1;  // send GPU index depends on topo pattern
    for (int c=0; c<logicChannels; c++) {
      struct ncclChannel* channelRecv = comm->channels+logicChannels+c;
      struct ncclChannel* channelSend = comm->channels+c;
      NCCLCHECK(ncclTransportP2pSetup(comm, &collNetGraph, channelRecv, 1, &channelRecv->collTreeDn.up, 1, channelRecv->collTreeDn.down));
      NCCLCHECK(ncclTransportP2pSetup(comm, &collNetGraph, channelSend, 1, channelSend->collTreeUp.down, 1, &channelSend->collTreeUp.up));
      const int recvMaster = collNetGraph.intra[c*comm->localRanks+recvIndex];
      const int sendMaster = collNetGraph.intra[c*comm->localRanks+sendIndex];
      if (collNetSetup(comm, &collNetGraph, channelRecv, rank, nranks, recvMaster, sendMaster, comm->nNodes, 1) != 1)
        collNetSetupFail = 1; 
      else if (collNetSetup(comm, &collNetGraph, channelSend, rank, nranks, sendMaster, recvMaster, comm->nNodes, 0) != 1)
        collNetSetupFail = 1; 
    }    
    // Verify CollNet setup across ranks
    NCCLCHECK(checkCollNetSetup(comm, rank, collNetSetupFail));
  }

Take the uplink channel, i.e. channelSend, as an example: ncclTransportP2pSetup establishes the intra-node connections, and because the up/down of sendIndex and sendEndIndex were set to nranks or -1, only the connections shown by the arrows in Figure 5 are actually created.
Then the inter-node connections are established. The first rank in the chain on each machine handles network send/receive and is called that node's master. A SHARP communication group is then set up, and it contains only the master of each node.

static int collNetSetup(struct ncclComm* comm, struct ncclTopoGraph* collNetGraph, struct ncclChannel* channel, int rank, int nranks,  int masterRank, int masterPeer, int nMasters, int type) {
  int rankInCollNet = -1;
  int supported = 0;
  int isMaster = (rank == masterRank) ? 1 : 0;
  struct {
    int collNetRank;
    ncclConnect connect;
  } sendrecvExchange;

  // check if we can connect to collnet, whose root is the nranks-th rank
  struct ncclPeerInfo *myInfo = comm->peerInfo+rank, *peerInfo = comm->peerInfo+nranks;
  peerInfo->rank = nranks;
  int ret = 1;
  if (isMaster) {
    NCCLCHECK(collNetTransport.canConnect(&ret, comm->topo, collNetGraph, myInfo, peerInfo));
  }

  // send master receives connect info from peer recv master
  ...

  // select
  struct ncclPeer* root = channel->peers+nranks;
  struct ncclConnector* conn = (type == 1) ? &root->recv : &root->send;
  struct ncclTransportComm* transportComm = (type == 1) ? &(collNetTransport.recv) : &(collNetTransport.send);
  conn->transportComm = transportComm;
  // setup
  struct ncclConnect myConnect;
  if (isMaster && ret > 0) {
    NCCLCHECK(transportComm->setup(comm->topo, collNetGraph, myInfo, peerInfo, &myConnect, conn, channel->id));
  }
  ...
}

First look at collNetSetup executed for the recv master. isMaster indicates whether the current rank is the recv master. Then peerInfo is set up; the peerInfo used here is the nranks-th ncclPeerInfo in comm, i.e. the extra entry that represents the collnet root. canConnect then checks whether a collNet connection is possible; it simply returns 1, meaning the connection is allowed.

Then transportComm->setup is executed to initialize the resources used for communication.

ncclResult_t collNetRecvSetup(struct ncclTopoSystem* topo, struct ncclTopoGraph* graph, struct ncclPeerInfo* myInfo, struct ncclPeerInfo* peerInfo, struct ncclConnect* connectInfo, struct ncclConnector* recv, int channelId) {
  struct collNetRecvResources* resources;
  NCCLCHECK(ncclCalloc(&resources, 1));
  recv->transportResources = resources;

  NCCLCHECK(ncclTopoGetNetDev(topo, myInfo->rank, graph, channelId, &resources->netDev));
  NCCLCHECK(ncclTopoCheckGdr(topo, myInfo->busId, resources->netDev, 0, &resources->useGdr));

  NCCLCHECK(ncclCudaHostCalloc(&resources->hostSendMem, 1));
  resources->devHostSendMem = resources->hostSendMem;

  int recvSize = offsetof(struct ncclRecvMem, buff);
  for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) recvSize += recv->comm->buffSizes[p];

  if (resources->useGdr) {
    NCCLCHECK(ncclCudaCalloc((char**)(&resources->devRecvMem), recvSize));
  }
  NCCLCHECK(ncclCudaHostCalloc((char**)&resources->hostRecvMem, recvSize));
  resources->devHostRecvMem = resources->hostRecvMem;

  NCCLCHECK(ncclIbMalloc((void**)&(resources->llData), recv->comm->buffSizes[NCCL_PROTO_LL]/2));
  struct collNetRecvConnectInfo* info = (struct collNetRecvConnectInfo*) connectInfo;
  NCCLCHECK(collNetListen(resources->netDev, &info->collNetHandle, &resources->netListenComm));
  return ncclSuccess;
}

This allocates the collNetRecvResources, then allocates the buffers used by each protocol along with the head/tail counters used for synchronization, and finally calls collNetListen.
The assignments here are a bit convoluted, so let's first look at what resources will look like after collNetListen has run. collNetRecvResources is the structure NCCL itself uses; in the end its netListenComm points to an ncclSharpListenComm, whose listenCommP2P points to an ncclIbListenComm, which stores the NIC in use and the fd of the listen socket.

struct collNetRecvResources {
  void* netListenComm;   // ncclSharpListenComm
  ...
};
struct ncclSharpListenComm {
  int   dev;
  void *listenCommP2P;  // ncclIbListenComm
};

struct ncclIbListenComm {
  int dev;
  int fd; 
};

Now let's look at the details.

ncclResult_t ncclSharpListen(int dev, void* opaqueHandle, void** listenComm) {
  struct ncclSharpListenComm *lComm;
  ncclResult_t status;

  NCCLCHECK(ncclIbMalloc((void**)&lComm, sizeof(struct ncclSharpListenComm)));
  status = NCCL_PLUGIN_SYMBOL.listen(dev, opaqueHandle, &lComm->listenCommP2P);
  lComm->dev = dev;
  *listenComm = lComm;
  return status;
}

collNetListen executes ncclSharpListen, which simply calls the ib_plugin's listen function. Here you can see that the netListenComm of collNetRecvResources is assigned the ncclSharpListenComm lComm.

ncclResult_t ncclIbListen(int dev, void* opaqueHandle, void** listenComm) {
  struct ncclIbListenComm* comm;
  comm = malloc(sizeof(struct ncclIbListenComm));
  memset(comm, 0, sizeof(struct ncclIbListenComm));
  struct ncclIbHandle* handle = (struct ncclIbHandle*) opaqueHandle;
  NCCL_STATIC_ASSERT(sizeof(struct ncclIbHandle) < NCCL_NET_HANDLE_MAXSIZE, "ncclIbHandle size too large");
  comm->dev = dev;
  NCCLCHECK(GetSocketAddr(&(handle->connectAddr)));
  NCCLCHECK(createListenSocket(&comm->fd, &handle->connectAddr));
  *listenComm = comm;
  return ncclSuccess;
}

The ib_plugin's listen just creates a listening socket and stores the NIC index dev and the socket fd in an ncclIbListenComm; the listenCommP2P of the ncclSharpListenComm lComm is set to this ncclIbListenComm. The IP and port are stored in opaqueHandle, i.e. myConnect.
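createListenSocket itself is not shown here. A minimal sketch of what it is assumed to do (open a TCP socket, bind it to the requested address, typically with an ephemeral port, start listening, and write the actually bound port back so the caller can publish it in the handle) might look like the following; this is not the plugin's actual code.

#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hedged sketch: open a listening TCP socket on *addr and write the bound
 * address/port back into *addr so the caller can ship it in the connect handle. */
static int sketchCreateListenSocket(int* fd, struct sockaddr_in* addr) {
  int sock = socket(AF_INET, SOCK_STREAM, 0);
  if (sock < 0) return -1;
  int one = 1;
  setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
  if (bind(sock, (struct sockaddr*)addr, sizeof(*addr)) < 0 || listen(sock, 16384) < 0) {
    close(sock);
    return -1;
  }
  socklen_t len = sizeof(*addr);
  getsockname(sock, (struct sockaddr*)addr, &len);  /* fill in the ephemeral port */
  *fd = sock;
  return 0;
}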
At this point setup is finished; let's continue with collNetSetup.

static int collNetSetup(struct ncclComm* comm, struct ncclTopoGraph* collNetGraph, struct ncclChannel* channel, int rank, int nranks,  int masterRank, int masterPeer, int nMasters, int type) {
  ...
  // prepare connect handles
  ncclResult_t res;
  struct {
    int isMaster;
    ncclConnect connect;
  } *allConnects = NULL;
  ncclConnect *masterConnects = NULL;
  NCCLCHECK(ncclCalloc(&masterConnects, nMasters));
  if (type == 1) {  // recv side: AllGather
    // all ranks must participate
    NCCLCHECK(ncclCalloc(&allConnects, nranks));
    allConnects[rank].isMaster = isMaster;
    memcpy(&(allConnects[rank].connect), &myConnect, sizeof(struct ncclConnect));
    NCCLCHECKGOTO(bootstrapAllGather(comm->bootstrap, allConnects, sizeof(*allConnects)), res, cleanup);
    // consolidate
    int c = 0;
    for (int r = 0; r < nranks; r++) {
      if (allConnects[r].isMaster) {
        memcpy(masterConnects+c, &(allConnects[r].connect), sizeof(struct ncclConnect));
        if (r == rank) rankInCollNet = c;
        c++;
      }
    }
  } else { // send side : copy in connect info received from peer recv master
    if (isMaster) memcpy(masterConnects+rankInCollNet, &(sendrecvExchange.connect), sizeof(struct ncclConnect));
  }
  // connect
  if (isMaster && ret > 0) {
    NCCLCHECKGOTO(transportComm->connect(masterConnects, nMasters, rankInCollNet, conn), res, cleanup);
    struct ncclPeer* devRoot = channel->devPeers+nranks;
    struct ncclConnector* devConn = (type == 1) ? &devRoot->recv : &devRoot->send;
    CUDACHECKGOTO(cudaMemcpy(devConn, conn, sizeof(struct ncclConnector), cudaMemcpyHostToDevice), res, cleanup);
  }
  ...
}

Next the masters' information is exchanged. nMasters ncclConnect entries are allocated as masterConnects, where nMasters is the number of nodes. myConnect is copied into the current rank's slot of allConnects, and a bootstrap allgather collects the ncclConnect of every rank; the entries belonging to masters are then consolidated into masterConnects, and finally transportComm->connect is executed to finish building the SHARP communication group.

As before, let's first look at what the data structures in resources will look like after connect has run. collNetRecvComm points to an ncclSharpCollComm; its recvComm and sendComm play a role similar to the connections to the previous and next node in the bootstrap network, sharpCollContext is the SHARP context, and sharpCollComm is the SHARP communicator.

struct collNetRecvResources {
  ...
  void* collNetRecvComm;  // ncclSharpCollComm
  ...
};
struct ncclSharpCollComm {
  ...
  void*  recvComm;  // ncclIbRecvComm
  void*  sendComm; // ncclIbSendComm
  ...
  struct sharp_coll_context* sharpCollContext;
  struct sharp_coll_comm*    sharpCollComm;  
};

ncclResult_t collNetRecvConnect(struct ncclConnect* connectInfos, int nranks, int rank, struct ncclConnector* recv) {
  // Setup device pointers
  struct collNetRecvResources* resources = (struct collNetRecvResources*)recv->transportResources;
  struct collNetSendConnectInfo* info = (struct collNetSendConnectInfo*)(connectInfos+rank);
  resources->collNetRank = rank;

  // Intermediate buffering on GPU for GPU Direct RDMA
  struct ncclRecvMem* recvMem = resources->useGdr ? resources->devRecvMem : resources->devHostRecvMem;
  int offset = 0;
  for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) {
    recv->conn.buffs[p] = (p == NCCL_PROTO_LL ? resources->devHostRecvMem->buff : recvMem->buff) + offset;
    offset += recv->comm->buffSizes[p];
  }
  recv->conn.direct |= resources->useGdr ? NCCL_DIRECT_NIC : 0;

  // Head/Tail/Opcount are always on host
  recv->conn.tail = &resources->devHostRecvMem->tail;
  recv->conn.head = &resources->devHostSendMem->head;

  // Connect to coll comm
  collNetHandle_t** handlePtrs = NULL;
  NCCLCHECK(ncclCalloc(&handlePtrs, nranks));
  for (int i = 0; i < nranks; i++) {
    struct collNetRecvConnectInfo* info = (struct collNetRecvConnectInfo*)(connectInfos+i);
    handlePtrs[i] = &(info->collNetHandle);
  }
  ncclResult_t res;
  NCCLCHECKGOTO(collNetConnect((void**)handlePtrs, nranks, rank, resources->netListenComm, &resources->collNetRecvComm), res, cleanup);

  // Register buffers
  NCCLCHECK(collNetRegMr(resources->collNetRecvComm, recv->conn.buffs[NCCL_PROTO_SIMPLE], recv->comm->buffSizes[NCCL_PROTO_SIMPLE],
        resources->useGdr ? NCCL_PTR_CUDA : NCCL_PTR_HOST, &resources->mhandles[NCCL_PROTO_SIMPLE]));
  NCCLCHECK(collNetRegMr(resources->collNetRecvComm, resources->llData, recv->comm->buffSizes[NCCL_PROTO_LL]/2,
        NCCL_PTR_HOST, &resources->mhandles[NCCL_PROTO_LL]));

  // Create shared info between send and recv proxies
  NCCLCHECK(ncclCalloc(&(resources->reqFifo), NCCL_STEPS));

  // Pass info to send side
  info->reqFifo = resources->reqFifo;
  info->collNetComm = resources->collNetRecvComm;
  for (int p=0; p<NCCL_NUM_PROTOCOLS; p++)
    info->mhandles[p] = resources->mhandles[p];

cleanup:
  if (handlePtrs != NULL) free(handlePtrs);
  // Close listen comm
  NCCLCHECK(collNetCloseListen(resources->netListenComm));

  return res;
}

First the head, tail, buffers, and so on from resources are recorded into conn, and handlePtrs collects the listen IP/port handle of every master rank; then collNetConnect is executed to build the SHARP communication group.

ncclResult_t ncclSharpConnect(void* handles[], int nranks, int rank, void* listenComm, void** collComm) {
  struct ncclSharpListenComm* lComm = (struct ncclSharpListenComm*)listenComm;
  struct ncclSharpCollComm* cComm;
  NCCLCHECK(ncclIbMalloc((void**)&cComm, sizeof(struct ncclSharpCollComm)));
  NCCLCHECK(ncclIbMalloc((void**)&cComm->reqs, sizeof(struct ncclSharpRequest)*MAX_REQUESTS));

  cComm->nranks = nranks;
  cComm->rank = rank;
  if (cComm->rank == -1) {
    WARN("Could not determine my rank\n");
    return ncclInternalError;
  }
  int next = (cComm->rank + 1) % nranks;
  NCCLCHECK(NCCL_PLUGIN_SYMBOL.connect(lComm->dev, handles[next], &cComm->sendComm));
  NCCLCHECK(NCCL_PLUGIN_SYMBOL.accept(lComm->listenCommP2P, &cComm->recvComm)); // From prev

  struct ncclSharpInfo* allInfo;
  pid_t pid = getpid();
  pthread_t tid = pthread_self();
  NCCLCHECK(ncclIbMalloc((void**)&allInfo, sizeof(struct ncclSharpInfo)*nranks));
  allInfo[cComm->rank].hostId = gethostid();
  allInfo[cComm->rank].jobId = (((uint64_t)allInfo[cComm->rank].hostId << 32) | ((pid ^ tid) ^ rand()));
  NCCLCHECK(ncclSharpAllGather(cComm, allInfo, sizeof(struct ncclSharpInfo)));

  // Find my local rank;
  int localRank = 0;
  for (int i=0; i<cComm->rank; i++) {
    if (allInfo[cComm->rank].hostId == allInfo[i].hostId) {
      localRank++;
    }   
  }
  uint64_t jobId = allInfo[0].jobId;
  free(allInfo);
  ...
}

An ncclSharpCollComm cComm is created; it will eventually be assigned to collNetRecvComm. Similar to the bootstrap network, the ib_plugin is used to connect all the master ranks head-to-tail into a ring; the connect and accept logic is exactly the same as in ncclNetIb described earlier, so it is not repeated here. Then an ncclSharpInfo array allInfo is created, the hostId is recorded, a random jobId is generated, and an allgather is performed.
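ncclSharpAllGather itself is not shown above. Assuming it is a plain ring allgather over the sendComm/recvComm pair that was just created (the transfer callback below is a hypothetical stand-in; the real plugin drives NCCL_PLUGIN_SYMBOL.isend/irecv and polls test), the idea is roughly:

#include <stddef.h>

/* Hypothetical blocking transfer callback over the ib_plugin comms. */
typedef int (*xferFn)(void* comm, void* data, int size);

/* Ring allgather sketch: `data` holds nranks slices of sliceSize bytes and
 * rank r starts with only slice r filled in. In step i each rank forwards the
 * slice it obtained in step i-1 to the next rank and receives a new slice
 * from the previous rank; after nranks-1 steps everyone has all slices.
 * Shown with blocking calls for brevity; the real implementation posts
 * irecv/isend and polls completion so the ring does not stall. */
static int sketchRingAllGather(void* sendComm, void* recvComm,
                               xferFn sendFn, xferFn recvFn,
                               int rank, int nranks, char* data, int sliceSize) {
  for (int i = 0; i < nranks - 1; i++) {
    int sendSlice = (rank - i + nranks) % nranks;          /* slice to forward */
    int recvSlice = (rank - i - 1 + 2 * nranks) % nranks;  /* slice to receive */
    if (sendFn(sendComm, data + (size_t)sendSlice * sliceSize, sliceSize) != 0) return -1;
    if (recvFn(recvComm, data + (size_t)recvSlice * sliceSize, sliceSize) != 0) return -1;
  }
  return 0;
}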

ncclResult_t ncclSharpConnect(void* handles[], int nranks, int rank, void* listenComm, void** collComm) {
  ...
  struct sharp_coll_init_spec init_spec = {0};
  init_spec.progress_func  = NULL;
  init_spec.job_id = jobId;
  init_spec.world_rank = cComm->rank;
  init_spec.world_size = nranks;
  init_spec.world_local_rank = 0;
  init_spec.enable_thread_support = 1;
  init_spec.group_channel_idx = 0;

  init_spec.oob_colls.barrier = ncclSharpOobBarrier;
  init_spec.oob_colls.bcast = ncclSharpOobBcast;
  init_spec.oob_colls.gather = ncclSharpOobGather;
  init_spec.oob_ctx = cComm;

  init_spec.config = sharp_coll_default_config;
  init_spec.config.user_progress_num_polls = 10000000;

  char devName[MAXNAMESIZE];
  ncclNetProperties_t prop;
  ncclSharpGetProperties(lComm->dev, &prop);
  snprintf(devName, MAXNAMESIZE, "%s:%d", prop.name, prop.port);
  init_spec.config.ib_dev_list = devName;

  int ret = sharp_coll_init(&init_spec, &cComm->sharpCollContext);

  INFO(NCCL_INIT, "Sharp rank %d/%d initialized on %s", cComm->rank, nranks, devName);

  if (ret < 0) {
    WARN("NET/IB :SHARP coll init error: %s(%d)\n", sharp_coll_strerror(ret), ret);
    return ncclInternalError;
  }

  struct sharp_coll_comm_init_spec comm_spec;
  comm_spec.rank = cComm->rank;
  comm_spec.size = nranks;
  comm_spec.oob_ctx = cComm;
  comm_spec.group_world_ranks = NULL;

  ret = sharp_coll_comm_init(cComm->sharpCollContext, &comm_spec, &cComm->sharpCollComm);
  if (ret < 0) {
    WARN("SHARP group create failed: %s(%d)\n", sharp_coll_strerror(ret), ret);
    return ncclInternalError;
  }

  *collComm = cComm;
  return ncclSuccess;
}

A sharp_coll_init_spec init_spec is created to initialize the SHARP communication context sharpCollContext. init_spec is filled in: job_id is set to rank 0's jobId, the rank and size are set, oob_ctx of init_spec.oob_colls is set to cComm, and oob_colls.barrier and friends are set (oob_colls is analogous to NCCL's bootstrap network); the NIC to use is also specified. Then sharp_coll_init is executed, which initializes SHARP, and sharp_coll_comm_init uses sharpCollContext to initialize sharpCollComm. At this point the SHARP communication group has been established.

Back in collNetRecvConnect after ncclSharpConnect returns, memory registration begins: the buffers are registered with SHARP via sharp_coll_reg_mr, and they also need to be registered for RDMA via ibv_reg_mr. Finally the reqFifo is allocated, and the reqFifo and collNetRecvComm are recorded into info, which will later be sent to the send side.

ncclResult_t collNetRecvConnect(struct ncclConnect* connectInfos, int nranks, int rank, struct ncclConnector* recv) {
  ...
  // Register buffers
  NCCLCHECK(collNetRegMr(resources->collNetRecvComm, recv->conn.buffs[NCCL_PROTO_SIMPLE], recv->comm->buffSizes[NCCL_PROTO_SIMPLE],
        resources->useGdr ? NCCL_PTR_CUDA : NCCL_PTR_HOST, &resources->mhandles[NCCL_PROTO_SIMPLE]));
  NCCLCHECK(collNetRegMr(resources->collNetRecvComm, resources->llData, recv->comm->buffSizes[NCCL_PROTO_LL]/2,
        NCCL_PTR_HOST, &resources->mhandles[NCCL_PROTO_LL]));

  // Create shared info between send and recv proxies
  NCCLCHECK(ncclCalloc(&(resources->reqFifo), NCCL_STEPS));

  // Pass info to send side
  info->reqFifo = resources->reqFifo;
  info->collNetComm = resources->collNetRecvComm;
  for (int p=0; p<NCCL_NUM_PROTOCOLS; p++)
    info->mhandles[p] = resources->mhandles[p];

cleanup:
  if (handlePtrs != NULL) free(handlePtrs);
  // Close listen comm
  NCCLCHECK(collNetCloseListen(resources->netListenComm));

  return res;
}
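collNetRegMr maps to ncclSharpRegMr in the plugin. As the text above notes, each buffer ends up being registered twice: once with SHARP via sharp_coll_reg_mr and once with the RDMA NIC via the ib_plugin's regMr (ultimately ibv_reg_mr). The sketch below only illustrates that dual-registration shape; the callback types and the mem-handle struct are hypothetical stand-ins, not the plugin's real types.

#include <stddef.h>
#include <stdlib.h>

/* Hedged sketch of dual registration: the same buffer is registered with
 * SHARP and with the RDMA NIC, and both handles are kept in one mem handle. */
typedef int (*sharpRegFn)(void* sharpCtx, void* buf, size_t size, void** sharpMr);
typedef int (*ibRegFn)(void* ibComm, void* buf, size_t size, int type, void** ibMr);

struct sketchMemHandle {
  void* sharpMr;  /* handle later passed to the SHARP allreduce spec          */
  void* ibMr;     /* handle used by the ib_plugin for regular RDMA traffic    */
  int   type;     /* NCCL_PTR_HOST or NCCL_PTR_CUDA                           */
};

static int sketchCollNetRegMr(void* sharpCtx, void* ibComm,
                              sharpRegFn sharpReg, ibRegFn ibReg,
                              void* buf, size_t size, int type, void** mhandle) {
  struct sketchMemHandle* mh = calloc(1, sizeof(*mh));
  if (mh == NULL) return -1;
  mh->type = type;
  if (sharpReg(sharpCtx, buf, size, &mh->sharpMr) != 0 ||
      ibReg(ibComm, buf, size, type, &mh->ibMr) != 0) {
    free(mh);
    return -1;
  }
  *mhandle = mh;
  return 0;
}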

Back in collNetSetup, the recv side sends info to the send side, i.e. the address of reqFifo and collNetRecvComm.

static int collNetSetup(struct ncclComm* comm, struct ncclTopoGraph* collNetGraph, struct ncclChannel* channel, int rank, int nranks,  int masterRank, int masterPeer, int nMasters, int type) {
  ...
  // recv side sends connect info to send side
  if (isMaster && type == 1) {
    sendrecvExchange.collNetRank = rankInCollNet;
    memcpy(&sendrecvExchange.connect, masterConnects+rankInCollNet, sizeof(struct ncclConnect));
    NCCLCHECKGOTO(bootstrapSend(comm->bootstrap, masterPeer, &sendrecvExchange, sizeof(sendrecvExchange)), res, cleanup);
  }
  if (ret > 0) {
    supported = 1;
  }
cleanup:
  if (allConnects != NULL) free(allConnects);
  if (masterConnects != NULL) free(masterConnects);
  return supported;
}

Then collNetSetup is executed for the send side. The send side is simpler: it mainly allocates and registers memory, and then uses the info sent over by the recv side to record the reqFifo and the collnet comm from info into collNetSendResources.

At this point the communication connections are fully established. Next let's look at how an API call is executed.

enqueue

As mentioned earlier, after the user calls an allreduce API, the enqueue operation is performed.

ncclResult_t ncclSaveKernel(struct ncclInfo* info) {
  struct ncclColl coll;
  struct ncclProxyArgs proxyArgs;
  memset(&proxyArgs, 0, sizeof(struct ncclProxyArgs));
  NCCLCHECK(computeColl(info, &coll, &proxyArgs));

  info->comm->myParams->blockDim.x = std::max<unsigned>(info->comm->myParams->blockDim.x, info->nThreads);

  int nChannels = info->coll == ncclCollSendRecv ? 1 : coll.args.coll.nChannels;
  int nSubChannels = (info->pattern == ncclPatternCollTreeUp || info->pattern == ncclPatternCollTreeDown) ? 2 : 1;

  for (int bid=0; bid<nChannels*nSubChannels; bid++) {
    int channelId = (info->coll == ncclCollSendRecv) ? info->channelId :
      info->comm->myParams->gridDim.x % info->comm->nChannels;
    struct ncclChannel* channel = info->comm->channels+channelId;
    proxyArgs.channel = channel;
    // Adjust pattern for CollNet based on channel index
    if (nSubChannels == 2) {
      info->pattern = (channelId < info->comm->nChannels/nSubChannels) ? ncclPatternCollTreeUp : ncclPatternCollTreeDown;
    }
    NCCLCHECK(ncclProxySaveColl(&proxyArgs, info->pattern, info->root, info->comm->nRanks));
    info->comm->myParams->gridDim.x++;
    int opIndex = channel->collFifoTail;
    struct ncclColl* c = channel->collectives+opIndex;
    volatile uint8_t* activePtr = (volatile uint8_t*)&c->active;
    while (activePtr[0] != 0) sched_yield();

    memcpy(c, &coll, sizeof(struct ncclColl));
    if (info->coll != ncclCollSendRecv) c->args.coll.bid = bid % coll.args.coll.nChannels;

    c->active = 1;
    opIndex = (opIndex+1)%NCCL_MAX_OPS;
    c->nextIndex = opIndex;
    channel->collFifoTail = opIndex;
    channel->collCount++;
  }
  info->comm->opCount++;
  return ncclSuccess;
}

computeColl records the algorithm, protocol, and related information into coll. When SHARP is used the algorithm is NCCL_ALGO_COLLNET, nChannels is set to half the number of searched channels, the pattern is set to ncclPatternCollTreeUp, and info->nstepsPerLoop = info->nchunksPerLoop = 1.
Then coll is recorded into all the channels: uplink channels get the pattern ncclPatternCollTreeUp and downlink channels get ncclPatternCollTreeDown, and ncclProxySaveColl creates the corresponding ncclProxyArgs.

Kernel and proxy

template<int UNROLL, class FUNC, typename T>
__device__ void ncclAllReduceCollNetKernel(struct CollectiveArgs* args) {
  const int tid = threadIdx.x;
  const int nthreads = args->coll.nThreads-WARP_SIZE;
  const int bid = args->coll.bid;
  const int nChannels = args->coll.nChannels;
  struct ncclDevComm* comm = args->comm;
  struct ncclChannel* channel = comm->channels+blockIdx.x;
  const int stepSize = comm->buffSizes[NCCL_PROTO_SIMPLE] / (sizeof(T)*NCCL_STEPS);
  int chunkSize = args->coll.lastChunkSize;
  const ssize_t minChunkSize = nthreads*8*sizeof(uint64_t) / sizeof(T);
  const ssize_t loopSize = nChannels*chunkSize;
  const ssize_t size = args->coll.count;

  if (loopSize > size) {
    chunkSize = DIVUP(size, nChannels*minChunkSize)*minChunkSize;
  }

  // Compute pointers
  const T * __restrict__ thisInput = (const T*)args->sendbuff;
  T * __restrict__ thisOutput = (T*)args->recvbuff;

  if (blockIdx.x < nChannels) { // first half of the channels do reduce
    struct ncclTree* tree = &channel->collTreeUp;
    ncclPrimitives<UNROLL, 1, 1, T, 1, 1, 0, FUNC> prims(tid, nthreads, tree->down, &tree->up, NULL, stepSize, channel, comm);
    for (ssize_t gridOffset = 0; gridOffset < size; gridOffset += loopSize) {
      // Up
      ssize_t offset = gridOffset + bid*chunkSize;
      int nelem = min(chunkSize, size-offset);
      if (tree->up == -1) {
        prims.recvReduceCopy(thisInput+offset, thisOutput+offset, nelem);
      } else if (tree->down[0] == -1) {
        prims.send(thisInput+offset, nelem);
      } else {
        prims.recvReduceSend(thisInput+offset, nelem);
      }
    }
  }
  ...
}

First look at the uplink path. nChannels has already been halved, so every block with blockIdx.x less than nChannels is an uplink channel. If the rank is sendEndIndex, its down is -1, so it simply pushes its own data into the next rank's buffer via send; otherwise it receives the data sent by the previous rank, reduces it with the corresponding data in its own user buffer, and sends the result on to the up peer's buffer via recvReduceSend.

Then look at the send flow of the proxy:

ncclResult_t collNetSendProxy(struct ncclProxyArgs* args) {
  ...
  if (args->state == ncclProxyOpProgress) {
    int p = args->protocol;
    int stepSize = args->connector->comm->buffSizes[p] / NCCL_STEPS;
    char* localBuff = args->connector->conn.buffs[p];
    void* sendMhandle = resources->sendMhandles[p];
    void* recvMhandle = resources->recvMhandles[p];
    args->idle = 1;
    struct reqSlot* reqFifo = resources->reqFifo;
    if (args->head < args->end) {
      int buffSlot = args->tail%NCCL_STEPS;
      if (args->tail < args->end && args->tail < args->head + NCCL_STEPS
          && reqFifo[buffSlot].recvBuff != NULL) {
        volatile int* sizesFifo = resources->hostRecvMem->sizesFifo;
        volatile uint64_t* recvTail = &resources->hostRecvMem->tail;
        if (args->protocol == NCCL_PROTO_LL) {
        } else if (args->tail < *recvTail) {
          // Send through network
          if (sizesFifo[buffSlot] != -1) {
            int count = sizesFifo[buffSlot]/ncclTypeSize(args->dtype);
            NCCLCHECK(collNetIallreduce(resources->collNetSendComm, localBuff+buffSlot*stepSize, (void*)(reqFifo[buffSlot].recvBuff), count, args->dtype, args->redOp, sendMhandle, recvMhandle, args->requests+buffSlot));
            if (args->requests[buffSlot] != NULL) {
              sizesFifo[buffSlot] = -1; 
              // Make sure size is reset to zero before we update the head.
              __sync_synchronize();
              args->tail += args->sliceSteps;
              args->idle = 0;
            }   
          }   
        }   
      }   
      if (args->head < args->tail) {
        int done, size;
        int buffSlot = args->head%NCCL_STEPS;
        NCCLCHECK(collNetTest((void*)(args->requests[buffSlot]), &done, &size));
        if (done) {
          TRACE(NCCL_NET, "sendProxy [%d/%d] request %p done, size %d", args->head, buffSlot, args->requests[buffSlot], size);
          reqFifo[buffSlot].size = size;
          // Make sure size is updated before we set recvBuff to NULL (from the view of recv proxy, concerning the flush)
          // (reordered store after store is possible on POWER, though not on x86)
          __sync_synchronize();
          reqFifo[buffSlot].recvBuff = NULL; // Notify recvProxy
          args->head += args->sliceSteps;
          resources->hostSendMem->head = args->head;
          args->idle = 0;
        }   
      }   
    }   
    if (args->head == args->end) {
      resources->step = args->end;
      args->idle = 0;
      args->state = ncclProxyOpNone;
    }   
}

collNetIallreduce fills the sendbuff, recvbuff, and their corresponding MRs into a sharp_coll_reduce_spec and then calls sharp_coll_do_allreduce or sharp_coll_do_allreduce_nb; once the operation completes, the reduced result is available in recvbuff.
Here sendbuff is the buff of the send conn and recvbuff is the buff of the recv conn. The send proxy does not know which part of the recv conn's buffer is available, and this is exactly what reqFifo coordinates between the send and recv proxies. That is why the condition for sending checks not only the usual "is there data in the queue" but also whether the recvBuff of the corresponding reqFifo slot is non-NULL; only when both hold can data be sent.
After issuing the operation, tail is advanced by sliceSteps. If head is less than tail, there are allreduce operations that have been issued but not yet completed, so sharp_coll_req_test (via collNetTest) is used to check whether the corresponding request has finished; if it has, head is advanced by sliceSteps and the corresponding recvBuff is set to NULL to notify the recv proxy that this request is done.
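The reqFifo that coordinates the two proxies is an array of NCCL_STEPS small host-memory slots. Their layout is essentially the struct below (matching how the excerpts access reqFifo[buffSlot].recvBuff and .size; treat it as paraphrased rather than quoted), and the comment block summarizes the per-slot handshake described above and in the recv proxy that follows.

/* Slot structure shared by the send and recv proxies. */
struct reqSlot {
  volatile void* recvBuff;  /* posted by the recv proxy; reset to NULL by the
                               send proxy once the allreduce for this slot is
                               done, so it doubles as the handshake flag      */
  volatile int size;        /* written by the send proxy before clearing
                               recvBuff, read by the recv proxy for the flush */
};

/* Per-slot handshake (buffSlot = step % NCCL_STEPS):
 *   recv proxy: reqFifo[buffSlot].recvBuff = recvBuff + buffSlot*stepSize
 *   send proxy: waits for recvBuff != NULL and kernel data (tail < *recvTail),
 *               then collNetIallreduce(localBuff + buffSlot*stepSize,
 *                                      reqFifo[buffSlot].recvBuff, ...)
 *   send proxy: when collNetTest reports done, size is filled in and
 *               recvBuff is set back to NULL
 *   recv proxy: sees recvBuff == NULL, runs collNetFlush, then advances
 *               hostRecvMem->tail so the kernel can consume the result       */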

Now look at the recv proxy:

ncclResult_t collNetRecvProxy(struct ncclProxyArgs* args) {
    if (args->head < args->end) {
      if ((args->tail < args->head + NCCL_STEPS) && (args->tail < (resources->hostSendMem->head) + NCCL_STEPS) && (args->tail < args->end)) {
        int buffSlot = args->tail%NCCL_STEPS;
        char* recvBuff = p == NCCL_PROTO_LL ? (char*)resources->llData : localBuff;
        int recvStepSize = p == NCCL_PROTO_LL ? stepSize/2 : stepSize;
        reqFifo[buffSlot].recvBuff = recvBuff+buffSlot*recvStepSize;
        TRACE(NCCL_NET, "recvProxy [%d/%d] posted buffer %p", args->tail, buffSlot, reqFifo[buffSlot].recvBuff);
        args->tail += args->sliceSteps;
        args->idle = 0;
      }   
      if (args->tail > args->head) {
        int buffSlot = args->head%NCCL_STEPS;
        if (reqFifo[buffSlot].recvBuff == NULL) { // Buffer is cleared : coll is complete
          TRACE(NCCL_NET, "recvProxy [%d/%d] done, size %d", args->head, buffSlot, reqFifo[buffSlot].size);
          args->head += args->sliceSteps;
          if (args->protocol == NCCL_PROTO_LL) { // ll
          } else if (args->protocol == NCCL_PROTO_SIMPLE) {
            if (resources->useGdr) NCCLCHECK(collNetFlush(resources->collNetRecvComm, localBuff+buffSlot*stepSize, reqFifo[buffSlot].size, mhandle));
            resources->hostRecvMem->tail = args->head;
          }   
          args->idle = 0;
        }
      }
    }
}

As long as the queue has space, the recv proxy posts the corresponding recvBuff into reqFifo. If tail is greater than head, there are outstanding requests, so it checks whether the corresponding recvBuff is NULL; NULL means the request has completed, so head is advanced by sliceSteps and collNetFlush is executed to make sure the data has actually landed in GPU memory. This flush works the same way as in ncclNetIb, i.e. an RDMA read on the local QP. After the flush, resources->hostRecvMem->tail is set to head to notify the kernel that new data is available.

Finally, look at the recv kernel:

template<int UNROLL, class FUNC, typename T>
__device__ void ncclAllReduceCollNetKernel(struct CollectiveArgs* args) {
  ...
  if (blockIdx.x >= nChannels) { // second half of the channels do broadcast
    struct ncclTree* tree = &channel->collTreeDn;
    ncclPrimitives<UNROLL, 1, 1, T, 1, 1, 0, FUNC> prims(tid, nthreads, &tree->up, tree->down, NULL, stepSize, channel, comm);
    for (ssize_t gridOffset = 0; gridOffset < size; gridOffset += loopSize) {
      // Down
      ssize_t offset = gridOffset + bid*chunkSize;
      int nelem = min(chunkSize, size-offset);
      if (tree->up == -1) {
        prims.send(thisOutput+offset, nelem);
      } else if (tree->down[0] == -1) {
        prims.recv(thisOutput+offset, nelem);
      } else {
        prims.recvCopySend(thisOutput+offset, nelem);
      }   
    }   
  }
}

If the rank is recvEndIndex, it only needs to recv the data; otherwise it uses recvCopySend to receive the data from the up peer's buffer, copy it into the corresponding position of its own user buffer, and send it on to the down peer's buffer.
