gpu-manager安装及测试

提示:GPU-manager安装为主部分内容做了升级开箱即用,有用请点收藏❤抱拳

文章目录

  • 前言
  • 一、约束条件
  • 二、使用步骤
    • 1.下载镜像
        • 1.1 查看当前虚拟机的驱动类型:
    • 2.部署gpu-manager
    • 3.部署gpu-admission
    • 4.修改kube-scheduler.yaml![在这里插入图片描述](https://img-blog.csdnimg.cn/f76aab5b5bfc4f9c80eb4cd4721efeee.png)
      • 4.1 新建/etc/kubernetes/scheduler-policy-config.json
      • 4.2 新建/etc/kubernetes/scheduler-extender.yaml
      • 4.3 修改/etc/kubernetes/manifests/kube-scheduler.yaml
    • 4.1 结果查看
    • 测试
  • 总结


前言

本文只做开箱即用部分,想了解GPUManager虚拟化方案技术层面请直接点击:GPUmanager虚拟化方案


一、约束条件

1、虚拟机需要完成直通式绑定,也就是物理GPU与虚拟机绑定,我做的是hyper-v的虚拟机绑定参照上一篇文章
2、对于k8s要求1.10版本以上
3、GPU-Manager 要求集群内包含 GPU 机型节点
4、每张 GPU 卡一共有100个单位的资源,仅支持0 - 1的小数卡,以及1的倍数的整数卡设置。显存资源是以256MiB为最小的一个单位的分配显存
我的版本:k8s-1.20

二、使用步骤

1.下载镜像

镜像地址:https://hub.docker.com/r/tkestack/gpu-manager/tags
manager:docker pull tkestack/gpu-manager:v1.1.5
https://hub.docker.com/r/tkestack/gpu-quota-admission/tags
admission:docker pull tkestack/gpu-quota-admission:v1.0.0

1.1 查看当前虚拟机的驱动类型:

docker info 

在这里插入图片描述

2.部署gpu-manager

拥有GPU节点打标签:

kubectl label node XX nvidia-device-enable=enable

如果docker驱动是systemd 需要在yaml指定,因为GPUmanager默认cgroupfs
在这里插入图片描述
创建yaml内容如下:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-role
subjects:
  - kind: ServiceAccount
    name: gpu-manager
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run node has gpu device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: tkestack/gpu-manager:v1.1.5
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: "--cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root
          hostPath:
            type: Directory
            path: /root/.kube

执行yaml文件:

kubectl apply -f gpu-manager.yaml
kubectl get pod -A|grep  gpu 查询结果

3.部署gpu-admission

创建yaml内容如下:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-admission
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-kube-scheduler
subjects:
  - kind: ServiceAccount
    name: gpu-admission
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-volume-scheduler
subjects:
  - kind: ServiceAccount
    name: gpu-admission
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-daemon-set-controller
subjects:
  - kind: ServiceAccount
    name: gpu-admission
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:controller:daemon-set-controller
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
    app: gpu-admission
  name: gpu-admission
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: gpu-admission
      containers:
        - image: thomassong/gpu-admission:47d56ae9
          name: gpu-admission
          env:
            - name: LOG_LEVEL
              value: "4"
          ports:
            - containerPort: 3456
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      priority: 2000000000
      priorityClassName: system-cluster-critical
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-admission
  namespace: kube-system
spec:
  ports:
    - port: 3456
      protocol: TCP
      targetPort: 3456
  selector:
    app: gpu-admission
  type: ClusterIP

执行yaml文件:

kubectl create -f gpu-admission.yaml
kubectl get pod -A|grep  gpu 查询结果

4.修改kube-scheduler.yaml在这里插入图片描述

4.1 新建/etc/kubernetes/scheduler-policy-config.json

创建内容:

vim /etc/kubernetes/scheduler-policy-config.json
复制如下内容:
{
    "kind": "Policy",
    "apiVersion": "v1",
    "predicates": [
        {
            "name": "PodFitsHostPorts"
        },
        {
            "name": "PodFitsResources"
        },
        {
            "name": "NoDiskConflict"
        },
        {
            "name": "MatchNodeSelector"
        },
        {
            "name": "HostName"
        }
    ],
    "priorities": [
        {
            "name": "BalancedResourceAllocation",
            "weight": 1
        },
        {
            "name": "ServiceSpreadingPriority",
            "weight": 1
        }
    ],
    "extenders": [
        {
            "urlPrefix": "http://gpu-admission.kube-system:3456/scheduler",
            "apiVersion": "v1beta1",
            "filterVerb": "predicates",
            "enableHttps": false,
            "nodeCacheCapable": false
        }
    ],
    "hardPodAffinitySymmetricWeight": 10,
    "alwaysCheckAllPredicates": false
}

4.2 新建/etc/kubernetes/scheduler-extender.yaml

创建内容:

vim /etc/kubernetes/scheduler-extender.yaml
复制如下内容:
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.conf"
algorithmSource:
  policy:
    file:
      path: "/etc/kubernetes/scheduler-policy-config.json"


4.3 修改/etc/kubernetes/manifests/kube-scheduler.yaml

修改内容:

vim /etc/kubernetes/manifests/kube-scheduler.yaml
复制如下内容:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - command:
        - kube-scheduler
        - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
        - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
        - --bind-address=0.0.0.0
        - --feature-gates=TTLAfterFinished=true,ExpandCSIVolumes=true,CSIStorageCapacity=true,RotateKubeletServerCertificate=true
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --leader-elect=true
        - --port=0
        - --config=/etc/kubernetes/scheduler-extender.yaml
      image: registry.cn-beijing.aliyuncs.com/kubesphereio/kube-scheduler:v1.22.10
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 8
        httpGet:
          path: /healthz
          port: 10259
          scheme: HTTPS
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 15
      name: kube-scheduler
      resources:
        requests:
          cpu: 100m
      startupProbe:
        failureThreshold: 24
        httpGet:
          path: /healthz
          port: 10259
          scheme: HTTPS
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 15
      volumeMounts:
        - mountPath: /etc/kubernetes/scheduler.conf
          name: kubeconfig
          readOnly: true
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
        - mountPath: /etc/kubernetes/scheduler-extender.yaml
          name: extender
          readOnly: true
        - mountPath: /etc/kubernetes/scheduler-policy-config.json
          name: extender-policy
          readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
    - hostPath:
        path: /etc/kubernetes/scheduler.conf
        type: FileOrCreate
      name: kubeconfig
    - hostPath:
        path: /etc/localtime
        type: File
      name: localtime
    - hostPath:
        path: /etc/kubernetes/scheduler-extender.yaml
        type: FileOrCreate
      name: extender
    - hostPath:
        path: /etc/kubernetes/scheduler-policy-config.json
        type: FileOrCreate
      name: extender-policy
status: {}

修改内容入下:
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
修改完成k8s自动重启,如果没有重启执行 kubectl delete pod -n [podname]

4.1 结果查看

执行命令:

 kubectl describe node master[节点名称]

在这里插入图片描述

测试

镜像下载:docker pull gaozhenhai/tensorflow-gputest:0.2
创建yaml内容: vim vcuda-test.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
        - command:
            - sleep
            - 360000s
          env:
            - name: PATH
              value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
          image: gaozhenhai/tensorflow-gputest:0.2
          imagePullPolicy: IfNotPresent
          name: tensorflow-test
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
              tencent.com/vcuda-core: "50"
              tencent.com/vcuda-memory: "32"
            requests:
              cpu: "4"
              memory: 8Gi
              tencent.com/vcuda-core: "50"
              tencent.com/vcuda-memory: "32"

启动yaml:kubectl apply -f vcuda-test.yaml
进入容器:

kubectl exec -it `kubectl get pods -o name | cut -d '/' -f2` -- bash

执行测试命令:

cd /data/tensorflow/cifar10 && time python cifar10_train.py

查看结果:

执行命令:nvidia-smi pmon -s u -d 1、命令查看GPU资源使用情况

总结

到此vgpu容器层虚拟化全部完成

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/67523.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

siMLPe:Human Motion Prediction

Back to MLP: A Simple Baseline for Human Motion Prediction解析 摘要1. 简介2. Related Work2.1 基于RNN的人体运动预测2.2 基于GCN的人体运动预测2.3 基于 Attention 的人类运动预测2.4 总结 3. siMLPe3.1 离散余弦变换(Discrete Cosine Transform (DCT)&#x…

国产芯力特SIT1024QHG四通道本地互联网络(LIN)收发器,可替代TJA1024HG

SIT1024Q 是一款四通道本地互联网络(LIN)物理层收发器,符合 LIN 2.0、LIN 2.1、LIN 2.2、 LIN 2.2A 、 ISO 17987-4:2016 (12V) 和 SAE J2602 标准。主要适用于使用 1kbps 至 20kbps 传输速 率的车载网络。 SIT1024Q 通过 TXDx 引…

【LeetCode】打家劫舍||

打家劫舍|| 题目描述算法分析编程代码 链接: 打家劫舍|| 在做这个题之前&#xff0c;建议大家做一下这个链接: 按摩师 我的博客里也有这个题的讲解&#xff0c;名字是按摩师 题目描述 算法分析 编程代码 class Solution { public:int maxrob(vector<int>nums,int left,…

【C++】C语言基础部分知识点总结 (指针,函数,内存,关键字,预处理等)(秋招篇)

文章目录 前言讲一下32位系统常用数据类型的字节大小&#xff08;stm32f103为例&#xff09;讲一些C/C中常见的库什么是易变变量&#xff1f;代码的转化和构建通常会经历哪几个步骤&#xff1a;&#xff08;预处理&#xff0c;编译&#xff0c;汇编&#xff0c;链接&#xff09…

Elastic Stack 8.9:更快的跨集群搜索和指标聚合

作者&#xff1a;Tyler Perkins, Gilad Gal, Teresa Soler, Shani Sagiv, Bernhard Suhm, George Kobar Elastic Stack 8.9 在多个方面实现了显着的性能改进&#xff1a;Kibana 中更快的跨集群搜索、Elasticsearch 更快的聚合&#xff0c;以及更快、更相关的向量搜索&#xff0…

Android Studio实现刮刮卡效果

代码和刮刮乐图片参考网络 实现效果 MainActivity import android.app.Activity; import android.os.Bundle;public class MainActivity extends Activity {Overrideprotected void onCreate(Bundle savedInstanceState) {super.onCreate(savedInstanceState);setContentVi…

【Java可执行程序命令】学习路线攻略,史诗级别全汇总 ~

Java可执行程序命令学习路线攻略 &#x1f4d7;文章指路Java可执行命令1、编译工具 javac2、程序启动工具 java3、API文档生成 javadoc4、反编译工具 javap5、打包部署工具 jar6、调试工具 jdb7、C头文件创建 javah8、JWS应用程序启动 javaws9、安装包创建 javapackager10、JAR…

Vue3状态管理库Pinia——自定义持久化插件

个人简介 &#x1f440;个人主页&#xff1a; 前端杂货铺 &#x1f64b;‍♂️学习方向&#xff1a; 主攻前端方向&#xff0c;正逐渐往全干发展 &#x1f4c3;个人状态&#xff1a; 研发工程师&#xff0c;现效力于中国工业软件事业 &#x1f680;人生格言&#xff1a; 积跬步…

PostgreSQL和MySQL多维度对比

文章目录 0.前言1. 基础对比2.PostgreSQL和MySQL语法对比3. 特性4. 参考文档 0.前言 在当今的软件开发和数据管理领域&#xff0c;数据库是至关重要的基础设施之一。选择正确的数据库管理系统&#xff08;DBMS&#xff09;对于应用程序的性能、可扩展性和数据完整性至关重要。…

SpringBoot集成Logback日志

SpringBoot集成Logback日志 文章目录 SpringBoot集成Logback日志一、什么是日志二、Logback简单介绍三、SpringBoot项目中使用Logback四、概念介绍一、日志记录器Logger1.1、日志记录器对象生成1.2、记录器的层级结构1.3、过滤器1.4、logger设置日志级别1.5、java代码演示1.6、…

【论文阅读】Deep Instance Segmentation With Automotive Radar Detection Points

基于汽车雷达检测点的深度实例分割 一个区别&#xff1a; automotive radar 汽车雷达 &#xff1a; 分辨率低&#xff0c;点云稀疏&#xff0c;语义上模糊&#xff0c;不适合直接使用用于密集LiDAR点开发的方法 &#xff1b; 返回的物体图像不如LIDAR精确&#xff0c;可以…

API 测试 | 了解 API 接口概念|电商平台 API 接口测试指南

什么是 API&#xff1f; API 是一个缩写&#xff0c;它代表了一个 pplication P AGC 软件覆盖整个房间。API 是用于构建软件应用程序的一组例程&#xff0c;协议和工具。API 指定一个软件程序应如何与其他软件程序进行交互。 例行程序&#xff1a;执行特定任务的程序。例程也称…

Matlab滤波、频谱分析

Matlab滤波、频谱分析 滤波&#xff1a; 某目标信号是由5、15、30Hz正弦波混合而成的混合信号&#xff0c;现需要设计一个滤波器滤掉5、30Hz两种频率。 分析&#xff1a;显然我们应该设计一个带通滤波器&#xff0c;通带频率落在15Hz附近。 % 滤波 % 某目标信号是由5、15、3…

LVS工作环境配置

一、LVS-DR工作模式配置 模拟环境如下&#xff1a; 1台客户机 1台LVS负载调度器 2台web服务器 1、环境部署 &#xff08;1&#xff09;LVS负载调度器 yum install -y ipvsadm # 在LVS负载调度器上进行环境安装 ifconfig ens33:200 192.168.134.200/24 # 配置LVS的VIP…

Mir 2.14 正式发布,Ubuntu 使用的 Linux 显示服务器

导读Canonical 公司最近发布了 Mir 2.14&#xff0c;这是该项目的最新版本。 Mir 2.14 在 Wayland 方面通过 ext-session-lock-v1 协议增加了对屏幕锁定器 (screen lockers) 的支持&#xff0c;并最终支持 Wayland 拖放。此外还整合了渲染平台的实现&#xff0c;放弃了之前在 R…

嵌入式领域:人才供需失衡,发展潜力巨大

嵌入式技术正快速发展&#xff0c;ARM处理器、嵌入式操作系统、LINUX等技术助力嵌入式领域崛起。然而&#xff0c;行业新颖且门槛高&#xff0c;缺乏专业指导。因此&#xff0c;嵌入式人才稀缺&#xff0c;身价水涨船高。 未来几年&#xff0c;嵌入式系统将在信息化、智能化、…

【位操作符的几种题型】

位操作符的几种题型 目录 题型一&#xff1a;寻找“单身狗”。 题型二&#xff1a;计算一个数在二进制中1的个数 题型三&#xff1a;不允许创建临时变量&#xff0c;交换两个整数的内容 题型一&#xff1a;寻找“单身狗”。 1.1题目解析 在一个整型数组中&#xff0c;只有…

【数据结构】‘双向链表’冲冲冲

&#x1f490; &#x1f338; &#x1f337; &#x1f340; &#x1f339; &#x1f33b; &#x1f33a; &#x1f341; &#x1f343; &#x1f342; &#x1f33f; &#x1f344;&#x1f35d; &#x1f35b; &#x1f364; &#x1f4c3;个人主页 &#xff1a;阿然成长日记 …

Vue脚手架安装

安装包下载 安装包可以去官网下载&#xff08;官网地址&#xff09;&#xff0c;建议下载稳定版。 2. 选择安装目录 选择安装到一个&#xff0c;没有中文&#xff0c;没有空格的目录下&#xff08;新建一个文件夹NodeJS&#xff09; 3. 验证NodeJS环境变量 NodeJS 安装完…

AIGC大模型ChatGLM2-6B:国产版chatgpt本地部署及体验

1 ChatGLM2-6B介绍 ChatGLM是清华技术成果转化的公司智谱AI研发的支持中英双语的对话机器人。ChatGLM基于GLM130B千亿基础模型训练&#xff0c;它具备多领域知识、代码能力、常识推理及运用能力&#xff1b;支持与用户通过自然语言对话进行交互&#xff0c;处理多种自然语言任务…