1. Query the GPU model: lspci | grep "NVIDIA\|VGA"
PCI Devices
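The lookup in step 1 can be wrapped in a small helper so it can be reused in scripts. This is a minimal sketch, assuming the default `lspci` output format; the function name `filter_gpu_lines` is made up for illustration.

```shell
#!/bin/sh
# filter_gpu_lines: hypothetical helper (not part of any tool).
# Reads `lspci` output on stdin and keeps only the lines that mention
# an NVIDIA device or a VGA controller.
filter_gpu_lines() {
    grep -E 'NVIDIA|VGA'
}

# Query the real hardware only when lspci is installed.
if command -v lspci >/dev/null 2>&1; then
    lspci | filter_gpu_lines || echo "no NVIDIA or VGA device found"
fi
```

Compute cards often show up as "3D controller" rather than "VGA", which is why the pattern matches the vendor name as well.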
2. Download the driver
Official Drivers | NVIDIA
3. Install
sudo sh NVIDIA-Linux-x86_64-440.118.02.run -no-x-check -no-nouveau-check -no-opengl-files
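Parameter notes:
-no-x-check #skip the check for a running X server during installation
-no-nouveau-check #skip the check for a loaded nouveau driver (this does not disable nouveau)
-no-opengl-files #install only the driver files, not the OpenGL files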
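Because the flags above skip the installer's own checks, it can be useful to look at the system state by hand first. The following is a minimal sketch, assuming the standard `lsmod` column layout; `module_loaded` is a hypothetical helper name.

```shell
#!/bin/sh
# module_loaded NAME: hypothetical helper.
# Reads `lsmod` output on stdin and succeeds (exit 0) if a module called
# NAME appears in the first column; the header line is skipped.
module_loaded() {
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Report whether nouveau is currently loaded, if lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | module_loaded nouveau; then
        echo "nouveau is loaded; the installer's nouveau check would complain"
    else
        echo "nouveau is not loaded"
    fi
fi
```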
4. Query GPU information: nvidia-smi
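For scripted checks, `nvidia-smi` can also print selected fields as CSV. The sketch below formats that output into one line per GPU; the helper name `gpu_summary` and the choice of fields are assumptions made for illustration.

```shell
#!/bin/sh
# gpu_summary: hypothetical helper.
# Reads lines of the form "index, name, driver_version, memory.total"
# (the CSV layout requested below) and prints a one-line summary per GPU.
gpu_summary() {
    awk -F', ' '{ printf "GPU %s: %s (driver %s, %s)\n", $1, $2, $3, $4 }'
}

# Query the real driver only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,driver_version,memory.total \
               --format=csv,noheader | gpu_summary
fi
```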
5. Install nvidia-docker2
5.1 CentOS online installation
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum install -y nvidia-docker2
5.2 Ubuntu online installation
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
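Both online installs build the repository URL from the `distribution` string, so it is worth checking that value before running curl. This is a minimal sketch; `distribution_id` is a hypothetical helper that takes the os-release path as an optional argument so it can be tried on a sample file.

```shell
#!/bin/sh
# distribution_id [FILE]: hypothetical helper.
# Sources an os-release style file (default /etc/os-release) in a subshell,
# so ID and VERSION_ID do not leak into the caller, and prints them joined,
# e.g. ID=ubuntu and VERSION_ID="20.04" give "ubuntu20.04".
distribution_id() {
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Show the value and the repository list URL that would be fetched from it.
if [ -r /etc/os-release ]; then
    distribution=$(distribution_id)
    echo "distribution: $distribution"
    echo "list URL: https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list"
fi
```

If the printed URL returns a 404, the repository has no entry for that exact release string.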
5.3 Offline installation
Package location: base/nvidia-docker2.tar.gz
5.4 Configure /etc/docker/daemon.json [note: replace IP with your registry address]
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "insecure-registries": ["IP:5000"],
  "registry-mirrors": ["USTC Open Source Software Mirror"]
}
5.5 Restart docker: systemctl restart docker
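After editing daemon.json it helps to confirm which default runtime the file actually declares before and after the restart. The sketch below is a plain text match, not a JSON parser, and `default_runtime` is a hypothetical helper name.

```shell
#!/bin/sh
# default_runtime FILE: hypothetical helper.
# Prints the value of the "default-runtime" key found in FILE using a
# simple sed pattern (assumes the key and value sit on one line).
default_runtime() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Inspect the real configuration only when it exists and is readable.
if [ -r /etc/docker/daemon.json ]; then
    echo "default runtime in daemon.json: $(default_runtime /etc/docker/daemon.json)"
fi

# After the restart, Docker itself should report the same runtime:
#   docker info | grep -i 'default runtime'
```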
6. Install the vGPU plugin [skip this step if the node has a single GPU]
6.1 helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler
6.2 helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.19.9 -n kube-system
6.3 Check the plugin pods
kubectl get pods -n kube-system
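The pod list in kube-system is long, so a filter for plugin pods that are not yet Running can save some scanning. This is a minimal sketch, assuming the pod names contain "vgpu" (the release name used in the helm command) and the default `kubectl get pods` column layout; `vgpu_pods_not_running` is a hypothetical helper.

```shell
#!/bin/sh
# vgpu_pods_not_running: hypothetical helper.
# Reads `kubectl get pods` output on stdin (columns NAME READY STATUS ...)
# and prints "name status" for every pod whose name contains "vgpu" and
# whose STATUS is anything other than Running.
vgpu_pods_not_running() {
    awk 'NR > 1 && $1 ~ /vgpu/ && $3 != "Running" { print $1, $3 }'
}

# Query the cluster only when kubectl is installed; errors are silenced
# so the sketch stays quiet on machines without cluster access.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get pods -n kube-system 2>/dev/null | vgpu_pods_not_running
fi
```

Empty output means every matching pod reports Running, or that no pod matched the name pattern.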
7. Other related configuration
7.1 Add a label to the node
Troubleshooting: some NVIDIA modules are already loaded in the kernel
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel (CSDN blog)
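A common way to approach this error is to see which NVIDIA modules are still loaded and unload them before rerunning the installer. The sketch below only lists them; `loaded_nvidia_modules` is a hypothetical helper, and the unload order in the comments is an assumption based on the usual module dependencies, not something taken from the linked post.

```shell
#!/bin/sh
# loaded_nvidia_modules: hypothetical helper.
# Reads `lsmod` output on stdin and prints the name of every loaded
# module whose name starts with "nvidia", one per line.
loaded_nvidia_modules() {
    awk 'NR > 1 && $1 ~ /^nvidia/ { print $1 }'
}

# List the modules on this machine, if lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | loaded_nvidia_modules
fi

# Assumed follow-up (run manually, after stopping anything that uses the GPU):
#   sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
# Dependent modules are listed first so that "nvidia" is removed last.
```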