Development Environment
macOS has no CUDA support, so computation runs on the mps backend instead; that also means you cannot print a GPU count or model name the way you can with CUDA. The availability check is:
torch.backends.mps.is_available()
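A minimal sketch of using the mps device, assuming an Apple-silicon Mac with a recent PyTorch build:

import torch

# Check the Metal (MPS) backend and run a small op on it; fall back to CPU otherwise.
if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.rand(3, 3, device=device)
    print(x.device)   # mps:0
else:
    print("MPS not available, running on CPU")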
If you use PopOS, you can pick the image that ships with the NVIDIA driver built in, which saves you the whole driver installation process.
EC2 (installing the driver yourself)
sudo apt update
sudo apt install ubuntu-drivers-common -y
ubuntu-drivers devices                          # list the drivers recommended for this GPU
sudo apt install nvidia-driver-535-server -y
sudo modprobe nvidia                            # load the kernel module without rebooting
nvidia-smi                                      # the GPU table shows up if the driver works
sudo apt install nvidia-cuda-toolkit            # provides nvcc
pip3 install torch torchvision torchaudio
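With the packages in place, a quick sanity check from Python (a minimal sketch, assuming the pip install above succeeded):

import torch

print(torch.__version__)
print(torch.cuda.is_available())   # should be True once the 535 driver is loaded
x = torch.rand(2, 3).cuda()        # allocate a tensor on the GPU
print(x.device)                    # cuda:0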
A reminder here: make the root volume large. After installing the NVIDIA driver, nvcc, and PyTorch, I had already used about 15 GB of disk space.
EC2 (Deep Learning AMI)
AWS provides AMIs with the driver already installed; here I picked the PyTorch 2.x flavor.
Once logged into the OS you can see the driver is already in place.
Then switch to the bundled virtual environment to use PyTorch:
conda activate pytorch
import torch
print(torch.cuda.is_available())        # True on the DLAMI
print(torch.cuda.device_count())        # number of visible GPUs
print(torch.cuda.get_device_name(0))    # index must be smaller than device_count()
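From here a typical pattern is to pick the device once and move the model and data onto it; a minimal sketch:

import torch
import torch.nn as nn

# Use the GPU when present, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 2).to(device)
x = torch.rand(4, 10, device=device)
print(model(x).shape)   # torch.Size([4, 2])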
SageMaker notebook
SageMaker studio
TBD
N-docker (NVIDIA Container Toolkit)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh --dry-run     # preview the steps first
sudo sh ./get-docker.sh               # then actually install Docker
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list    # optional: enable the experimental packages
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker    # registers the nvidia runtime in /etc/docker/daemon.json
sudo systemctl restart docker
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.01-py3    # NGC PyTorch container with the GPU passed through
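Once the container starts, the same check works inside it (a sketch, using the image's bundled PyTorch):

import torch

print(torch.cuda.is_available())      # True only if --gpus all exposed the GPU to the container
print(torch.cuda.get_device_name(0))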
Machine learning doesn't just burn through GPUs, it burns through disk too~
References:
https://docs.docker.com/engine/install/ubuntu/
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html