While training a large model on a single machine with multiple GPUs, the run suddenly crashed with:
3%|▎ | 146/4992 [2:08:21<72:57:12, 54.20s/it][2024-05-10 13:27:11,479] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGHUP death signal, shutting down workers
[2024-05-10 13:27:11,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46635 closing signal SIGHUP
[2024-05-10 13:27:11,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46636 closing signal SIGHUP
[2024-05-10 13:27:11,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46637 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/wangguisen/miniconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 868, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 46460 got signal: 1
My environment:
- torch: 2.2
- Python: 3.10
I had launched the job in the background with nohup:
nohup sh ./dk/multi_run_demo.sh &
After going through the related GitHub issues (linked at the end), the fix is to run the job inside screen or tmux instead of relying on nohup. nohup makes the shell ignore SIGHUP before starting the process, but torchrun's elastic agent installs its own SIGHUP handler (the _terminate_process_handler visible in the traceback above), so when the terminal that started the job goes away, the delivered SIGHUP still takes down the launcher and all of its workers.
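The failure mode can be reproduced without torch at all. The sketch below (plain POSIX sh, no real training involved) installs its own SIGHUP handler the same way torchrun's elastic agent does, then sends itself SIGHUP:

```shell
# Toy reproduction of the failure (plain POSIX sh, no torch required):
# like torchrun's elastic agent, this script installs its own SIGHUP handler,
# so it shuts down on SIGHUP no matter how it was launched.
sh -c 'trap "echo SIGHUP received, shutting down; exit 1" HUP; kill -HUP $$; echo training continues'
# prints: SIGHUP received, shutting down
# ("training continues" is never reached)
```

Because the process registers its own handler, the ignore-SIGHUP disposition that nohup sets up no longer protects it.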
- Install tmux:
sudo apt-get install tmux # Ubuntu
sudo yum install tmux # CentOS
Typing tmux at the command line drops you into a tmux window. The commonly used commands are:
- List all tmux sessions:
tmux ls
- Create a new session:
tmux new -s [session-name]
- Detach from the current session and return to the original shell:
tmux detach
- Re-attach to a session:
  - by index:
tmux attach -t 0
  - by name:
tmux attach -t [session-name]
- Kill a session:
  - by index:
tmux kill-session -t 0
  - by name:
tmux kill-session -t [session-name]
  - or simply type exit inside the session
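The create/detach/kill cycle above can also be driven entirely from outside a session. A small sketch (the session name demo is arbitrary, and tmux send-keys types a command into the session's shell):

```shell
# Run a command inside a detached tmux session without ever attaching.
tmux new-session -d -s demo                                   # create session "demo" already detached
tmux send-keys -t demo 'echo hello > /tmp/tmux_demo.txt' Enter  # type a command into it
sleep 1                                                       # give the command a moment to finish
tmux ls                                                       # "demo" shows up in the session list
tmux kill-session -t demo                                     # tear the session down
cat /tmp/tmux_demo.txt                                        # → hello
```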
Keyboard shortcuts all start with the prefix
ctrl(control)+b
; press it first, then:
- Shortcut help: ctrl+b, then ?; press Esc to exit the help screen
- List all tmux sessions: ctrl+b, then s
- Detach and return to the original shell, leaving the job running in the background: ctrl+b, then d
Our shell script is as follows:
CUDA_VISIBLE_DEVICES=0,1,2 accelerate launch \
--config_file yamls/accelerate_single_config.yaml \
src/train.py yamls/qwen_lora_sft_multi_gpu_demo.yaml \
> weights/run.log 2>&1
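A quick sanity check of the CUDA_VISIBLE_DEVICES prefix (assumes python3 is on PATH; this only shows that the variable reaches the launched process with the value we set, not that the GPUs exist):

```shell
# CUDA_VISIBLE_DEVICES restricts which GPUs the child process can see;
# inside the process, the variable holds exactly what was passed on the command line.
CUDA_VISIBLE_DEVICES=0,1,2 python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'
# → 0,1,2
```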
The script is named multi_run_demo.sh; make it executable:
chmod +x ./dk/multi_run_demo.sh
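To confirm the permission change took effect (shown here on a throwaway file rather than the real script path):

```shell
# chmod +x adds the execute bit; test -x checks for it.
touch demo_perm.sh
chmod +x demo_perm.sh
[ -x demo_perm.sh ] && echo "executable"   # → executable
rm demo_perm.sh
```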
Then create a new tmux session:
tmux new -s multi_run_demo
and run the script in the new tmux window:
./dk/multi_run_demo.sh
The training job is now running. Use the detach shortcut to get back to the original shell: press ctrl(control)+b, then d.
Whenever you want to check on the run, re-attach with:
tmux attach -t multi_run_demo
Additionally, you can pass accelerate's
--main_process_port XXX
flag to bind the main process to a different port.
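If you are unsure which ports are free, one option (a hypothetical helper, assuming python3 is available) is to let the OS hand out an unused port and pass that along:

```shell
# Bind to port 0 so the kernel picks an unused port, then reuse that number.
PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "$PORT"   # some free port number
# accelerate launch --main_process_port "$PORT" ...   (rest of the command as in the script above)
```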
ref:
https://github.com/pytorch/pytorch/issues/76894
https://github.com/hiyouga/ChatGLM-Efficient-Tuning/issues/72