I am trying to reproduce the open-source project https://github.com/macanv/BERT-BiLSTM-CRF-NER and have hit what looks like a version incompatibility between tensorflow-gpu, cuDNN, and CUDA. Details below:
- Running nvidia-smi in the base environment reports that the command is not found:
Running nvitop works and shows the GPU normally:
Running nvcc -V prints:
From this, the GPU driver version is 470.199.02 and the CUDA version is 11.4.
- Versions of the packages in the environment:
The TensorFlow version follows the README of the git project, so I'd rather not change it. The cudatoolkit and cudnn listed above were installed automatically when installing tensorflow-gpu==1.12.0. The documented version dependencies I found are:
Could there be a version mismatch among these?
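To double-check from inside the conda environment, a small script along these lines (TF 1.x API; my own sketch, not part of the project) can confirm whether the installed TensorFlow build includes CUDA support and can actually initialize the GPU:

```python
# Sanity check inside the conda env (TF 1.x / tensorflow-gpu==1.12.0 API).
import tensorflow as tf

print(tf.VERSION)                                # expected: 1.12.0
print(tf.test.is_built_with_cuda())              # True if this build was compiled with CUDA
print(tf.test.is_gpu_available(cuda_only=True))  # actually runs a tiny op on the GPU
```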
- Main symptoms
I downloaded the project to my machine without any modification; the first run produced the following error:
totalMemory: 23.70GiB freeMemory: 23.45GiB
2024-07-01 14:45:52.995573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2024-07-01 14:46:32.655609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-07-01 14:46:32.655637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2024-07-01 14:46:32.655643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2024-07-01 14:46:32.655769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22732 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:5e:00.0, compute capability: 8.6)
2024-07-01 14:47:35.319078: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
2024-07-01 14:47:35.321609: I tensorflow/stream_executor/stream.cc:2076] [stream=0x137c5e60,impl=0x137c5f00] did not wait for [stream=0x17edbb60,impl=0x1378c680]
2024-07-01 14:47:35.321668: I tensorflow/stream_executor/stream.cc:5011] [stream=0x137c5e60,impl=0x137c5f00] did not memcpy device-to-host; source: 0x7fd8d8251400
2024-07-01 14:47:35.321761: F tensorflow/core/common_runtime/gpu/gpu_util.cc:292] GPU->CPU Memcpy failed
The second run then produced this error instead:
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(202, 2), b.shape=(2, 768), m=202, n=768, k=2
[[node bert/embeddings/MatMul (defined at /home/dell/下载/enter/envs/TY_NER_tf1/lib/python3.6/site-packages/bert_base-0.0.9-py3.6.egg/bert_base/bert/modeling.py:486) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/embeddings/one_hot, bert/embeddings/token_type_embeddings/read)]]
[[{{node crf_loss/Mean/_4075}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3726_crf_loss/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
What I can rule out: GPU memory is sufficient (24 GB), the batch size is small enough (the error persists even with batch size 1), rebooting does not help, and the code itself is fine (others have reproduced the project successfully).
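To separate the project code from the GPU stack, the failing op can be reduced to a bare GEMM of the same shapes. A minimal standalone test would look roughly like this (my own sketch using the TF 1.x API; the shapes match the failing bert/embeddings/MatMul):

```python
# Minimal standalone GEMM test (TF 1.x API) with the same shapes as the
# failing bert/embeddings/MatMul: (202, 2) x (2, 768).
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(202, 2), dtype=tf.float32)
b = tf.constant(np.random.rand(2, 768), dtype=tf.float32)

with tf.device("/device:GPU:0"):
    c = tf.matmul(a, b)  # runs cublasSgemm under the hood

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True  # do not grab all 24 GB up front

with tf.Session(config=config) as sess:
    print(sess.run(c).shape)  # expect (202, 768) if cuBLAS works
```

If this fails with the same CUBLAS_STATUS_EXECUTION_FAILED, the problem lies in the tensorflow-gpu/CUDA/cuDNN stack rather than in the project; if it succeeds, the issue is specific to the project's code or data pipeline.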