I am trying to reproduce the open-source project https://github.com/macanv/BERT-BiLSTM-CRF-NER and have hit what looks like a version incompatibility between tensorflow-gpu, cuDNN, and CUDA. Details below:
- Running nvidia-smi in the base environment reports that the command is not found:
Running nvitop works and shows the GPU normally:
Running nvcc -V prints:
From this, the GPU driver version is 470.199.02 and the CUDA version is 11.4.
- Versions of the packages in the environment:
The TensorFlow version follows the README of the git project, so I'd rather not change it. The cudatoolkit and cudnn listed above were installed automatically when installing tensorflow-gpu==1.12.0. The documented version dependencies I found are:
Could there be a version mismatch among these?
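To double-check from inside the conda environment, a small script along these lines (TF 1.x API; my own sketch, not part of the project) can confirm whether the installed TensorFlow build includes CUDA support and can actually initialize the GPU:

```python
# Sanity check inside the conda env (TF 1.x / tensorflow-gpu==1.12.0 API).
import tensorflow as tf

print(tf.VERSION)                                # expected: 1.12.0
print(tf.test.is_built_with_cuda())              # True if this build was compiled with CUDA
print(tf.test.is_gpu_available(cuda_only=True))  # actually runs a tiny op on the GPU
```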
- Main symptoms
I downloaded the project to my machine without any modification; the first run produced the following error:
totalMemory: 23.70GiB freeMemory: 23.45GiB
2024-07-01 14:45:52.995573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2024-07-01 14:46:32.655609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-07-01 14:46:32.655637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2024-07-01 14:46:32.655643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2024-07-01 14:46:32.655769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22732 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:5e:00.0, compute capability: 8.6)
2024-07-01 14:47:35.319078: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
2024-07-01 14:47:35.321609: I tensorflow/stream_executor/stream.cc:2076] [stream=0x137c5e60,impl=0x137c5f00] did not wait for [stream=0x17edbb60,impl=0x1378c680]
2024-07-01 14:47:35.321668: I tensorflow/stream_executor/stream.cc:5011] [stream=0x137c5e60,impl=0x137c5f00] did not memcpy device-to-host; source: 0x7fd8d8251400
2024-07-01 14:47:35.321761: F tensorflow/core/common_runtime/gpu/gpu_util.cc:292] GPU->CPU Memcpy failed
The second run then produced this error instead:
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(202, 2), b.shape=(2, 768), m=202, n=768, k=2
[[node bert/embeddings/MatMul (defined at /home/dell/下载/enter/envs/TY_NER_tf1/lib/python3.6/site-packages/bert_base-0.0.9-py3.6.egg/bert_base/bert/modeling.py:486) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/embeddings/one_hot, bert/embeddings/token_type_embeddings/read)]]
[[{{node crf_loss/Mean/_4075}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3726_crf_loss/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
What I can rule out: GPU memory is sufficient (24 GB), the batch size is small enough (the error persists even with batch size 1), rebooting does not help, and the code itself is fine (others have reproduced the project successfully).
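To separate the project code from the GPU stack, the failing op can be reduced to a bare GEMM of the same shapes. A minimal standalone test would look roughly like this (my own sketch using the TF 1.x API; the shapes match the failing bert/embeddings/MatMul):

```python
# Minimal standalone GEMM test (TF 1.x API) with the same shapes as the
# failing bert/embeddings/MatMul: (202, 2) x (2, 768).
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(202, 2), dtype=tf.float32)
b = tf.constant(np.random.rand(2, 768), dtype=tf.float32)

with tf.device("/device:GPU:0"):
    c = tf.matmul(a, b)  # runs cublasSgemm under the hood

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True  # do not grab all 24 GB up front

with tf.Session(config=config) as sess:
    print(sess.run(c).shape)  # expect (202, 768) if cuBLAS works
```

If this fails with the same CUBLAS_STATUS_EXECUTION_FAILED, the problem lies in the tensorflow-gpu/CUDA/cuDNN stack rather than in the project; if it succeeds, the issue is specific to the project's code or data pipeline.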