Kernel Performance Analysis with Nsight Compute

For performance analysis, NVIDIA provides two Nsight tools: a system-level profiler (Nsight Systems) and a kernel-level profiler (Nsight Compute).

Kernel-level profiling shows how well a kernel implementation performs and where its bottlenecks are.

The kernel implementations analyzed here come from:

https://github.com/siboehm/SGEMM_CUDA

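This post compares cuBLAS with a naive CUDA kernel.
The cuBLAS implementation can be treated as close to optimal, while the naive CUDA kernel has no optimizations for global memory access, shared memory, or registers.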
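To give a feel for what the missing shared-memory optimization costs, here is a rough count of global-memory loads. It is only a sketch under stated assumptions: square N x N matrices, one thread per output element, caches ignored, and a hypothetical tile size that divides N. The function names are illustrative and do not come from the SGEMM_CUDA repository.

```python
def naive_global_loads(n: int) -> int:
    """Naive kernel: each of the n*n threads reads a full row of A and a
    full column of B from global memory, i.e. 2*n loads per thread."""
    return n * n * 2 * n


def tiled_global_loads(n: int, tile: int) -> int:
    """Shared-memory tiling: a block of tile*tile threads loads two tiles
    per k-step and shares them, so each loaded value is reused `tile` times."""
    assert n % tile == 0, "sketch assumes the tile size divides n"
    blocks = (n // tile) ** 2
    k_steps = n // tile
    loads_per_block = k_steps * 2 * tile * tile
    return blocks * loads_per_block


n, tile = 4096, 32  # hypothetical problem and tile sizes
naive = naive_global_loads(n)
tiled = tiled_global_loads(n, tile)
print(f"naive: {naive} loads, tiled: {tiled} loads, ratio: {naive // tiled}x")
```

Under these assumptions the load count drops by a factor equal to the tile size, which is one reason an optimized kernel can keep its arithmetic units far busier than the naive one.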

Follow the steps below to run the Nsight Compute analysis.

step 1: enable the usage of ncu

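Follow reference [1] below to resolve the ERR_NVGPUCTRPERM permission issue, so that ncu can access the GPU performance counters.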

step 2: generate report
```
ncu -o profile_matrix --set full ./a.out
```
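When a binary launches many kernels, a full-set report can grow large. The sketch below only assembles an `ncu` command line and prints it; nothing is executed. It assumes the `--kernel-name` and `--launch-count` options of the Nsight Compute CLI (see reference [2]), and that the naive kernel is named `sgemm_naive` as shown later in the report. The helper `build_ncu_command` and the report name `profile_naive` are hypothetical.

```python
import shlex


def build_ncu_command(binary, report, kernel=None, launch_count=None):
    """Return the argv list for an ncu run (this helper is illustrative only)."""
    cmd = ["ncu", "-o", report, "--set", "full"]
    if kernel is not None:
        # Restrict profiling to kernels with this name.
        cmd += ["--kernel-name", kernel]
    if launch_count is not None:
        # Stop collecting after this many profiled kernel launches.
        cmd += ["--launch-count", str(launch_count)]
    cmd.append(binary)
    return cmd


# Same command as in step 2.
print(shlex.join(build_ncu_command("./a.out", "profile_matrix")))
# Hypothetical variant: profile only the first launch of sgemm_naive.
print(shlex.join(build_ncu_command("./a.out", "profile_naive",
                                   kernel="sgemm_naive", launch_count=1)))
```

Passing the list to `subprocess.run` would start the profiler; it is left out here so the sketch runs on machines without a GPU.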

step 3: open the report in Nsight Compute

Reference:

[1] https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters

[2] https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html

[3] https://www.bilibili.com/video/BV15P4y1R7VG/?share_source=copy_web&vd_source=afbf8b20dbc63173f95b2d83f262a108

Once profiling finishes, load the report in the locally installed Nsight Compute GUI.

The two kernels differ in block size and grid size. The gap in measured performance is larger still: cycle count, duration, and compute throughput all differ widely.

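These numbers show the following:

The memory throughput of the two kernels is similar, but their compute throughput differs greatly. This suggests that the naive kernel leaves a lot of compute capacity unused, and that the gap is not simply a matter of the kernel being memory-bound.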

Next, look at the detailed metrics of the two kernels, sgemm_naive and volta_sgemm_128x64_nn (the latter being the cuBLAS kernel).

Floating Point operations roofline

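Comparing the two roofline charts, the naive kernel falls far short of the utilization reached by the optimized kernel. The vertical axis is logarithmic, so the real gap is larger than it looks on the chart.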
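The roofline bound itself is simple arithmetic, sketched below. The peak FLOP rate and memory bandwidth are hypothetical placeholders, not values measured in this report; substitute the numbers Nsight Compute shows for your GPU. The second helper illustrates why a logarithmic axis understates a gap: two points one grid decade apart differ by a factor of 10.

```python
import math


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof.

    intensity is in FLOP/byte, bandwidth in GB/s, result in GFLOP/s.
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


def decades_apart(fast, slow):
    """Distance between two values on a log10 axis."""
    return math.log10(fast / slow)


PEAK_GFLOPS = 10_000.0  # hypothetical FP32 peak
BANDWIDTH_GBS = 500.0   # hypothetical DRAM bandwidth

# Intensity at which the memory roof meets the compute roof.
ridge = PEAK_GFLOPS / BANDWIDTH_GBS
print("ridge point (FLOP/byte):", ridge)

# Left of the ridge a kernel is limited by bandwidth, right of it by compute.
for ai in (2.0, 100.0):
    print(ai, attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS))

# Two points that look "two grid lines apart" differ by 100x.
print(decades_apart(5000.0, 50.0))
```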

Arithmetic intensity
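Arithmetic intensity is the ratio of floating-point operations to bytes moved from memory. As a rough sketch, the best case for single-precision GEMM can be estimated by assuming that A and B are each read once and C is written once; this is an idealized lower bound on traffic, not what the naive kernel achieves. The function name and the matrix size are illustrative.

```python
def sgemm_arithmetic_intensity(m: int, n: int, k: int) -> float:
    """Ideal FLOP/byte for C = A @ B in FP32.

    Assumes one multiply and one add per inner-product term, and that
    every matrix element crosses the memory bus exactly once.
    """
    flops = 2 * m * n * k
    bytes_moved = 4 * (m * k + k * n + m * n)  # 4 bytes per float
    return flops / bytes_moved


size = 4096  # hypothetical square matrix size
print(f"ideal intensity at {size}: {sgemm_arithmetic_intensity(size, size, size):.1f} FLOP/byte")
```

For square matrices the ideal intensity grows linearly with the size (it equals N/6), so large GEMMs should sit well to the right of the ridge point. A kernel measured far to the left of that is re-reading data it could have reused.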

For the naive kernel, SM busy is only 3.14%.

The next chart shows that FMA and ALU utilization also differ markedly between the two kernels.

Memory workload

Notably, Nsight Compute is not fully satisfied even with the cuBLAS implementation. Its assessment reads:

The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only accesses an average of 2.1 sectors out of the possible 4 sectors per cache line. Check the Source Counters section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory request.
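The figures in that message translate directly into a utilization ratio. The short sketch below uses only the numbers quoted above (32-byte sectors, 4 sectors per 128-byte cache line, 2.1 sectors accessed on average); the helper names are illustrative.

```python
SECTOR_BYTES = 32     # sector size stated in the message
SECTORS_PER_LINE = 4  # 128-byte cache line / 32-byte sectors


def sector_utilization(avg_sectors: float) -> float:
    """Fraction of each requested cache line that is actually accessed."""
    return avg_sectors / SECTORS_PER_LINE


def bytes_used_per_line(avg_sectors: float) -> float:
    """Average number of bytes accessed out of each 128-byte line."""
    return avg_sectors * SECTOR_BYTES


avg = 2.1  # value reported by Nsight Compute for the cuBLAS kernel
print(f"utilization: {sector_utilization(avg):.1%}")
print(f"bytes used per 128-byte line: {bytes_used_per_line(avg):.1f}")
```

So roughly half of each cache line requested from L2 is used, which is why the tool still flags the access pattern.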

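Comparing the charts above shows that:

1. The number of instructions that load data from global memory into the kernel or the L1 cache differs markedly between the two kernels.

2. The optimized kernel makes effective use of shared memory.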

Summary

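Nsight Compute provides detailed tooling for evaluating kernel performance, but making sense of its many metrics requires an understanding of GPU hardware architecture and how it operates. For CUDA programming, it matters even more to understand how shared memory and registers work, so that kernels are well implemented from the start and do not depend too heavily on the profiler afterwards.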

Reference links:

1. https://www.alcf.anl.gov/sites/default/files/2022-10/nvidia_profiling_tools_keipert_10_4_22.pdf

2. Kernel Profiling Guide, Nsight Compute 12.5 documentation
