1, hip cuda kernel 编译概观
编译的文件流:
.hip kernel --(clang++)--> .o
.o --(lld)--> .out
.out --(clang-offload-bundler)--> .hipfb
2,示例 hipcc -###
代码:
__global__ void WWWWW()
{
((int*)0x8888888)[3] = 0x77777;
}
操作过程:
$ hipcc -### param_00.hip --cuda-device-only --offload-arch=gfx906
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-6.0.2/llvm/bin
Configuration file: /opt/rocm-6.0.2/lib/llvm/bin/clang++.cfg
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-unwindlib=libgcc' [-Wunused-command-line-argument]
"/opt/rocm-6.0.2/lib/llvm/bin/clang-17" \
"-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" \
"x86_64-unknown-linux-gnu" "-emit-obj" "-disable-free" \
"-clear-ast-before-backend" "-disable-llvm-verifier" \
"-discard-value-names" "-main-file-name" "param_00.hip" \
"-mrelocation-model" "pic" "-pic-level" "2" "-fhalf-no-semantic-interposition" \
"-mframe-pointer=none" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" \
"x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" \
"-fvisibility=hidden" "-fapply-global-visibility-to-externs" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/hip.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ocml.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ockl.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_daz_opt_off.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_finite_only_off.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_wavefrontsize64_on.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_isa_version_906.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_abi_version_500.bc" \
"-target-cpu" "gfx906" \
"-debugger-tuning=gdb" "-resource-dir" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0" \
"-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include/cuda_wrappers" \
"-idirafter" "/opt/rocm-6.0.2/lib/llvm/bin/../../../include" "-include" \
"__clang_hip_runtime_wrapper.h" "-c-isystem" "/opt/rocm-6.0.2/llvm/include/gpu-none-llvm" \
"-isystem" "/opt/rocm-6.0.2/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
"-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
"-internal-isystem" "/usr/local/include" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
"-internal-externc-isystem" "/usr/include" "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
"-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
"-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fno-autolink" \
"-fdebug-compilation-dir=/home/hipper/ex_amd_gpu_compiler/ex/parameters_hip/param_00" \
"-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" \
"-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all=true" "-mllvm" \
"-amdgpu-function-calls=false" "-cuid=3e1885b9958b336f" "-fcuda-allow-variadic-functions" "-faddrsig" \
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" \
"-o" "/tmp/param_00-gfx906-a7e858.o" \
"-x" "hip" "param_00.hip"
"/opt/rocm-6.0.2/llvm/bin/lld" \
"-flavor" "gnu" "-m" "elf64_amdgpu" \
"--no-undefined" "-shared" \
"-plugin-opt=-amdgpu-internalize-symbols" \
"-plugin-opt=mcpu=gfx906" \
"-plugin-opt=O3" "--lto-CGO3" \
"-plugin-opt=-amdgpu-early-inline-all=true" \
"-plugin-opt=-amdgpu-function-calls=false" \
"--whole-archive" \
"-o" "/tmp/param_00-gfx906-65b179.out" \
"/tmp/param_00-gfx906-a7e858.o" \
"--no-whole-archive"
"/opt/rocm-6.0.2/llvm/bin/clang-offload-bundler" \
"-type=o" "-bundle-align=4096" \
"-targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx906" \
"-input=/dev/null" \
"-input=/tmp/param_00-gfx906-65b179.out" \
"-output=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb"
涉及到了三个命令:
clang++ -o xxx.o
lld -o xxx.out
clang-offload-bundler -output=xxx.hipfb
3,分析中间文件
3.1 clang++ 编译生成的 .o
.o 是一个elf文件
这个.o 是使用自己编译出来的clang++ 编译的,
其中,将 cp /opt/rocm/bin/clang++.cfg local_amdgpu/llvm/bin/
稍作路径调整,编译生成 .o :
$ "../../../../local_amdgpu/bin/clang-19" \
"-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" \
"x86_64-unknown-linux-gnu" "-emit-obj" "-disable-free" \
"-clear-ast-before-backend" "-disable-llvm-verifier" \
"-discard-value-names" "-main-file-name" "param_00.hip" \
"-mrelocation-model" "pic" "-pic-level" "2" "-fhalf-no-semantic-interposition" \
"-mframe-pointer=none" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" \
"x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" \
"-fvisibility=hidden" "-fapply-global-visibility-to-externs" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/hip.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ocml.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ockl.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_daz_opt_off.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_finite_only_off.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_wavefrontsize64_on.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_isa_version_906.bc" \
"-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_abi_version_500.bc" \
"-target-cpu" "gfx906" \
"-debugger-tuning=gdb" "-resource-dir" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0" \
"-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include/cuda_wrappers" \
"-idirafter" "/opt/rocm-6.0.2/lib/llvm/bin/../../../include" "-include" \
"__clang_hip_runtime_wrapper.h" "-c-isystem" "/opt/rocm-6.0.2/llvm/include/gpu-none-llvm" \
"-isystem" "/opt/rocm-6.0.2/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
"-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
"-internal-isystem" "/usr/local/include" \
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
"-internal-externc-isystem" "/usr/include" "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
"-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
"-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fno-autolink" \
"-fdebug-compilation-dir=/home/hipper/ex_amd_gpu_compiler/ex/parameters_hip/param_00" \
"-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" \
"-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all=true" "-mllvm" \
"-amdgpu-function-calls=false" "-cuid=3e1885b9958b336f" "-fcuda-allow-variadic-functions" "-faddrsig" \
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" \
"-o" "./param_00-gfx906-a7e858.o" \
"-x" "hip" "param_00.hip"
3.2 lld 链接得到的.out 文件
也属于 elf 文件,但是是DYN (Shared object file) 类型,不再是 relocationable 类型。
其生成命令也是稍作了路径改动:
"../../../../local_amdgpu/bin/lld" \
"-flavor" "gnu" "-m" "elf64_amdgpu" \
"--no-undefined" "-shared" \
"-plugin-opt=-amdgpu-internalize-symbols" \
"-plugin-opt=mcpu=gfx906" \
"-plugin-opt=O3" "--lto-CGO3" \
"-plugin-opt=-amdgpu-early-inline-all=true" \
"-plugin-opt=-amdgpu-function-calls=false" \
"--whole-archive" \
"-o" "./param_00-gfx906-65b179.out" \
"./param_00-gfx906-a7e858.o" \
"--no-whole-archive"
3.3 hipfb 文件
这个类型的文件 是由 clang-offload-bundler 打包而成,这里没有什么新意,对 clang-offload-bundler做一个介绍:
clang-offload-bundler 是一个工具,它是 Clang/LLVM 编译器工具链的一部分,用于支持在多种设备上进行异构计算。这个工具主要用于处理和打包不同目标设备(如 CPU、GPU、FPGA 等)的代码,以便在一个单一的程序中支持多种计算设备。这种技术通常用于加速应用程序,特别是在需要大量并行处理的场景中。
3.3.1 生成 hipfb 文件的方法
"../../../../local_amdgpu/bin/clang-offload-bundler" \
"-type=o" "-bundle-align=4096" \
"-targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx906" \
"-input=/dev/null" \
"-input=./param_00-gfx906-65b179.out" \
"-output=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb"
3.3.2 主要功能和作用
1. 打包和解包代码对象:
- clang-offload-bundler 能够将针对不同设备的代码对象(如 CPU 和 GPU 的代码)打包到一个单一的文件中。这使得管理和分发针对异构计算平台的应用程序变得更加简单。
- 同样,它也可以从这种打包的文件中提取特定目标设备的代码对象,以便在适当的设备上执行。
2. 支持异构编程:
- 在异构编程中,开发者可能需要将程序的不同部分优化并编译到不同的硬件平台上。clang-offload-bundler 通过管理这些不同的代码段,简化了构建和部署过程。
3. 简化编译和链接流程:
- 在使用 OpenMP 或 CUDA 等并行编程模型时,clang-offload-bundler 能够处理主机代码和加速器代码之间的交互,包括数据传输和执行控制。这样,开发者可以更专注于代码的并行部分,而不是底层的数据管理和设备控制。
4. 提高性能和可移植性:
- 通过允许代码针对特定硬件进行优化,clang-offload-bundler 帮助提高应用程序的性能。同时,它也支持代码的可移植性,因为同一个应用程序可以针对多种硬件平台进行编译和打包。
3.3.3 使用场景
- 并行计算应用:在需要大量计算资源的应用中,如科学计算、图像处理、机器学习等,clang-offload-bundler 可以帮助开发者有效地利用多种计算资源。
- 开发跨平台应用:对于需要在多种硬件设备上运行的软件,如桌面和移动设备,或者 CPU 和 GPU,clang-offload-bundler 提供了一种统一的方式来处理不同平台的代码。
意义:clang-offload-bundler 还是挺强大的,用于支持和简化异构计算环境中的编程和部署过程。它通过管理针对不同硬件平台的代码,使得开发高性能并行应用程序变得更加高效和可行。
3.3.4 解析 hipfb的方法
clang-offload-bundler -type=o -targets=hip-amdgcn-amd-amdhsa--gfx906 -input=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb -output=device_output.o -unbundle
于是又得到了 bundle前的out 文件:
甚至连文件大小都一样: