Description
Summary
In the example below, the PTX assembly generated by Clang declares the kernel as .weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_
. The same example compiled with NVCC generates .weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_968400945Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_
. In both cases, external weak linkage is used, which is not necessary since it is coming from inside a anonymous namespace. With NVCC, this is not a problem because the anonymous namespace is mangled to a unique name (60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_96840094
). With Clang however, the name is not unique (mangled to 12_GLOBAL__N_1
). This is a problem when passing the resulting object files to nvlink
, which will report nvlink fatal error: Internal error: duplicate parameter bank data not same size
or nvlink error: Duplicate weak parameter bank for ...
depending on the CUDA version. To me, it seems like internal linkage (.entry
) instead of weak linkage (.weak .entry
) should be used in this case.
Versions
I reproduced the bug with multiple Clang versions between 12.0.0 and 14.0.0. Before abd8cd9 by @yxsamliu, Clang would generate .visible .entry
instead of .weak .entry
, which isn't any better and actually causes the example below to fail earlier on Multiple definition of '_ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv'()
.
Potential workaround
It seems like the Clang option -funique-internal-linkage-names
should be usable as a workaround that forces the symbol names to be unique, however across all Clang versions this just gives me various internal compiler errors. But that's a different issue.
Working example
This example uses Thrust, which makes the symbol names very lengthy, but I am pretty sure the exact same behavior can also be observed by replacing thrust::transform
with a hand-written kernel. The lambda capture seems to be important as just putting a kernel into an anonymous namespace is not sufficient to trigger the problem.
a.cu:
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
namespace {
struct Stuff {
long c[2];
};
struct Thing {
void calc(int *data, int n, Stuff s) {
auto f = [s] __device__ (int i) -> int {
return 2*i;
};
auto first = thrust::counting_iterator<int>(0);
auto last = first + n;
thrust::transform(thrust::device, first, last, data, f);
}
};
}
void runA(int * data, int n) {
Thing t;
Stuff s({0, 0});
t.calc(data, n, s);
}
b.cu:
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
namespace {
struct Stuff {
long c[4];
};
struct Thing {
void calc(int *data, int n, Stuff s) {
auto f = [s] __device__ (int i) -> int {
return 2*i;
};
auto first = thrust::counting_iterator<int>(0);
auto last = first + n;
thrust::transform(thrust::device, first, last, data, f);
}
};
}
void runB(int * data, int n) {
Thing t;
Stuff s({0, 0, 0, 0});
t.calc(data, n, s);
}
Compile and link these as follows:
clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c a.cu -fgpu-rdc -o a.o
clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c b.cu -fgpu-rdc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
clang++ -std=c++14 -fPIC -O3 -shared -o ab.so a.o b.o device.o
The second-to-last command will fail with the nvlink error given in the summary.
For comparison, compile and link with NVCC:
nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c a.cu -dc -o a.o
nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c b.cu -dc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -shared a.o b.o device.o -o ab.so
This will succeed.
LLVM IR
Unfortunately, Thrust is currently broken on clang trunk on godbolt.org, so I cannot post a link to the LLVM IR. Running clang locally, I can see that the IR uses define weak_odr void @_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_(%...
. My guess is that it should probably use some variation of private
or internal
instead of weak_odr
.
Metadata
Metadata
Assignees
Type
Projects
Status