CUDA: Incorrect linkage with -fgpu-rdc on kernels created from lambdas inside anonymous namespaces

**Summary**

In the example below, the PTX assembly generated by Clang declares the kernel as `.weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_`. The same example compiled with NVCC generates `.weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_968400945Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_`. In both cases, external weak linkage is used, which is not necessary since it is coming from inside a anonymous namespace. With NVCC, this is not a problem because the anonymous namespace is mangled to a unique name (`60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_96840094`). With Clang however, the name is not unique (mangled to [`12_GLOBAL__N_1`](https://github.com/llvm/llvm-project/blob/3251ba2d0fcf5223fce3e270b91c54f548664b4e/clang/lib/AST/ItaniumMangle.cpp#L1491)). This is a problem when passing the resulting object files to `nvlink`, which will report `nvlink fatal error: Internal error: duplicate parameter bank data not same size` or `nvlink error: Duplicate weak parameter bank for ...` depending on the CUDA version. To me, it seems like internal linkage (`.entry`) instead of weak linkage (`.weak .entry`) should be used in this case.

**Versions**

I reproduced the bug with multiple Clang versions between 12.0.0 and 14.0.0. Before abd8cd9199d1e14cae961e1067b78df7044179a3 by @yxsamliu, Clang would generate `.visible .entry` instead of `.weak .entry`, which isn't any better and actually causes the example below to fail earlier on `Multiple definition of '_ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv'()`.

**Potential workaround**

It seems like the Clang option `-funique-internal-linkage-names` should be usable as a workaround that forces the symbol names to be unique, however across all Clang versions this just gives me various internal compiler errors. But that's a different issue.

**Working example**

This example uses Thrust, which makes the symbol names very lengthy, but I am pretty sure the exact same behavior can also be observed by replacing `thrust::transform` with a hand-written kernel. The lambda capture seems to be important as just putting a kernel into an anonymous namespace is not sufficient to trigger the problem.

a.cu:
```cuda
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

namespace {
    struct Stuff {
        long c[2];
    };

    struct Thing {
        void calc(int *data, int n, Stuff s) {
            auto f = [s] __device__ (int i) -> int {
                return 2*i;
            };
            
            auto first = thrust::counting_iterator<int>(0);
            auto last = first + n;
            thrust::transform(thrust::device, first, last, data, f);
        }

    };
}

void runA(int * data, int n) {
    Thing t;
    Stuff s({0, 0});
    t.calc(data, n, s);
}
```

b.cu:
```cuda
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

namespace {
    struct Stuff {
        long c[4];
    };

    struct Thing {
        void calc(int *data, int n, Stuff s) {
            auto f = [s] __device__ (int i) -> int {
                return 2*i;
            };
            
            auto first = thrust::counting_iterator<int>(0);
            auto last = first + n;
            thrust::transform(thrust::device, first, last, data, f);
        }

    };
}

void runB(int * data, int n) {
    Thing t;
    Stuff s({0, 0, 0, 0});
    t.calc(data, n, s);
}
```

Compile and link these as follows:
```bash
clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c a.cu -fgpu-rdc -o a.o
clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c b.cu -fgpu-rdc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
clang++ -std=c++14 -fPIC -O3 -shared -o ab.so a.o b.o device.o
```
The second-to-last command will fail with the nvlink error given in the summary.

For comparison, compile and link with NVCC:
```bash
nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c a.cu -dc -o a.o
nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c b.cu -dc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -shared a.o b.o device.o -o ab.so
```
This will succeed.

**LLVM IR**

Unfortunately, Thrust is currently broken on clang trunk on godbolt.org, so I cannot post a link to the LLVM IR. Running clang locally, I can see that the IR uses `define weak_odr void @_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_(%...`. My guess is that it should probably use some variation of `private` or `internal` instead of `weak_odr`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA: Incorrect linkage with -fgpu-rdc on kernels created from lambdas inside anonymous namespaces #54560

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

CUDA: Incorrect linkage with -fgpu-rdc on kernels created from lambdas inside anonymous namespaces #54560

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions