Skip to content

CUDA: Incorrect linkage with -fgpu-rdc on kernels created from lambdas inside anonymous namespaces #54560

Closed
@mkuron

Description

@mkuron

Summary

In the example below, the PTX assembly generated by Clang declares the kernel as .weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_. The same example compiled with NVCC generates .weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_968400945Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_. In both cases, external weak linkage is used, which is not necessary since it is coming from inside a anonymous namespace. With NVCC, this is not a problem because the anonymous namespace is mangled to a unique name (60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_96840094). With Clang however, the name is not unique (mangled to 12_GLOBAL__N_1). This is a problem when passing the resulting object files to nvlink, which will report nvlink fatal error: Internal error: duplicate parameter bank data not same size or nvlink error: Duplicate weak parameter bank for ... depending on the CUDA version. To me, it seems like internal linkage (.entry) instead of weak linkage (.weak .entry) should be used in this case.

Versions

I reproduced the bug with multiple Clang versions between 12.0.0 and 14.0.0. Before abd8cd9 by @yxsamliu, Clang would generate .visible .entry instead of .weak .entry, which isn't any better and actually causes the example below to fail earlier on Multiple definition of '_ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv'().

Potential workaround

It seems like the Clang option -funique-internal-linkage-names should be usable as a workaround that forces the symbol names to be unique, however across all Clang versions this just gives me various internal compiler errors. But that's a different issue.

Working example

This example uses Thrust, which makes the symbol names very lengthy, but I am pretty sure the exact same behavior can also be observed by replacing thrust::transform with a hand-written kernel. The lambda capture seems to be important as just putting a kernel into an anonymous namespace is not sufficient to trigger the problem.

a.cu:

#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

namespace {
    struct Stuff {
        long c[2];
    };

    struct Thing {
        void calc(int *data, int n, Stuff s) {
            auto f = [s] __device__ (int i) -> int {
                return 2*i;
            };
            
            auto first = thrust::counting_iterator<int>(0);
            auto last = first + n;
            thrust::transform(thrust::device, first, last, data, f);
        }

    };
}

void runA(int * data, int n) {
    Thing t;
    Stuff s({0, 0});
    t.calc(data, n, s);
}

b.cu:

#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

namespace {
    struct Stuff {
        long c[4];
    };

    struct Thing {
        void calc(int *data, int n, Stuff s) {
            auto f = [s] __device__ (int i) -> int {
                return 2*i;
            };
            
            auto first = thrust::counting_iterator<int>(0);
            auto last = first + n;
            thrust::transform(thrust::device, first, last, data, f);
        }

    };
}

void runB(int * data, int n) {
    Thing t;
    Stuff s({0, 0, 0, 0});
    t.calc(data, n, s);
}

Compile and link these as follows:

clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c a.cu -fgpu-rdc -o a.o
clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c b.cu -fgpu-rdc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
clang++ -std=c++14 -fPIC -O3 -shared -o ab.so a.o b.o device.o

The second-to-last command will fail with the nvlink error given in the summary.

For comparison, compile and link with NVCC:

nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c a.cu -dc -o a.o
nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c b.cu -dc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -shared a.o b.o device.o -o ab.so

This will succeed.

LLVM IR

Unfortunately, Thrust is currently broken on clang trunk on godbolt.org, so I cannot post a link to the LLVM IR. Running clang locally, I can see that the IR uses define weak_odr void @_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_(%.... My guess is that it should probably use some variation of private or internal instead of weak_odr.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    Status

    Done

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions