Description
The code attached to this bug report has no memory problems, but it fails with
Entirely on gpu
"PluginInterface" error: Failure to synchronize stream (nil): Error in cuStreamSynchronize: an illegal memory access was encountered
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
mdspan_acc.h:3807:9: omptarget fatal error 1: failure of target construct while offloading is mandatory
Process returned -1 (0xFFFFFFFF) execution time : 0.394 s
Upon investigation, it seems that dQ and dR are not offloaded correctly in the function qr_decomposition, at line 3792, which calls
create_in_struct(dA);
create_out_struct(dQ);
create_out_struct(dR);
These functions do the mappings with
template <typename T>
void inline create_in_struct(const datastruct<T>& dA)
{
#pragma omp target enter data map(to: dA,dA.pdata[0:dA.pdatalength],dA.pextents[0:dA.prank],dA.pstrides[0:dA.prank])
}
template <typename T>
void inline create_out_struct(datastruct<T>& dA)
{
#pragma omp target enter data map(to: dA) map(alloc: dA.pdata[0:dA.pdatalength]) map(to:dA.pextents[0:dA.prank],dA.pstrides[0:dA.prank])
}
and then calls
gpu_qr_decomposition.
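One way to test the suspicion that dQ and dR are not mapped would be to query the runtime right before the call to gpu_qr_decomposition. The following is only a sketch: device_presence is a hypothetical helper that is not part of the attached code. It relies on the standard OpenMP routine omp_target_is_present and falls back to -1 when the file is compiled without OpenMP.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical debugging helper, not part of the attached code.
// Returns 1 if host_ptr has a mapping on the default device,
// 0 if it has none, and -1 if the file was compiled without OpenMP.
inline int device_presence(const void* host_ptr)
{
#ifdef _OPENMP
    // const_cast keeps this compatible with older omp.h headers
    // that declare the first parameter as void*.
    return omp_target_is_present(const_cast<void*>(host_ptr),
                                 omp_get_default_device()) ? 1 : 0;
#else
    (void)host_ptr;
    return -1;
#endif
}
```

If the mappings took effect, device_presence(dQ.pdata) and device_presence(dR.pdata) would be expected to return 1 after the create_out_struct calls when the code is built with offloading.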
If one removes lines 1863-1997 in gpu_qr_decomposition, in particular these lines:
#pragma omp parallel for
for (size_t i=0; i<Q.pdatalength; i++)
{
Q.pdata[i]=0;
}
//
//
#pragma omp parallel for
for (size_t i=0; i<R.pdatalength; i++)
{
R.pdata[i]=0;
}
then the code suddenly compiles...
The strange thing is that, before this point, similar code is called for a Cholesky and an LU decomposition.
There, the code works provided I compile it without optimization.
If I compile the code with -O2, then clang takes a very long time to finish, and in the resulting program the LU decomposition crashes too.
There is no problem with the array sizes: I have checked them, and these are just test cases with 9 elements (3x3 matrix).
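A reduced test case along the following lines might help to isolate the problem. It is only a sketch under assumptions: mini_struct is a hypothetical stand-in for datastruct that keeps just the pdata and pdatalength members, and zero_on_device is a hypothetical function that combines the mapping from create_out_struct with the zeroing loop. Unlike the loops quoted above, it uses an explicit target construct, since the surrounding context of those loops is not shown here.

```cpp
#include <cstddef>

// Hypothetical stand-in for the report's datastruct (only two members).
struct mini_struct
{
    double* pdata;
    std::size_t pdatalength;
};

// Maps the struct the way create_out_struct does, zeroes the data in a
// target region, copies it back, and reports whether every element is zero.
// Without OpenMP the pragmas are ignored and the loop runs on the host.
inline bool zero_on_device(mini_struct& s)
{
    #pragma omp target enter data map(to: s) map(alloc: s.pdata[0:s.pdatalength])

    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < s.pdatalength; i++)
    {
        s.pdata[i] = 0;
    }

    // Copy the zeroed buffer back, then drop the mapping of the struct itself.
    #pragma omp target exit data map(from: s.pdata[0:s.pdatalength])
    #pragma omp target exit data map(release: s)

    for (std::size_t i = 0; i < s.pdatalength; i++)
    {
        if (s.pdata[i] != 0)
        {
            return false;
        }
    }
    return true;
}
```

Calling zero_on_device on a mini_struct that wraps a 9-element buffer (the 3x3 case) should return true. If this small version already fails with an illegal memory access, the mapping itself is the likely culprit. If it passes, the problem is more likely in the surrounding code of gpu_qr_decomposition.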