https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120814
Bug ID: 120814
Summary: gcc does not compute on nvptx device when the loop is given over a variable whose pointer was offloaded in another function
Product: gcc
Version: 15.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
Target Milestone: ---

Created attachment 61703
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61703&action=edit
mdspan class with offload support and array and vector support

Hi there,

the attached code, compiled with clang and the options

  -std=c++20 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall

yields this (correct) output:

Ordinary matrix multiplication, on gpu
 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

Multiplication should start with a teams loop
 80  90 100 110
176 202 228 254
272 314 356 398
368 426 484 542

With gcc and the options

  -fopenmp -foffload=nvptx-none -fno-stack-protector -Wall

I get this instead:

Ordinary matrix multiplication, on gpu
 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

Multiplication should start with a teams loop
0 0 0 0 0 0 0 0 0 0 0 0

What the code does:

Most of the code is in mdspan_omp.h. It contains a struct called datastruct, whose pointers can be offloaded to the gpu. This struct holds the data, the extents of the tensor, and the strides, plus some functions for accessing the multi-dimensional tensor elements.

mdspan_omp.h also contains a class called mdspan. It has constructors that accept arrays and numbers as extents, strides, and data. It contains the struct datastruct as a member variable, which can be accessed as a reference via the function get_datastruct(). The class mdspan also has member functions for offloading to the gpu.

At first, in the main function of main_omp.cpp, three matrices are defined and filled as mdspan objects and successfully printed out.
Then a function called matrix_multiply_dot in mdspan_omp.h is called, which accepts three mdspan objects. These are offloaded to the gpu: the two input matrices with the member function device_offload, and the result with the member function device_alloc. These member functions of mdspan perform some checks and then call the functions device_datastruct_upload and device_datastruct_alloc, which contain the OpenMP mapping directives and use them to offload a reference of the datastruct member variable to the gpu.

Then matrix_multiply_dot calls get_datastruct() to obtain references to the three datastruct structs, followed by the matrix multiplication loop below. The variables dC, dA, dB are the references to the uploaded datastruct structs; the other variables come from the extents and strides of the mdspan objects:

#pragma omp target teams distribute parallel for collapse(2) \
    shared(inner_dim, rows, cols, strA1, strA0, strB0, strB1, strC0, strC1) \
    device(devicenum)
for (size_t i = 0; i < rows; ++i) {
    for (size_t j = 0; j < cols; ++j) {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k) {
            sum += dA(i, k, strA0, strA1) * dB(k, j, strB0, strB1);
        }
        dC.pdata[i * strC0 + j * strC1] = sum;
    }
}

Finally, the member function host_update of the result mdspan C is called, which copies the result back from the device to the host.

One can verify that the offloading itself works by defining a matrix, reserving device space with device_alloc, and then updating the host with the (random) values from that reservation.
The problem appears to be that the matrix multiplication is no longer carried out on the device, despite this pragma:

#pragma omp target teams distribute parallel for collapse(2) \
    shared(inner_dim, rows, cols, strA1, strA0, strB0, strB1, strC0, strC1) \
    device(devicenum)

That seems to be because the references to the datastruct structs were offloaded by functions that are members of the mdspan classes of A, B and C. If I instead offload the datastruct references dA, dB, dC directly, by calling the globally available functions device_datastruct_upload and device_datastruct_alloc, then gcc and the OpenMP pragma can apparently associate them with the offloaded variables in the loop, and the result coincides with that of clang and is correct. But if device_datastruct_upload and device_datastruct_alloc are called from a different function, which is a member of the mdspan classes of A, B and C and passes the references of their datastruct members on to device_datastruct_upload and device_datastruct_alloc, then gcc apparently does not recognize that dA, dB, dC were offloaded. It then apparently computes on the host: if I choose not to update and copy the data from the device to the host, I still get the results. But they should be on the gpu, not on the host.