https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120814

            Bug ID: 120814
           Summary: gcc does not compute on nvptx device when the loop is
                    given over a variable whose pointer was offloaded in
                    another function....
           Product: gcc
           Version: 15.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
  Target Milestone: ---

Created attachment 61703
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61703&action=edit
mdspan class with offload support and array and vector support

Hi there, the attached code, compiled with clang and the options

 -std=c++20  -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda  -Wall

yields this (correct) output:

Ordinary matrix multiplication, on gpu
1 2 3 4 
5 6 7 8 
9 10 11 12 
13 14 15 16 

0 1 2 3 
4 5 6 7 
8 9 10 11 
12 13 14 15 

1 1 1 1 
1 1 1 1 
1 1 1 1 
1 1 1 1 

Multiplication should start with a teams loop 
80 90 100 110 
176 202 228 254 
272 314 356 398 
368 426 484 542 


while with gcc and the options
 -fopenmp -foffload=nvptx-none  -fno-stack-protector -Wall
 I get this:

Ordinary matrix multiplication, on gpu
1 2 3 4 
5 6 7 8 
9 10 11 12 
13 14 15 16 

0 1 2 3 
4 5 6 7 
8 9 10 11 
12 13 14 15 

1 1 1 1 
1 1 1 1 
1 1 1 1 
1 1 1 1 

Multiplication should start with a teams loop 
0 0 0 0 
0 0 0 0 
0 0 0 0 

What the code does:

Most of the code is in mdspan_omp.h. It contains a struct called datastruct
whose pointers can be offloaded to the GPU. The struct holds the data, the
extents of the tensor, and the strides, plus some functions for accessing the
multi-dimensional tensor elements.
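
For reference, a minimal sketch of the datastruct layout as described. Only
the member name pdata is confirmed by the loop shown further below; the other
member names and the operator() signature are assumptions, not the exact
declarations from the attachment:

    #include <cstddef>

    // Sketch of the datastruct described above (member names other than
    // pdata are assumptions).
    template <typename T>
    struct datastruct
    {
        T      *pdata;     // flat element storage
        size_t *pextents;  // extent of each tensor dimension
        size_t *pstrides;  // stride of each tensor dimension
        size_t  prank;     // number of dimensions

        // 2D element access with explicit strides, matching the
        // dA(i, k, strA0, strA1) calls in the multiplication loop
        T operator()(size_t i, size_t j, size_t s0, size_t s1) const
        {
            return pdata[i * s0 + j * s1];
        }
    };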

mdspan_omp.h also contains a class called mdspan. Its constructors accept
arrays and numbers as extents, strides, and data. It holds the struct
datastruct as a member variable, which can be accessed by reference via the
function get_datastruct().

The class mdspan also has member functions for offloading to the GPU.
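
A rough sketch of the mdspan wrapper, building on the datastruct sketch
above; only get_datastruct() and the offload function names are taken from
this report, the constructor shape is an assumption:

    // Sketch of the mdspan class described above (constructor shape assumed).
    template <typename T>
    class mdspan
    {
        datastruct<T> d;  // the offloadable member
    public:
        mdspan(T *data, size_t *extents, size_t *strides, size_t rank)
            : d{data, extents, strides, rank} {}

        // hands out a reference to the member that the kernels later use
        datastruct<T>& get_datastruct() { return d; }

        void device_offload(int devicenum);  // maps d and its arrays to the device
        void device_alloc(int devicenum);    // allocates d's data on the device
        void host_update(int devicenum);     // copies d.pdata back to the host
    };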

First, in the main function of main_omp.cpp, three matrices are defined as
mdspan objects, filled, and successfully printed.

Then a function called matrix_multiply_dot in mdspan_omp.h is called, which
accepts three mdspan objects.


Then the matrices are offloaded to the GPU. The two input matrices are
offloaded with the member function device_offload, and the result with the
member function device_alloc.

These member functions of mdspan perform some checks and then call the
functions device_datastruct_upload and device_datastruct_alloc, which contain
the OpenMP mapping pragmas. device_datastruct_upload and
device_datastruct_alloc then use those mapping pragmas to offload the
referenced datastruct member variable to the GPU.
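
A hedged guess at what device_datastruct_upload and device_datastruct_alloc
might do internally, namely unstructured enter-data mappings of the struct
plus the arrays its pointers reference. The signatures and the element-count
parameter n are assumptions; the actual map clauses in the attachment may
differ:

    // Assumed internals of the upload function: mapping d together with its
    // array sections attaches the device copies of pdata/pextents/pstrides
    // to the device copy of d.
    template <typename T>
    void device_datastruct_upload(datastruct<T> &d, size_t n, int devicenum)
    {
        size_t r = d.prank;
        #pragma omp target enter data \
            map(to: d, d.pdata[0:n], d.pextents[0:r], d.pstrides[0:r]) \
            device(devicenum)
    }

    // Assumed internals of the alloc function: the result buffer only needs
    // device storage, not a host-to-device copy.
    template <typename T>
    void device_datastruct_alloc(datastruct<T> &d, size_t n, int devicenum)
    {
        size_t r = d.prank;
        #pragma omp target enter data \
            map(to: d, d.pextents[0:r], d.pstrides[0:r]) \
            map(alloc: d.pdata[0:n]) device(devicenum)
    }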

Then matrix_multiply_dot calls the get_datastruct() function to obtain
references to the three datastruct structs.


Then the loop for the matrix multiplication follows. The variables dC, dA, dB
are the references to the uploaded datastruct structs; the other variables
come from the extents and strides of the mdspan objects:
        #pragma omp target teams distribute parallel for collapse(2) \
            shared(inner_dim, rows, cols, strA0, strA1, strB0, strB1, strC0, strC1) \
            device(devicenum)
        for (size_t i = 0; i < rows; ++i)
        {
            for (size_t j = 0; j < cols; ++j)
            {
                T sum = 0;
                #pragma omp simd reduction(+:sum)
                for (size_t k = 0; k < inner_dim; ++k)
                {
                    sum += dA(i, k, strA0, strA1) * dB(k, j, strB0, strB1);
                }
                dC.pdata[i * strC0 + j * strC1] = sum;
            }
        }


Finally, the member function host_update of the result mdspan C is called,
which copies the result from the device back to the host.
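
Presumably that update boils down to something like the following sketch; the
free-function form and the element-count parameter n are assumptions:

    // Assumed core of the host update: copy the computed elements from the
    // device buffer back to the host.
    template <typename T>
    void datastruct_host_update(datastruct<T> &d, size_t n, int devicenum)
    {
        #pragma omp target update from(d.pdata[0:n]) device(devicenum)
    }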

One can verify that the offloading itself works by defining a matrix,
reserving device space with device_alloc, and then updating the host with the
random (uninitialized) values from that reservation.


The problem appears to be that the matrix multiplication is no longer carried
out on the device, despite this pragma:

        #pragma omp target teams distribute parallel for collapse(2) \
            shared(inner_dim, rows, cols, strA0, strA1, strB0, strB1, strC0, strC1) \
            device(devicenum)


That seems to be because the references to the datastruct structs were
offloaded by functions that are members of the mdspan objects A, B, and C.

If I instead offload the datastruct references dA, dB, dC directly, by
calling the globally available functions device_datastruct_upload and
device_datastruct_alloc, then gcc and the OpenMP pragma can apparently
associate them in the loop with the offloaded variables, and the result
coincides with clang's and is correct.

But if device_datastruct_upload and device_datastruct_alloc are called from a
different function, i.e. a member function of the mdspan objects A, B, and C
that passes references to their datastruct members, then gcc apparently
cannot recognize that dA, dB, dC were offloaded.
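
The two call patterns contrasted, using the assumed names from the sketches
above (the helper function and the parameter n are hypothetical):

    // Hypothetical helper contrasting the two upload paths described above.
    template <typename T>
    void upload_variants(mdspan<T> &A, size_t n, int devicenum)
    {
        datastruct<T> &dA = A.get_datastruct();

        // Variant 1 -- works with gcc: the mapping call is made directly on
        // the reference that the target region later uses.
        device_datastruct_upload(dA, n, devicenum);

        // Variant 2 -- fails with gcc: the same mapping performed one call
        // level deeper, inside a member function of the mdspan object; the
        // subsequent target loop then apparently runs on the host.
        // A.device_offload(devicenum);
    }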

It then apparently computes on the host: if I choose not to update, i.e. not
to copy the data from the device to the host, I still get the results.

But they should be on the GPU, not on the host.
