[Bug c++/123272] [nvptx] OpenMP, libgomp, miscompilation in matrix multiplication with #pragma omp target teams parallel for collapse(2) if members of classes use templates on nvptx target

schulz.benjamin at googlemail dot com via Gcc-bugs Mon, 05 Jan 2026 02:59:48 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123272


Benjamin Schulz <schulz.benjamin at googlemail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |c++
             Target|nvptx                       |

--- Comment #5 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
I conversed with ChatGPT a bit more. We first tried inserting debug statements
into omp-low.cc and tried patches, but that did not work.

I then had the idea to look at the other dump files.

The problem is already present in the *.original dump (-fdump-tree-original).


Without templates, gcc correctly puts the variable sum within the loops:

#pragma omp target device(dev)
    {
      {
        #pragma omp teams
          {
            {
              #pragma omp distribute collapse(2)
                {
                  {
                    #pragma omp parallel
                      {
                        {
                          {
                            {
                              size_t i;
                              size_t j;

                              #pragma omp for collapse(2) nowait
                                {
                                                                    size_t i;
                                                                    size_t j;
                                  for (i = 0; i < (long unsigned int) rows; ++i)
                                    for (j = 0; j < (long unsigned int) cols; ++j)
                                      {
                                        {
                                          double sum = 0.0;

                                          <<cleanup_point         




But I still don't understand why the loop variables are declared twice...

With templates, gcc moves the declaration of sum higher, above the loops and
next to the outer declarations of i and j:


  #pragma omp target device(dev)
    {
      {
        #pragma omp teams
          {
            {
              {
                #pragma omp distribute collapse(2)
                  {

                      {
                        #pragma omp parallel
                          {
                            {
                              {
                                size_t i;
                                size_t j;
                                double sum = 0.0;

                                #pragma omp for collapse(2) nowait
                                  {
                                                                        size_t i;
                                                                        size_t j;
                                    for (i = 0; i < (long unsigned int) rows; ++i)
                                      for (j = 0; j < (long unsigned int) cols; ++j)
                                        {
                                          <<cleanup_point double sum = 0.0;>>;


On the host, this does not seem to cause a problem, as it appears sum becomes
thread-private.

But for the target, gcc then puts a shared(sum) in front of it, promoting sum
from a thread-private to a shared variable and creating a data race in the
inner sequential loop that updates sum.
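
To illustrate the data-sharing difference described above, here is a minimal
host-only sketch. It is not taken from the report: the function name
matmul_private_sum and the dense row-major layout are assumptions, and it uses
a plain parallel for instead of the target construct. sum is deliberately
declared outside the loop nest, as in the template dump, and an explicit
private(sum) clause gives each thread its own copy. With shared(sum) instead,
all threads would update one variable in the k loop. Whether an explicit
private clause avoids the nvptx miscompilation has not been tested here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the report): C = A * B for dense
// row-major matrices, A is rows x inner, B is inner x cols.
std::vector<double> matmul_private_sum(const std::vector<double>& A,
                                       const std::vector<double>& B,
                                       std::size_t rows, std::size_t inner,
                                       std::size_t cols)
{
    std::vector<double> C(rows * cols, 0.0);

    // Declared outside the loop nest on purpose, mimicking the hoisted
    // declaration seen in the template dump.
    double sum = 0.0;

    // private(sum): every thread works on its own copy of sum.
    // Without the clause, sum would be shared here and the k loop would race.
    #pragma omp parallel for collapse(2) private(sum)
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            sum = 0.0;  // a private copy starts uninitialized, so reset it
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i * inner + k] * B[k * cols + j];
            C[i * cols + j] = sum;
        }
    return C;
}
```

Compiled without -fopenmp the pragma is ignored and the function runs
serially, which gives a reference result to compare against.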

And now this:


If I change the C++ code from:


     #pragma omp target teams distribute parallel for collapse(2) device(dev)
        for (size_t i = 0; i < rows; ++i)
            for (size_t j = 0; j < cols; ++j)
            {
                T sum = T(0);
                for (size_t k = 0; k < inner_dim; ++k)
                {
                    sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
                }
                C.dpdata[i*Cstr0+j*Cstr1] = sum;
            }

into that:

     #pragma omp target teams distribute parallel for collapse(2) device(dev)
        for (size_t i = 0; i < rows; ++i)
        {
            for (size_t j = 0; j < cols; ++j)
            {
                T sum = T(0);
                for (size_t k = 0; k < inner_dim; ++k)
                {
                    sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
                }
                C.dpdata[i*Cstr0+j*Cstr1] = sum;
            }
        }


i.e. if I change 

#pragma omp target teams distribute  parallel for collapse(2) 
for(int i=0;i<10;i++)
 for(int j=0;j<10;j++)
 {T sum=T(0);}

into 

#pragma omp target teams distribute  parallel for collapse(2) 
for(int i=0;i<10;i++)
{ 
  for(int j=0;j<10;j++)
  {T sum=T(0);} 
}


then suddenly, gcc can recognize that the variable sum should be declared
nested within the loops even for the template case...




So this is a problem in the frontend, perhaps in cp/semantics.cc?

It occurs with gcc 14, 15, 16...
