https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123272
Benjamin Schulz <schulz.benjamin at googlemail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|target |c++
Target|nvptx |
--- Comment #5 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
I conversed with ChatGPT a bit more. We first tried to insert debug statements
into omp-low.cc and tried patches, but that did not work.
I then had the idea to look at the other dump files.
The problem is already present in the *.original dump.
Without templates, gcc correctly puts the variable sum within the loops:
#pragma omp target device(dev)
{
{
#pragma omp teams
{
{
#pragma omp distribute collapse(2)
{
{
#pragma omp parallel
{
{
{
{
size_t i;
size_t j;
#pragma omp for collapse(2) nowait
{
size_t i;
size_t j;
for (i = 0; i < (long unsigned int) rows;
++i)
for (j = 0; j < (long unsigned int) cols;
++j)
{
{
double sum = 0.0;
<<cleanup_point
But I still don't understand why the loop variables are declared twice...
With templates, gcc moves it higher between the loops:
#pragma omp target device(dev)
{
{
#pragma omp teams
{
{
{
#pragma omp distribute collapse(2)
{
{
#pragma omp parallel
{
{
{
size_t i;
size_t j;
double sum = 0.0;
#pragma omp for collapse(2) nowait
{
size_t i;
size_t j;
for (i = 0; i < (long unsigned int) rows;
++i)
for (j = 0; j < (long unsigned int) cols;
++j)
{
<<cleanup_point
double sum = 0.0;>>;
On the host, this does not seem to cause a problem, as it appears that sum
becomes thread-private.
But for the target, gcc then creates a shared(sum) clause in front of it,
turning sum from a thread-private into a shared variable and creating a data
race in the inner sequential loop that updates sum.
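To make the scoping difference concrete, here is a minimal host-only sketch (not
taken from the bug report; the function names row_sums_local and
row_sums_private_clause are hypothetical). It assumes a plain "parallel for"
instead of the target construct, so it only illustrates what private versus
shared means for an accumulator and does not reproduce the bug itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulator declared inside the loop body: every iteration gets its own
// sum, so no two threads ever touch the same object.
std::vector<double> row_sums_local(const std::vector<double>& m,
                                   std::size_t rows, std::size_t cols)
{
    assert(m.size() == rows * cols);
    std::vector<double> out(rows);
    #pragma omp parallel for
    for (std::size_t i = 0; i < rows; ++i)
    {
        double sum = 0.0;
        for (std::size_t k = 0; k < cols; ++k)
            sum += m[i * cols + k];
        out[i] = sum;
    }
    return out;
}

// Accumulator declared outside the construct. The explicit private(sum)
// clause gives each thread its own copy. If the clause were shared(sum)
// instead, all threads would update one object concurrently, which is the
// kind of data race described above.
std::vector<double> row_sums_private_clause(const std::vector<double>& m,
                                            std::size_t rows, std::size_t cols)
{
    assert(m.size() == rows * cols);
    std::vector<double> out(rows);
    double sum = 0.0;
    #pragma omp parallel for private(sum)
    for (std::size_t i = 0; i < rows; ++i)
    {
        sum = 0.0;  // private copies start uninitialized, so reset per iteration
        for (std::size_t k = 0; k < cols; ++k)
            sum += m[i * cols + k];
        out[i] = sum;
    }
    return out;
}
```

Both functions compute the same result; without -fopenmp the pragmas are
ignored and the loops simply run serially.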
And now this:
If I change the C++ code from:
#pragma omp target teams distribute parallel for collapse(2) device(dev)
for (size_t i = 0; i < rows; ++i)
for (size_t j = 0; j < cols; ++j)
{
T sum = T(0);
for (size_t k = 0; k < inner_dim; ++k)
{
sum += A.dpdata[i*Astr0+k*Astr1]
*B.dpdata[k*Bstr0+j*Bstr1];
}
C.dpdata[i*Cstr0+j*Cstr1]= sum;
}
into that:
#pragma omp target teams distribute parallel for collapse(2) device(dev)
for (size_t i = 0; i < rows; ++i)
{
for (size_t j = 0; j < cols; ++j)
{
T sum = T(0);
for (size_t k = 0; k < inner_dim; ++k)
{
sum += A.dpdata[i*Astr0+k*Astr1]
*B.dpdata[k*Bstr0+j*Bstr1];
}
C.dpdata[i*Cstr0+j*Cstr1]= sum;
}
}
i.e. if I change
#pragma omp target teams distribute parallel for collapse(2)
for(int i=0;i<10;i++)
for(int j=0;j<10;j++)
{T sum=T(0);}
into
#pragma omp target teams distribute parallel for collapse(2)
for(int i=0;i<10;i++)
{
for(int j=0;j<10;j++)
{T sum=T(0);}
}
then suddenly, gcc can recognize that the variable sum should be declared
nested within the loops even for the template case...
So this looks like a problem in the C++ front end, perhaps in
cp/semantics.cc?
It occurs with gcc 14, 15 and 16...
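For reference, the two loop forms compared above can be put side by side in a
self-contained sketch. The function names matmul_flat and matmul_braced, the
raw-pointer interface, the row-major indexing and the map clauses are
assumptions made for illustration; the original code uses A.dpdata with
explicit strides and a device(dev) clause. Whether this reduced form triggers
the same *.original dump difference has not been checked.

```cpp
#include <cassert>
#include <cstddef>

// Form 1: the inner loop follows the outer loop header directly. Per the
// observation above, in a template this is the form where sum ends up
// hoisted above the "omp for" in the dump.
template <typename T>
void matmul_flat(const T* a, const T* b, T* c,
                 std::size_t rows, std::size_t cols, std::size_t inner)
{
    assert(a && b && c);
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:rows * inner], b[0:inner * cols]) map(from: c[0:rows * cols])
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            T sum = T(0);
            for (std::size_t k = 0; k < inner; ++k)
                sum += a[i * inner + k] * b[k * cols + j];
            c[i * cols + j] = sum;
        }
}

// Form 2: identical, except that the outer loop body is wrapped in braces.
// This is the variant reported above to keep sum nested within the loops.
template <typename T>
void matmul_braced(const T* a, const T* b, T* c,
                   std::size_t rows, std::size_t cols, std::size_t inner)
{
    assert(a && b && c);
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:rows * inner], b[0:inner * cols]) map(from: c[0:rows * cols])
    for (std::size_t i = 0; i < rows; ++i)
    {
        for (std::size_t j = 0; j < cols; ++j)
        {
            T sum = T(0);
            for (std::size_t k = 0; k < inner; ++k)
                sum += a[i * inner + k] * b[k * cols + j];
            c[i * cols + j] = sum;
        }
    }
}
```

Compiling both instantiations with -fopenmp -fdump-tree-original and comparing
where "sum" is declared in the two dumps would be one way to check the
difference; without -fopenmp the pragmas are ignored and both run serially.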