On 4/9/25 4:13 PM, Jakub Jelinek wrote:
On Wed, Apr 09, 2025 at 04:01:49PM +0530, Tejas Belagod wrote:
It also looks like there might be a missing "+" in simd_reduction:

    #pragma omp simd reduction (+:va, i)
    for (j = 0; j < 16; j++)
      va = svld1_s32 (svptrue_b32 (), a);

    res = svaddv_s32 (svptrue_b32 (), va);

    if (res != 8)
      __builtin_abort ();

since AFAICT the loop is not doing a reduction as things stand.
But perhaps that's deliberate, since it does match the != 8 test.

That's interesting. I thought the reduction definition in the
'declare reduction' does the reduction from all the individual interations
according the the rules defined in the reduction irrespective of the loop
structure.  Maybe 'va' doesn't become implicit private and causes a race
(which may be why I didn't see it in my testing) - I'll try to repro this
and have a look.

No, reduction privatizes the variable, initializes the private variable with
the initializer from UDR and reduces at the end only from all the private
variables to the original one.

For simd, each SIMD lane has one private copy and there are # SIMD lanes
reductions into the original, for e.g. worksharing constructs each thread
has a private copy and there are omp_get_num_threads () reductions, etc.

There is no special action at the end of each loop body, it is up to the
user to merge state from each iteration.  In sane code the loop body does
similar operation to what the reduction does.
So say if you have
float sum = 0;
#pragma omp parallel for reduction (+:sum)
for (int i = 0; i < 1024; ++i)
   sum += a[i];
then if OpenMP pragmas are ignored, all the array members
are summed up in that order, while if say there are 4
threads and each handles 256 iterations, then each thread will
start with sum = 0; and do sum += a[i]; for i omp_get_thread_num () * 256
to omp_get_thread_num () * 256 + 255 inclusive and finally in some random
order the 4 private floats will be summed together.
For floating point that can result in different behavior (different rounding
etc.) but user said it is ok like that.


Thanks for the explanation. I looked into why some of the tests may have failed - my flawed understanding of the reduction clause was why I didn't have the += in the loops - it might have passed for me as I probably hit the exact omp_get_num_threads () number required for the final += reductions to trigger from the declare clause. As Richard said, the += in the loops ought to fix the tests. I'm still analysing inscan_reduction_incl () to fix it properly.

Thanks,
Tejas.

Reply via email to