Tejas Belagod <tejas.bela...@arm.com> writes: > On 4/10/25 5:13 PM, Tejas Belagod wrote: >> On 4/9/25 4:13 PM, Jakub Jelinek wrote: >>> On Wed, Apr 09, 2025 at 04:01:49PM +0530, Tejas Belagod wrote: >>>>> It also looks like there might be a missing "+" in simd_reduction: >>>>> >>>>> #pragma omp simd reduction (+:va, i) >>>>> for (j = 0; j < 16; j++) >>>>> va = svld1_s32 (svptrue_b32 (), a); >>>>> >>>>> res = svaddv_s32 (svptrue_b32 (), va); >>>>> >>>>> if (res != 8) >>>>> __builtin_abort (); >>>>> >>>>> since AFAICT the loop is not doing a reduction as things stand. >>>>> But perhaps that's deliberate, since it does match the != 8 test. >>>> >>>> That's interesting. I thought the reduction definition in the >>>> 'declare reduction' does the reduction from all the individual >>>> interations >>>> according the the rules defined in the reduction irrespective of the >>>> loop >>>> structure. Maybe 'va' doesn't become implicit private and causes a race >>>> (which may be why I didn't see it in my testing) - I'll try to repro >>>> this >>>> and have a look. >>> >>> No, reduction privatizes the variable, initializes the private >>> variable with >>> the initializer from UDR and reduces at the end only from all the private >>> variables to the original one. >>> >>> For simd, each SIMD lane has one private copy and there are # SIMD lanes >>> reductions into the original, for e.g. worksharing constructs each thread >>> has a private copy and there are omp_get_num_threads () reductions, etc. >>> >>> There is no special action at the end of each loop body, it is up to the >>> user to merge state from each iteration. In sane code the loop body does >>> similar operation to what the reduction does. >>> So say if you have >>> float sum = 0; >>> #pragma omp parallel for reduction (+:sum) >>> for (int i = 0; i < 1024; ++i) >>> sum += a[i]; >>> then if OpenMP pragmas are ignored, all the array members >>> are summed up in that order, while if say there are 4 >>> threads and each handles 256 iterations, then each thread will >>> start with sum = 0; and do sum += a[i]; for i omp_get_thread_num () * 256 >>> to omp_get_thread_num () * 256 + 255 inclusive and finally in some random >>> order the 4 private floats will be summed together. >>> For floating point that can result in different behavior (different >>> rounding >>> etc.) but user said it is ok like that. >>> >> >> Thanks for the explanation. I looked into why some of the tests may >> have failed - my flawed understanding of the reduction clause was why I >> didn't have the += in the loops - it might have passed for me as I >> probably hit the exact omp_get_num_threads () number required for the >> final += reductions to trigger from the declare clause. As Richard >> said, the += in the loops ought to fix the tests. I'm still analysing >> inscan_reduction_incl () to fix it properly. > > Sorry, forgot to ask - Richard, are you happy for me to work on a > separate patch to fix the test alongside your target +sve fix as a > couple more functions (eg simd_reduction () and inscan.. () ) also need > fixing?
Sure, that's fine with me. Should I apply the patch, or would you prefer to wait until your patch is ready? Richard