Tejas Belagod <tejas.bela...@arm.com> writes:
> On 4/10/25 5:13 PM, Tejas Belagod wrote:
>> On 4/9/25 4:13 PM, Jakub Jelinek wrote:
>>> On Wed, Apr 09, 2025 at 04:01:49PM +0530, Tejas Belagod wrote:
>>>>> It also looks like there might be a missing "+" in simd_reduction:
>>>>>
>>>>>     #pragma omp simd reduction (+:va, i)
>>>>>     for (j = 0; j < 16; j++)
>>>>>       va = svld1_s32 (svptrue_b32 (), a);
>>>>>
>>>>>     res = svaddv_s32 (svptrue_b32 (), va);
>>>>>
>>>>>     if (res != 8)
>>>>>       __builtin_abort ();
>>>>>
>>>>> since AFAICT the loop is not doing a reduction as things stand.
>>>>> But perhaps that's deliberate, since it does match the != 8 test.
>>>>
>>>> That's interesting. I thought the reduction definition in the
>>>> 'declare reduction' does the reduction from all the individual
>>>> iterations according to the rules defined in the reduction,
>>>> irrespective of the loop structure.  Maybe 'va' doesn't become
>>>> implicitly private and causes a race (which may be why I didn't see
>>>> it in my testing) - I'll try to repro this and have a look.
>>>
>>> No, reduction privatizes the variable, initializes the private
>>> variable with the initializer from the UDR and, only at the end,
>>> reduces all the private variables into the original one.
>>>
>>> For simd, each SIMD lane has one private copy and there are as many
>>> reductions into the original as there are SIMD lanes; for worksharing
>>> constructs, e.g., each thread has a private copy and there are
>>> omp_get_num_threads () reductions, etc.
>>>
>>> There is no special action at the end of each loop body; it is up to
>>> the user to merge state from each iteration.  In sane code the loop
>>> body does a similar operation to what the reduction does.
>>> So say if you have
>>> float sum = 0;
>>> #pragma omp parallel for reduction (+:sum)
>>> for (int i = 0; i < 1024; ++i)
>>>    sum += a[i];
>>> then, if the OpenMP pragmas are ignored, all the array members are
>>> summed up in that order, while if, say, there are 4 threads and each
>>> handles 256 iterations, then each thread will start with sum = 0; and
>>> do sum += a[i]; for i from omp_get_thread_num () * 256 to
>>> omp_get_thread_num () * 256 + 255 inclusive, and finally, in some
>>> random order, the 4 private floats will be summed together.
>>> For floating point that can result in different behavior (different
>>> rounding etc.), but the user has said that is OK.
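>>> Hand-expanded, just to illustrate the semantics (this is not what the
>>> compiler literally emits), the 4-thread case above behaves roughly like:
>>> float priv[4] = { 0, 0, 0, 0 };  /* one private copy per thread */
>>> for (int t = 0; t < 4; ++t)      /* each thread's chunk of 256 iterations */
>>>   for (int i = t * 256; i < t * 256 + 256; ++i)
>>>     priv[t] += a[i];
>>> for (int t = 0; t < 4; ++t)      /* the 4 final reductions, in some order */
>>>   sum += priv[t];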
>>>
>> 
>> Thanks for the explanation.  I looked into why some of the tests may
>> have failed - my flawed understanding of the reduction clause is why I
>> didn't have the += in the loops.  The tests might have passed for me
>> because I probably hit the exact omp_get_num_threads () value needed
>> for the final += reductions from the declare clause to produce the
>> expected results.  As Richard said, the += in the loops ought to fix
>> the tests.  I'm still analysing inscan_reduction_incl () to fix it
>> properly.
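>> For simd_reduction that change would be something like the below
>> (untested; the expected value in the res check will likely need
>> adjusting to match):
>>
>>   #pragma omp simd reduction (+:va, i)
>>   for (j = 0; j < 16; j++)
>>     va += svld1_s32 (svptrue_b32 (), a);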
>
> Sorry, forgot to ask - Richard, are you happy for me to work on a
> separate patch to fix the test alongside your target +sve fix, since a
> couple more functions (e.g. simd_reduction () and inscan.. ()) also
> need fixing?

Sure, that's fine with me.  Should I apply the patch, or would you
prefer to wait until your patch is ready?

Richard
