On Tue, 11 Nov 2025, Victor Do Nascimento wrote:

> Sorry about the delayed reply, I've been obsessing over solving one
> last snag in the peeling for alignment implementation.
> 
> The reduction work was done trying to fix a regression I'd introduced in
> the libstdc++-v3 unit tests while implementing these patches.
> 
> I reduced the unit test in question and I'm attaching that at the end of this
> message.
> 
> Ironically, in writing up different testcases in plain C for
> gcc.dg/vect, I've since discovered I fail to vectorize far simpler reduction
> cases.
> 
> For the trivial case of
> 
>   sum = 0;
>   while (1)
>     {
>       if (a[i] == 0) break;
>       sum += a[i];
>       i++;
>     }
> 
> we get the following CFG:
> 
>   <bb 2> [local count: 118111600]:
>   _17 = *a_8(D);
>   if (_17 == 0)
>     goto <bb 7>; [11.00%]
>   else
>     goto <bb 5>; [89.00%]
> 
>   <bb 5> [local count: 105119324]:
> 
>   <bb 3> [local count: 955630224]:
>   # _18 = PHI <_4(6), _17(5)>
>   # sum_19 = PHI <sum_10(6), sum_7(D)(5)>
>   # i_21 = PHI <i_11(6), 0(5)>
>   sum_10 = _18 + sum_19;
>   i_11 = i_21 + 1;
>   _1 = (long unsigned int) i_11;
>   _2 = _1 * 4;
>   _3 = a_8(D) + _2;
>   _4 = *_3;
>   if (_4 == 0)
>     goto <bb 8>; [11.00%]
>   else
>     goto <bb 6>; [89.00%]
> 
> and the vectorizer doesn't quite know how to handle the PHI <_4(6), _17(5)>,
> categorized as `vect_unknown_def_type', so I'll figure out what to do about
> that.

Generally for reductions we do not support an intermediate result (aka
_18 here) being live, because reductions are re-associated and so we
cannot compute that specific value.  Now, specifically _this_ value
we _can_ compute, so the restriction is a bit over-eager.

But it's not something to solve as part of this series.  It would be
interesting to track in a bug though.
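
To make the shape of that CFG concrete, here is a rough sketch in plain C
of the source loop next to a hand-rotated form that mirrors the dump: the
first load is peeled into the preheader and each later load is carried
across the back edge, which is what PHI <_4(6), _17(5)> expresses.  The
function names and the zero-terminated-array assumption are mine, not
from the patch or the testsuite.

```c
/* Source form, as in the quoted snippet.  Assumes A is terminated by
   a zero element (an assumption made for this sketch).  */
int
sum_until_zero (const int *a)
{
  int sum = 0;
  int i = 0;
  while (1)
    {
      if (a[i] == 0)
        break;
      sum += a[i];
      i++;
    }
  return sum;
}

/* Hand-rotated form following the dump: the loaded value is carried
   into the next iteration instead of being reloaded.  */
int
sum_until_zero_rotated (const int *a)
{
  int sum = 0;
  int cur = a[0];          /* _17 = *a_8(D) */
  if (cur == 0)
    return sum;
  int i = 0;
  for (;;)
    {
      sum += cur;          /* sum_10 = _18 + sum_19 */
      i++;                 /* i_11 = i_21 + 1 */
      int next = a[i];     /* _4 = *_3 */
      if (next == 0)
        break;
      cur = next;          /* _18 = PHI <_4(6), _17(5)> */
    }
  return sum;
}
```

Both functions return the same sum for any zero-terminated input; the
second one just makes the loop-carried load visible.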

> Now for the promised testcase...
> 
> #include <numeric>
> #include <iterator>
> #include <cassert>
> 
> int a[]  = {4, 5, 6, 7, 8, 9, 10, 11};
> double b[] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
> int N = 8;
> 
> template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
> _Tp
> transform_reduce_a(_InputIterator1 a0, _InputIterator1 aN,
>                 _InputIterator2 b0, _Tp accum)
> {
>   while ((aN - a0) >= 4)

So we fail to analyze this as a counted loop?

>     {
>       _Tp __v1 = (a0[0] * b0[0]) + (a0[1] * b0[1]);
>       _Tp __v2 = (a0[2] * b0[2]) + (a0[3] * b0[3]);
>       _Tp __v3 = (__v1 + __v2);
>       accum = (accum + __v3);
>       a0 += 4;
>       b0 += 4;
>     }
>   for (; a0 != aN; ++a0, (void) ++b0)
>     accum = (accum + (*a0 * *b0));
>   return accum;
> }
> 
> void
> test01()
> {
>   auto res = transform_reduce_a(std::begin(a), std::end(a), std::begin(b),
>                              std::move (1.0f));
>   assert( res == (float)(1 + 30) );
> }
> int
> main()
> {
>   test01();
> }
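
On the counted-loop question above, a small sketch in plain C of the
arithmetic involved: with n = aN - a0 known on entry, the unrolled-by-4
loop runs n / 4 times (for n >= 4, else not at all).  Whether niter
analysis actually derives this for the testcase is exactly the open
question; the helper names below are hypothetical and only illustrate
the trip count.

```c
#include <stddef.h>

/* Counts iterations by running the same loop control as the testcase.  */
size_t
unrolled_iters_loop (const int *a0, const int *aN)
{
  size_t iters = 0;
  while ((aN - a0) >= 4)
    {
      a0 += 4;
      iters++;
    }
  return iters;
}

/* Closed form of the same count, computed before entering the loop.  */
size_t
unrolled_iters_closed (const int *a0, const int *aN)
{
  ptrdiff_t n = aN - a0;
  return n >= 4 ? (size_t) n / 4 : 0;
}
```

For the 8-element array in the testcase both give 2, leaving nothing for
the scalar tail loop.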
> 
> 
> Many thanks,
> Victor
> 
> On 11/11/25 13:59, Richard Biener wrote:
> > On Tue, 11 Nov 2025, Tamar Christina wrote:
> > 
> >>> -----Original Message-----
> >>> From: Richard Biener <[email protected]>
> >>> Sent: 11 November 2025 12:59
> >>> To: Tamar Christina <[email protected]>
> >>> Cc: Victor Do Nascimento <[email protected]>; gcc-
> >>> [email protected]
> >>> Subject: RE: [PATCH 08/13] vect: Reclassify early break fold left reductions as simple reductions
> >>>
> >>> On Tue, 11 Nov 2025, Tamar Christina wrote:
> >>>
> >>>>> -----Original Message-----
> >>>>> From: Richard Biener <[email protected]>
> >>>>> Sent: 11 November 2025 12:16
> >>>>> To: Victor Do Nascimento <[email protected]>
> >>>>> Cc: [email protected]; Tamar Christina
> >>> <[email protected]>;
> >>>>> Victor Do Nascimento <[email protected]
> >>>>> 1.compute.internal>
> >>>>> Subject: Re: [PATCH 08/13] vect: Reclassify early break fold left reductions as simple reductions
> >>>>>
> >>>>> On Mon, 10 Nov 2025, Victor Do Nascimento wrote:
> >>>>>
> >>>>>> From: Victor Do Nascimento <[email protected]
> >>>>> 1.compute.internal>
> >>>>>>
> >>>>>> This re-categorization of reductions in uncounted loops leads to
> >>>>>> `vect_create_epilog_for_reduction' being called correctly.
> >>>>>>
> >>>>>> gcc/ChangeLog:
> >>>>>>
> >>>>>>  * tree-vect-loop.cc (vectorizable_reduction): Reclassify
> >>>>>>  uncounted-loop VECT_REDUC_INFO_TYPE as TREE_CODE_REDUCTION.
> >>>>>> ---
> >>>>>>   gcc/tree-vect-loop.cc | 5 +++--
> >>>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >>>>>> index 901903cfbea..3b038169c95 100644
> >>>>>> --- a/gcc/tree-vect-loop.cc
> >>>>>> +++ b/gcc/tree-vect-loop.cc
> >>>>>> @@ -7426,8 +7426,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >>>>>>                                 "supported.\n");
> >>>>>>          return false;
> >>>>>>            }
> >>>>>> -        VECT_REDUC_INFO_TYPE (reduc_info)
> >>>>>> -          = reduction_type = FOLD_LEFT_REDUCTION;
> >>>>>> +        VECT_REDUC_INFO_TYPE (reduc_info) = reduction_type
> >>>>>> +          = LOOP_VINFO_NITERS_UNCOUNTED_P (loop_vinfo) ? TREE_CODE_REDUCTION
> >>>>>> +          : FOLD_LEFT_REDUCTION;
> >>>>>
> >>>>> I don't think this is correct.  We've arrived here with a
> >>>>> needs_fold_left_reduction_p check, if we cannot use a
> >>>>> FOLD_LEFT_REDUCTION
> >>>>> we have to fail.
> >>>>>
> >>>>> That said, instead of vect_create_epilog_for_reduction this goes
> >>>>> through vectorize_fold_left_reduction which re-uses the original
> >>>>> scalar reduction PHI and thus any specific early-break handling would
> >>>>> need to go there.
> >>>>>
> >>>>> I believe that if this is an issue with respect to re-starting then
> >>>>> that very same issue is present generally for early break vectorization.
> >>>>
> >>>> Agree, I think vectorizable_reduction is missing support for reducing
> >>>> from def 0.
> >>>>
> >>>> Note that we normally fail to analyse the reduction, so we never
> >>>> get here, hence the missing support.  So I'm somewhat surprised
> >>>> uncounted loops did.
> >>>>
> >>>> Is there a testcase that shows this?
> >>>
> >>> Just add a FP reduction w/o -ffast-math to any existing early break
> >>> testcase?  You can simply reduce x += 5. or so I think, so no loads
> >>> necessary.
> >>
> >> You mean like this? https://godbolt.org/z/4jbKx7j5a
> >>
> >> At first glance that looks correct to me: the early exits use ret_12
> >> and the main exit uses ret_6.
> >>
> >> So that's handled correctly.
> > 
> > Indeed.  So I wonder what goes wrong in the uncounted case - and
> > possibly the peeled case with early break.
> > 
> > So I echo Tamar then, Victor, do you have a testcase that shows what
> > goes wrong?
> > 
> > Thanks,
> > Richard.
> 
> 
> 
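
For reference, a minimal sketch of the kind of testcase suggested in the
quoted exchange: an in-order FP reduction (so no -ffast-math) inside an
early-break loop, reducing a constant so no extra loads are needed.  This
is not the godbolt example linked above; the array, bound and function
name are made up for illustration.

```c
#define N 1024

/* Hypothetical input data; zero-initialised, the caller fills it in.  */
int vect_a[N];

/* Double reduction with an early exit.  Without -ffast-math the
   additions must stay in source order.  */
double
early_break_fp_sum (int x)
{
  double ret = 0.0;
  for (int i = 0; i < N; i++)
    {
      if (vect_a[i] == x)
        break;               /* early exit */
      ret += 5.;             /* in-order FP reduction */
    }
  return ret;
}
```

With vect_a[10] set to 7, early_break_fp_sum (7) adds 5.0 ten times
before leaving through the early exit, while a value that never occurs
runs to the main exit.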

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
