Sorry about the delayed reply; I've been obsessing over one last snag
in the peeling-for-alignment implementation.
The reduction work came out of trying to fix a regression I'd introduced
in the libstdc++-v3 unit tests while implementing these patches.
I reduced the unit test in question and I'm attaching that at the end of
this message.
Ironically, in writing up different testcases in plain C for
gcc.dg/vect, I've since discovered I fail to vectorize far simpler
reduction cases.
For the trivial case of
  sum = 0;
  while (1)
    {
      if (a[i] == 0) break;
      sum += a[i];
      i++;
    }
we get the following CFG:
  <bb 2> [local count: 118111600]:
  _17 = *a_8(D);
  if (_17 == 0)
    goto <bb 7>; [11.00%]
  else
    goto <bb 5>; [89.00%]

  <bb 5> [local count: 105119324]:

  <bb 3> [local count: 955630224]:
  # _18 = PHI <_4(6), _17(5)>
  # sum_19 = PHI <sum_10(6), sum_7(D)(5)>
  # i_21 = PHI <i_11(6), 0(5)>
  sum_10 = _18 + sum_19;
  i_11 = i_21 + 1;
  _1 = (long unsigned int) i_11;
  _2 = _1 * 4;
  _3 = a_8(D) + _2;
  _4 = *_3;
  if (_4 == 0)
    goto <bb 8>; [11.00%]
  else
    goto <bb 6>; [89.00%]
and the vectorizer doesn't quite know how to handle the PHI <_4(6),
_17(5)>, categorized as `vect_unknown_def_type', so I'll figure out what
to do about that.
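A self-contained form of the loop above, usable as a starting point for
a gcc.dg/vect test, could look like the following sketch.  The function
name sum_until_zero is hypothetical, and the sketch assumes the caller
guarantees that the array contains a zero terminator.

```c
#include <stddef.h>

/* Sum the elements of A up to, but not including, the first zero.
   Assumption: A always contains a zero terminator, so the loop's only
   exit is the early break.  */
int
sum_until_zero (const int *a)
{
  int sum = 0;
  size_t i = 0;
  while (1)
    {
      if (a[i] == 0)
        break;
      sum += a[i];
      i++;
    }
  return sum;
}
```

The loop has no iteration count known up front, so it is an uncounted
loop whose single exit is data-dependent.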
Now for the promised testcase...
#include <numeric>
#include <iterator>
#include <utility>
#include <cassert>

int a[] = {4, 5, 6, 7, 8, 9, 10, 11};
double b[] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
int N = 8;

template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
_Tp
transform_reduce_a(_InputIterator1 a0, _InputIterator1 aN,
                   _InputIterator2 b0, _Tp accum)
{
  while ((aN - a0) >= 4)
    {
      _Tp __v1 = (a0[0] * b0[0]) + (a0[1] * b0[1]);
      _Tp __v2 = (a0[2] * b0[2]) + (a0[3] * b0[3]);
      _Tp __v3 = (__v1 + __v2);
      accum = (accum + __v3);
      a0 += 4;
      b0 += 4;
    }
  for (; a0 != aN; ++a0, (void) ++b0)
    accum = (accum + (*a0 * *b0));
  return accum;
}

void
test01()
{
  auto res = transform_reduce_a(std::begin(a), std::end(a), std::begin(b),
                                std::move (1.0f));
  assert( res == (float)(1 + 30) );
}

int
main()
{
  test01();
}
Many thanks,
Victor
On 11/11/25 13:59, Richard Biener wrote:
On Tue, 11 Nov 2025, Tamar Christina wrote:
-----Original Message-----
From: Richard Biener <[email protected]>
Sent: 11 November 2025 12:59
To: Tamar Christina <[email protected]>
Cc: Victor Do Nascimento <[email protected]>; gcc-
[email protected]
Subject: RE: [PATCH 08/13] vect: Reclassify early break fold left reductions as simple reductions
On Tue, 11 Nov 2025, Tamar Christina wrote:
-----Original Message-----
From: Richard Biener <[email protected]>
Sent: 11 November 2025 12:16
To: Victor Do Nascimento <[email protected]>
Cc: [email protected]; Tamar Christina
<[email protected]>;
Victor Do Nascimento <[email protected]
1.compute.internal>
Subject: Re: [PATCH 08/13] vect: Reclassify early break fold left reductions as simple reductions
On Mon, 10 Nov 2025, Victor Do Nascimento wrote:
From: Victor Do Nascimento <[email protected]
1.compute.internal>
This re-categorization of reductions for uncounted loops involving
reductions leads to the correct calling of
`vect_create_epilog_for_reduction' function.
gcc/ChangeLog:
* tree-vect-loop.cc (vectorizable_reduction): Reclassify
uncounted-loop VECT_REDUC_INFO_TYPE as
TREE_CODE_REDUCTION.
---
gcc/tree-vect-loop.cc | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 901903cfbea..3b038169c95 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7426,8 +7426,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
"supported.\n");
return false;
}
- VECT_REDUC_INFO_TYPE (reduc_info)
- = reduction_type = FOLD_LEFT_REDUCTION;
+ VECT_REDUC_INFO_TYPE (reduc_info) = reduction_type
+ = LOOP_VINFO_NITERS_UNCOUNTED_P (loop_vinfo) ? TREE_CODE_REDUCTION
+ : FOLD_LEFT_REDUCTION;
I don't think this is correct.  We've arrived here with a
needs_fold_left_reduction_p check, if we cannot use a
FOLD_LEFT_REDUCTION we have to fail.
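To illustrate why a reduction that needs fold-left handling cannot
simply be treated as a reassociable one: without -ffast-math, float
additions have to happen in source order, and summing in a different
order can change the result.  The sketch below is a standalone
illustration, not GCC code; the helper names sum_in_order and
sum_two_lanes are hypothetical.

```c
#include <stddef.h>

/* In-order (fold-left) sum: (((0 + a[0]) + a[1]) + a[2]) + ...  */
float
sum_in_order (const float *a, size_t n)
{
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)
    acc += a[i];
  return acc;
}

/* Reassociated sum mimicking a two-lane vector accumulator: even and
   odd elements are accumulated separately and combined at the end.
   Assumption: N is even.  */
float
sum_two_lanes (const float *a, size_t n)
{
  float lane0 = 0.0f;
  float lane1 = 0.0f;
  for (size_t i = 0; i + 1 < n; i += 2)
    {
      lane0 += a[i];
      lane1 += a[i + 1];
    }
  return lane0 + lane1;
}
```

With the input {1e8f, 1.0f, -1e8f, 1.0f}, and assuming a target that
evaluates float arithmetic in float precision (x86-64 with SSE, for
example), the two functions should disagree: 1e8f + 1.0f rounds back to
1e8f, so the in-order sum loses the first 1.0f while the two-lane sum
keeps it.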
That said, instead of vect_create_epilog_for_reduction this goes
through vectorize_fold_left_reduction which re-uses the original
scalar reduction PHI and thus any specific early-break handling would
need to go there.
I believe that if this is an issue with respect to re-starting then
that very same issue is present generally for early break vectorization.
Agree, I think vectorizable_reduction is missing support for reducing
from def 0.
Note that we mostly normally fail to analyse the reduction so we never
get here hence the missing support, so I'm somewhat surprised uncounted
loops did.
Is there a testcase that shows this?
Just add a FP reduction w/o -ffast-math to any existing early break
testcase? You can simply reduce x += 5. or so I think, so no loads
necessary.
You mean like this? https://godbolt.org/z/4jbKx7j5a
At first glance that looks correct to me, the early exits use ret_12 and the
main exit uses ret_6.
So that's handled correctly.
Indeed. So I wonder what goes wrong in the uncounted case - and
possibly the peeled case with early break.
So I echo Tamar then, Victor, do you have a testcase that shows what
goes wrong?
Thanks,
Richard.