[Bug tree-optimization/121290] [16 regression] Regressions in TSVC s119, s3113, s312, s313, s314, s315, s316 since r16-2159-g3bf2aa834e1270

rguenther at suse dot de via Gcc-bugs Wed, 13 Aug 2025 12:11:19 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290


--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
> On 13.08.2025 at 17:31, tnfchris at gcc dot gnu.org <gcc-bugzi...@gcc.gnu.org> wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
> 
> --- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #6)
>>> On Wed, 13 Aug 2025, tnfchris at gcc dot gnu.org wrote:
>>> 
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
>>> 
>>> --- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
>>> In gimple that's
>>> 
>>>  <bb 10> [local count: 108459]:
>>>  x_22 = a[0];
>>>  _69 = {x_22, x_22, x_22, x_22};
>>> 
>>>  <bb 4> [local count: 10737416]:
>>>  # ivtmp_83 = PHI <ivtmp_84(11), 0(10)>
>>> 
>>>  <bb 5> [local count: 1063004408]:
>>>  # i_43 = PHI <i_24(12), 0(4)>
>>>  # ivtmp_34 = PHI <ivtmp_33(12), 32000(4)>
>>>  # vect_x_36.8_70 = PHI <vect_x_9.10_72(12), _69(4)>
>>>  # vect_vec_iv_.12_76 = PHI <_77(12), { 0, 0, 0, 0 }(4)>
>>>  # vect_index_39.13_78 = PHI <vect_index_12.14_79(12), { 0, 0, 0, 0 }(4)>
>>>  _4 = a[i_43];
>>>  vect_cst__68 = {_4, _4, _4, _4};
>>>  mask__16.9_71 = vect_cst__68 > vect_x_36.8_70;
>>>  vect_index_12.14_79 = VEC_COND_EXPR <mask__16.9_71, vect_vec_iv_.12_76, vect_index_39.13_78>;
>>>  vect_x_9.10_72 = VEC_COND_EXPR <mask__16.9_71, vect_cst__68, vect_x_36.8_70>;
>>>  i_24 = i_43 + 1;
>>>  ivtmp_33 = ivtmp_34 - 1;
>>>  _77 = vect_vec_iv_.12_76 + { 1, 1, 1, 1 };
>>>  if (ivtmp_33 != 0)
>>>    goto <bb 12>; [98.99%]
>>>  else
>>>    goto <bb 8>; [1.01%]
>>> 
>>> The SLP tree seems to mostly be working on lanes of externals:
>>> 
>>> note:   Vectorizing SLP tree:
>>> note:   node 0x42900120 (max_nunits=4, refcnt=1) vector(4) float
>>> note:   op template: x_41 = PHI <x_9(5)>
>>> note:           [l] stmt 0 x_41 = PHI <x_9(5)>
>>> note:           children 0x429001c8
>>> note:   node 0x429001c8 (max_nunits=4, refcnt=2) vector(4) float
>>> note:   op template: x_9 = _16 ? _4 : x_36;
>>> note:           stmt 0 x_9 = _16 ? _4 : x_36;
>>> note:           children 0x42900270 0x42900318 0x429003c0
>>> note:   node 0x42900270 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32>
>>> note:   op template: _16 = _4 > x_36;
>>> note:           stmt 0 _16 = _4 > x_36;
>>> note:           children 0x42900318 0x429003c0
>>> note:   node 0x42900318 (max_nunits=4, refcnt=2) vector(4) float
>>> note:   op template: _4 = a[i_43];
>>> note:           stmt 0 _4 = a[i_43];
>>> note:   node 0x429003c0 (max_nunits=4, refcnt=2) vector(4) float
>>> note:   op template: x_36 = PHI <x_9(12), x_22(4)>
>>> note:           stmt 0 x_36 = PHI <x_9(12), x_22(4)>
>>> note:           children 0x429001c8 0x42900468
>>> note:   node (external) 0x42900468 (max_nunits=1, refcnt=1) vector(4) float
>>> note:           { x_22 }
>>> 
>>> It also looks like we missed simplifying a > b ? a : b into just a max.
>>> 
>>> Before we failed during analysis in the block that was removed:
>>> 
>>> missed:   Unsupported loop-closed phi in outer-loop.
>>> missed:  bad operation or unsupported loop bound
>>> 
>>> and now it's a costing issue, as it's an inner loop.
>>> 
>>> You can reduce it down to
>>> 
>>> #define iterations 100000
>>> #define LEN_1D 32000
>>> 
>>> float a[LEN_1D];
>>> 
>>> int main()
>>> {
>>>    float x;
>>>    for (int nl = 0; nl < iterations; nl++) {
>>>        x = a[0];
>>>        for (int i = 0; i < LEN_1D; ++i) {
>>>            if (a[i] > x) {
>>>                x = a[i];
>>>            }
>>>        }
>>>    }
>>> 
>>>    return x > 1;
>>> }
>>> 
>>> It looks like the access of a[0] in the outer loop is making it treat the
>>> inner loop as only being able to access one element at a time.
>> 
>> Outer loop vectorization basically executes the inner loop in "scalar"
>> but N outer loop iterations at the same time.  Since the outer
>> loop iteration is completely pointless this is expected.  So,
>> I'd say it's a missed optimization that we do not elide the outer
>> loop completely.  Note the benchmark should now execute iterations/4
>> outer loop iterations only.  So if the inner loop is now slower than
>> the scalar inner loop then it's a costing issue.
>> 
> 
> But which costing though? Isn't this a case where there's no architecture where
> this would be beneficial? All the lanes in the vector computation are exactly the
> same...

Yes, but we still reduce the number of outer loop iterations by a factor of 4.
So the vectorized inner loop must be more than 4 times slower than the scalar
one for this not to be beneficial.
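To make the trade-off concrete, here is a rough C model of what outer-loop vectorization does to the reduced test case. It is only a sketch: VF = 4 is assumed from the vector(4) float types in the dump, the lanes are modelled with a plain array rather than real vector code, and the function names are made up for illustration.

```c
#define LEN_1D 32000
#define VF 4 /* assumed vectorization factor, as in vector(4) float */

static float a[LEN_1D];

/* Scalar reference: one outer iteration per trip. */
static float scalar_version(int iterations)
{
    float x = 0.0f;
    for (int nl = 0; nl < iterations; nl++) {
        x = a[0];
        for (int i = 0; i < LEN_1D; ++i)
            if (a[i] > x)
                x = a[i];
    }
    return x;
}

/* Model of outer-loop vectorization: each trip covers VF outer iterations,
   one per lane, while the inner loop still walks a[] one element at a time.
   Assumes iterations is a multiple of VF (no epilogue modelled). */
static float outer_vectorized_model(int iterations)
{
    float x[VF] = { 0.0f };
    for (int nl = 0; nl < iterations; nl += VF) {
        for (int l = 0; l < VF; l++)
            x[l] = a[0];                  /* splat of a[0], like _69 */
        for (int i = 0; i < LEN_1D; ++i) {
            float splat = a[i];           /* like vect_cst__68 */
            for (int l = 0; l < VF; l++)  /* compare + select per lane */
                x[l] = splat > x[l] ? splat : x[l];
        }
    }
    return x[VF - 1];                     /* result of the last outer iteration */
}
```

The model runs iterations/VF outer trips, and every lane holds the same value throughout, so the only gain is the reduced outer trip count.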

> It's basically working on a series of splats because the value in the outer
> loop is invariant.
> 
> I agree that it's a missed optimization that we didn't elide the outer loop
> entirely, and that seems to be the general theme over all of these. Though I
> wonder, are these reduced or are these the exact loops in TSVC?
> 
> Where would we elide the outer loops?

We do not have a pass eliding 'invariant' loops in nests.  Nobody would write
such code and we'd 'break' benchmarks.  This is simply a case of a badly
written benchmark…
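For illustration only, this is what "eliding" the outer loop would amount to for the reduced test case. Nothing in a[] changes between outer iterations and x is reset to a[0] each time, so a single pass of the inner loop gives the same result as long as the outer loop runs at least once. The function names are hypothetical, and this does not describe any existing GCC pass.

```c
#define LEN_1D 32000

static float arr[LEN_1D];

/* The benchmark shape: the outer loop only repeats the same work. */
static float max_repeated(int iterations)
{
    float x = 0.0f;
    for (int nl = 0; nl < iterations; nl++) {
        x = arr[0];
        for (int i = 0; i < LEN_1D; ++i)
            if (arr[i] > x)
                x = arr[i];
    }
    return x;
}

/* Hand-elided form: one pass of the inner loop.
   Equivalent to max_repeated(n) only for n >= 1. */
static float max_once(void)
{
    float x = arr[0];
    for (int i = 0; i < LEN_1D; ++i)
        if (arr[i] > x)
            x = arr[i];
    return x;
}
```

With such a transform the benchmark's running time would no longer scale with the iteration count, which is the sense in which it would 'break' the benchmark.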

> 
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
