https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
> On 13.08.2025 at 17:31, tnfchris at gcc dot gnu.org
> <gcc-bugzi...@gcc.gnu.org> wrote:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
>
> --- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #6)
>>> On Wed, 13 Aug 2025, tnfchris at gcc dot gnu.org wrote:
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
>>>
>>> --- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
>>> In gimple that's
>>>
>>>   <bb 10> [local count: 108459]:
>>>   x_22 = a[0];
>>>   _69 = {x_22, x_22, x_22, x_22};
>>>
>>>   <bb 4> [local count: 10737416]:
>>>   # ivtmp_83 = PHI <ivtmp_84(11), 0(10)>
>>>
>>>   <bb 5> [local count: 1063004408]:
>>>   # i_43 = PHI <i_24(12), 0(4)>
>>>   # ivtmp_34 = PHI <ivtmp_33(12), 32000(4)>
>>>   # vect_x_36.8_70 = PHI <vect_x_9.10_72(12), _69(4)>
>>>   # vect_vec_iv_.12_76 = PHI <_77(12), { 0, 0, 0, 0 }(4)>
>>>   # vect_index_39.13_78 = PHI <vect_index_12.14_79(12), { 0, 0, 0, 0 }(4)>
>>>   _4 = a[i_43];
>>>   vect_cst__68 = {_4, _4, _4, _4};
>>>   mask__16.9_71 = vect_cst__68 > vect_x_36.8_70;
>>>   vect_index_12.14_79 = VEC_COND_EXPR <mask__16.9_71, vect_vec_iv_.12_76, vect_index_39.13_78>;
>>>   vect_x_9.10_72 = VEC_COND_EXPR <mask__16.9_71, vect_cst__68, vect_x_36.8_70>;
>>>   i_24 = i_43 + 1;
>>>   ivtmp_33 = ivtmp_34 - 1;
>>>   _77 = vect_vec_iv_.12_76 + { 1, 1, 1, 1 };
>>>   if (ivtmp_33 != 0)
>>>     goto <bb 12>; [98.99%]
>>>   else
>>>     goto <bb 8>; [1.01%]
>>>
>>> The SLP tree seems to mostly be working on lanes of externals:
>>>
>>> note: Vectorizing SLP tree:
>>> note: node 0x42900120 (max_nunits=4, refcnt=1) vector(4) float
>>> note: op template: x_41 = PHI <x_9(5)>
>>> note: [l] stmt 0 x_41 = PHI <x_9(5)>
>>> note: children 0x429001c8
>>> note: node 0x429001c8 (max_nunits=4, refcnt=2) vector(4) float
>>> note: op template: x_9 = _16 ? _4 : x_36;
>>> note: stmt 0 x_9 = _16 ? _4 : x_36;
>>> note: children 0x42900270 0x42900318 0x429003c0
>>> note: node 0x42900270 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32>
>>> note: op template: _16 = _4 > x_36;
>>> note: stmt 0 _16 = _4 > x_36;
>>> note: children 0x42900318 0x429003c0
>>> note: node 0x42900318 (max_nunits=4, refcnt=2) vector(4) float
>>> note: op template: _4 = a[i_43];
>>> note: stmt 0 _4 = a[i_43];
>>> note: node 0x429003c0 (max_nunits=4, refcnt=2) vector(4) float
>>> note: op template: x_36 = PHI <x_9(12), x_22(4)>
>>> note: stmt 0 x_36 = PHI <x_9(12), x_22(4)>
>>> note: children 0x429001c8 0x42900468
>>> note: node (external) 0x42900468 (max_nunits=1, refcnt=1) vector(4) float
>>> note: { x_22 }
>>>
>>> It also looks like we missed simplifying a > b ? a : b into just a max.
>>>
>>> Before, we failed during analysis in the block that was removed:
>>>
>>> missed: Unsupported loop-closed phi in outer-loop.
>>> missed: bad operation or unsupported loop bound
>>>
>>> and now it's a costing issue, as it's an inner loop.
>>>
>>> You can reduce it down to
>>>
>>> #define iterations 100000
>>> #define LEN_1D 32000
>>>
>>> float a[LEN_1D];
>>>
>>> int main()
>>> {
>>>   float x;
>>>   for (int nl = 0; nl < iterations; nl++) {
>>>     x = a[0];
>>>     for (int i = 0; i < LEN_1D; ++i) {
>>>       if (a[i] > x) {
>>>         x = a[i];
>>>       }
>>>     }
>>>   }
>>>
>>>   return x > 1;
>>> }
>>>
>>> It looks like the access of a[0] in the outer loop is making it treat the
>>> inner loop as only being able to access one element at a time.
>>
>> Outer loop vectorization basically executes the inner loop in "scalar"
>> but N outer loop iterations at the same time. Since the outer
>> loop iteration is completely pointless, this is expected. So,
>> I'd say it's a missed optimization that we do not elide the outer
>> loop completely. Note the benchmark should now execute iterations/4
>> outer loop iterations only. So if the inner loop is now slower than
>> the scalar inner loop then it's a costing issue.
>>
>
> But which costing though? Isn't this a case where there's no architecture
> where this would be beneficial? All the lanes in the vector computation are
> exactly the same...

Yes, but we still reduce the number of outer loop iterations. So the
vectorized inner loop must be 4 times slower for this not to be beneficial.

> It's basically working on a series of splats because the value in the outer
> loop is invariant.
>
> I agree that it's a missed optimization that we didn't elide the outer loop
> entirely, and that seems to be the general theme across all of these. Though
> I wonder, are these reduced or are these the exact loops in tsvc?
>
> Where would we elide the outer loops?

We do not have a pass eliding ‘invariant’ loops in nests. Nobody would
write such code and we’d ‘break’ benchmarks. This is simply a case of a
badly written benchmark…

>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
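
To illustrate the point about the outer loop being redundant, here is a minimal C sketch based on the reduced test case from comment #5. It is not part of the bug report or of GCC. The helper names (fill_input, max_once, max_repeated) and the input pattern are assumptions made up for this illustration. It shows that every outer iteration recomputes the same max reduction, so running the inner loop once yields the same x as running it `iterations` times.

```c
#define LEN_1D 32000

static float a[LEN_1D];

/* Hypothetical helper: fill a[] with a deterministic, NaN-free pattern.
   The pattern itself is an arbitrary choice for this sketch. */
static void fill_input(void)
{
    for (int i = 0; i < LEN_1D; ++i)
        a[i] = (float)((i * 7919) % 1000);
}

/* Inner loop of the reduced test case: a max reduction seeded with a[0],
   written in the same select form (a > b ? a : b) as in the GIMPLE dump. */
static float max_once(void)
{
    float x = a[0];
    for (int i = 0; i < LEN_1D; ++i)
        x = a[i] > x ? a[i] : x;
    return x;
}

/* The same reduction wrapped in the benchmark-style repeat loop.
   Each outer iteration reseeds x from a[0] and recomputes the same value,
   so nothing is carried from one outer iteration to the next. */
static float max_repeated(int iterations)
{
    float x = 0.0f;
    for (int nl = 0; nl < iterations; nl++)
        x = max_once();
    return x;
}
```

Calling fill_input() and then comparing max_once() against max_repeated(n) for any n >= 1 should give equal values, which is the property a hypothetical pass eliding the outer loop would rely on. As noted in the thread, GCC has no such pass.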