https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116950
Bug ID: 116950
Summary: IVopts missed unification of duplicate IVs
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---

Consider the following testcase extracted from a library:

#include <stdint.h>
#include <arm_sve.h>

void func1(uint32_t n, const float32_t *a, float32_t *c) {
    uint32_t num_lanes = svcntw();
    svbool_t pg = svptrue_b32();
    uint32_t i = 0;
    for (; (i + 4) * num_lanes <= n; i += 4) {
        svfloat32_t vec_a = svld1_vnum_f32(pg, a, i);
        svfloat32_t vec_c = svmul_f32_x(pg, vec_a, vec_a);
        svst1_vnum_f32(pg, c, i, vec_c);
    }
    return;
}

void func2(uint32_t n, const float32_t *a, float32_t *c) {
    uint32_t num_lanes = svcntw();
    svbool_t pg = svptrue_b32();
    int32_t i = 0;
    for (; (i + 4) * num_lanes <= n; i += 4) {
        svfloat32_t vec_a = svld1_vnum_f32(pg, a, i);
        svfloat32_t vec_c = svmul_f32_x(pg, vec_a, vec_a);
        svst1_vnum_f32(pg, c, i, vec_c);
    }
    return;
}

compiled with -O3 -march=armv9-a. The two functions differ only in the signedness of i, the loop IV. The first one is more canonical, as it avoids comparing signed to unsigned values.

However, the first loop produces quite bad code:

.L3:
        uxtw    x3, w6
        cmp     w0, w5
        add     w6, w6, 4
        incb    x5
        mul     x3, x3, x8
        add     x7, x1, x3
        add     x3, x2, x3
        ld1w    z31.s, p7/z, [x7]
        fmul    z31.s, z31.s, z31.s
        st1w    z31.s, p7, [x3]
        bcs     .L3

while the second one gives:

.L3:
        ld1w    z31.s, p7/z, [x1, x3, lsl 2]
        fmul    z31.s, z31.s, z31.s
        st1w    z31.s, p7, [x2, x3, lsl 2]
        incb    x3
        addvl   x4, x3, #1
        cmp     w0, w4
        bcs     .L3

This leads to up to a 40% performance difference between the two loops.

It seems that in the first case IVopts doesn't merge the two IVs. E.g. the first one has as input to IVopts:

<bb 3> [local count: 955630224]:
# _21 = PHI <_1(6), 4(5)>
# i_22 = PHI <_21(6), 0(5)>
...
_1 = _21 + 4;
_2 = _1 * POLY_INT_CST [4, 4];
if (_2 <= n_6(D))
  goto <bb 6>; [89.00%]

and the second one:

<bb 3> [local count: 955630224]:
# _23 = PHI <_1(6), 4(5)>
# i_24 = PHI <_23(6), 0(5)>
...
_1 = _23 + 4;
_2 = (unsigned int) _1;
_3 = _2 * POLY_INT_CST [4, 4];
if (_3 <= n_7(D))
  goto <bb 6>; [89.00%]

I'm not sure if this is the exact cause, but for the first one niters seems to fail:

;;
;; Loop 1
;;  header 3, latch 6
;;  depth 1, outer 0
;;  niter scev_not_known
;;  iterations by profile: 8.090909 (unreliable, maybe flat) entry count:105119324 (estimated locally, freq 0.8900)
;;  nodes: 3 6
Processing loop 1 at /app/example.c:9
  single exit 3 -> 4, exit condition if (_2 <= n_6(D))

and it thinks the IVs can overflow wrt niters:

IV struct:
  SSA_NAME:     i_22
  Type: uint32_t
  Base: 0
  Step: 4
  Biv:  N
  Overflowness wrto loop niter: Overflow

etc.

I guess this is because, in the unsigned case, if n is large enough the loop may not terminate at all due to wraparound, but -ffinite-loops doesn't seem to help. The signed case does exhibit the same behavior when -fwrapv is used to define signed overflow as wrapping.

But the signed case gives:

;;
;; Loop 1
;;  header 3, latch 6
;;  depth 1, outer 0, finite_p
;;  niter scev_not_known
;;  upper_bound 536870909
;;  likely_upper_bound 536870909
;;  iterations by profile: 8.090909 (unreliable, maybe flat) entry count:105119324 (estimated locally, freq 0.8900)
;;  nodes: 3 6
Processing loop 1 at /app/example.c:23
  single exit 3 -> 4, exit condition if (_3 <= n_7(D))

IV struct:
  SSA_NAME:     i_24
  Type: int32_t
  Base: 0
  Step: 4
  Biv:  N
  Overflowness wrto loop niter: No-overflow

So the reason for this bug report is to see whether we can do anything about it. With -ffinite-loops, could we not assume that such loops terminate and are bounded by UINT_MAX?

P.S. Even though -ffinite-loops seems to be the default at -O2, adding -ffinite-loops explicitly does seem to toggle something in niters.
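
To illustrate the wraparound guess for the unsigned case, here is a minimal sketch in plain C that needs no SVE target. It is not taken from the library: the helper name unsigned_loop_exits and the max_iters cap are made up for illustration, and num_lanes is an ordinary integer standing in for svcntw(). It only models the exit test of func1, not what niters actually computes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of the testcase): evaluates the exit
   test of func1 with num_lanes standing in for svcntw(). It gives up
   after max_iters iterations so a non-terminating loop can be observed.
   Returns true if the loop exited by itself; *iters receives the number
   of iterations that were executed. */
bool unsigned_loop_exits(uint32_t n, uint32_t num_lanes,
                         uint64_t max_iters, uint64_t *iters)
{
    assert(num_lanes > 0);

    uint32_t i = 0;
    uint64_t count = 0;

    /* Same condition as func1. The arithmetic is done in uint32_t, so
       both i + 4 and the product wrap modulo 2^32. */
    while ((i + 4) * num_lanes <= n) {
        if (count == max_iters) {
            *iters = count;   /* cap reached, loop still running */
            return false;
        }
        i += 4;
        count++;
    }

    *iters = count;
    return true;
}
```

With n = 100 and num_lanes = 4 the loop exits after 6 iterations. With n = UINT32_MAX the comparison can never be false, because every uint32_t value is <= UINT32_MAX, so the helper reaches whatever cap it is given. That matches the suspicion that niters cannot prove termination for the unsigned variant without extra assumptions.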