https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116950

            Bug ID: 116950
           Summary: IVopts missed unification of duplicate IVs
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

Consider the following testcase, extracted from a library:

#include <stdint.h>
#include <arm_sve.h>

void func1(uint32_t n, const float32_t *a, float32_t *c) {
  uint32_t num_lanes = svcntw();
  svbool_t pg = svptrue_b32();
  uint32_t i = 0;
  for (; (i + 4) * num_lanes <= n; i += 4) {
    svfloat32_t vec_a = svld1_vnum_f32(pg, a, i);
    svfloat32_t vec_c = svmul_f32_x(pg, vec_a, vec_a);
    svst1_vnum_f32(pg, c, i, vec_c);
  }
  return;
}

void func2(uint32_t n, const float32_t *a, float32_t *c) {
  uint32_t num_lanes = svcntw();
  svbool_t pg = svptrue_b32();
  int32_t i = 0;
  for (; (i + 4) * num_lanes <= n; i += 4) {
    svfloat32_t vec_a = svld1_vnum_f32(pg, a, i);
    svfloat32_t vec_c = svmul_f32_x(pg, vec_a, vec_a);
    svst1_vnum_f32(pg, c, i, vec_c);
  }
  return;
}

and compiled with -O3 -march=armv9-a.

They differ only in the signedness of i, the loop IV.  The first one is more
canonical, as it avoids comparing signed to unsigned values.

However the first loop produces quite bad code:

.L3:
        uxtw    x3, w6
        cmp     w0, w5
        add     w6, w6, 4
        incb    x5
        mul     x3, x3, x8
        add     x7, x1, x3
        add     x3, x2, x3
        ld1w    z31.s, p7/z, [x7]
        fmul    z31.s, z31.s, z31.s
        st1w    z31.s, p7, [x3]
        bcs     .L3

while the second one:

.L3:
        ld1w    z31.s, p7/z, [x1, x3, lsl 2]
        fmul    z31.s, z31.s, z31.s
        st1w    z31.s, p7, [x2, x3, lsl 2]
        incb    x3
        addvl   x4, x3, #1
        cmp     w0, w4
        bcs     .L3

This leads to up to a 40% performance difference between the two loops.

It seems that in the first case IVopts doesn't merge the two IVs.  E.g. the
first one has as input to IVopts:

  <bb 3> [local count: 955630224]:
  # _21 = PHI <_1(6), 4(5)>
  # i_22 = PHI <_21(6), 0(5)>
...
  _1 = _21 + 4;
  _2 = _1 * POLY_INT_CST [4, 4];
  if (_2 <= n_6(D))
    goto <bb 6>; [89.00%]

and second one:

  <bb 3> [local count: 955630224]:
  # _23 = PHI <_1(6), 4(5)>
  # i_24 = PHI <_23(6), 0(5)>
...
  _1 = _23 + 4;
  _2 = (unsigned int) _1;
  _3 = _2 * POLY_INT_CST [4, 4];
  if (_3 <= n_7(D))
    goto <bb 6>; [89.00%]

I'm not sure if this is the exact cause, but for the first one niters seems to
fail:

;;
;; Loop 1
;;  header 3, latch 6
;;  depth 1, outer 0
;;  niter scev_not_known
;;  iterations by profile: 8.090909 (unreliable, maybe flat) entry count:105119324 (estimated locally, freq 0.8900)
;;  nodes: 3 6
Processing loop 1 at /app/example.c:9
  single exit 3 -> 4, exit condition if (_2 <= n_6(D))

and thinks the IVs can overflow wrt niters:

IV struct:
  SSA_NAME:     i_22
  Type: uint32_t
  Base: 0
  Step: 4
  Biv:  N
  Overflowness wrto loop niter: Overflow

etc.

I guess this is because in the unsigned case, if n is large enough, the loop
may not terminate at all due to wrapping on overflow.  But -ffinite-loops
doesn't seem to help.  The signed case does exhibit the same behavior when
-fwrapv is used to define the overflow behavior.
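
To make the wrapping concern concrete, here is a minimal scalar sketch of
func1's loop control.  It is not part of the original testcase: the function
name count_iters_u32, the num_lanes parameter (standing in for svcntw()) and
the cap argument are all hypothetical, and a 32-bit int is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of func1's loop control (no SVE involved).
   num_lanes stands in for svcntw (); cap is an artificial iteration
   limit so that the model itself always returns.  */
static uint64_t
count_iters_u32 (uint32_t n, uint32_t num_lanes, uint64_t cap)
{
  assert (num_lanes > 0);
  uint64_t iters = 0;
  /* Same exit test as func1: it is evaluated in 32-bit unsigned
     arithmetic, so (i + 4) * num_lanes wraps modulo 2^32.  */
  for (uint32_t i = 0; (i + 4) * num_lanes <= n && iters < cap; i += 4)
    iters++;
  return iters;
}
```

With n = 64 and num_lanes = 4 the model exits after 4 iterations.  With
n = UINT32_MAX the exit test holds for every 32-bit value, so the model only
stops when it reaches the cap.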

The signed case, by contrast, gives:

;;
;; Loop 1
;;  header 3, latch 6
;;  depth 1, outer 0, finite_p
;;  niter scev_not_known
;;  upper_bound 536870909
;;  likely_upper_bound 536870909
;;  iterations by profile: 8.090909 (unreliable, maybe flat) entry count:105119324 (estimated locally, freq 0.8900)
;;  nodes: 3 6
Processing loop 1 at /app/example.c:23
  single exit 3 -> 4, exit condition if (_3 <= n_7(D))

IV struct:
  SSA_NAME:     i_24
  Type: int32_t
  Base: 0
  Step: 4
  Biv:  N
  Overflowness wrto loop niter: No-overflow

So the reason for this bug report is to see whether we can do anything about
it.  With -ffinite-loops, could we not assume that such loops terminate and are
bounded by UINT_MAX?

P.S. Even though -ffinite-loops seems to be the default at -O2, adding
-ffinite-loops explicitly does seem to toggle something in niters.
