https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
            Summary|[14/15 regression] floating |[14/15 Regression] floating
                   |point vector regression,    |point vector regression,
                   |x86, between gcc 14 and     |x86, between gcc 14 and
                   |gcc-13 using -O3 and target |gcc-13 using -O3 and target
                   |clones on skylake platforms |clones on skylake platforms
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-05-10
             Target|x86_64                      |x86_64-*-*
   Target Milestone|---                         |14.2

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can't reproduce a slowdown on a Zen2 CPU.  The difference seems to be merely
instruction scheduling.  I do note we're not doing a good job of handling

        for (i = 0; i < LOOPS_PER_CALL; i++) {
                r.v = r.v + add.v;
        }

where r.v and add.v are AVX512-sized vectors that have to be emulated with AVX
vectors.  We end up with

  r_v_lsm.48_48 = r.v;
  _11 = add.v;

  <bb 3> [local count: 1063004408]:
  # r_v_lsm.48_50 = PHI <_12(3), r_v_lsm.48_48(2)>
  # ivtmp_56 = PHI <ivtmp_55(3), 65536(2)>
  _16 = BIT_FIELD_REF <_11, 256, 0>;
  _37 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 0>;
  _29 = _16 + _37;
  _387 = BIT_FIELD_REF <_11, 256, 256>;
  _375 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 256>;
  _363 = _387 + _375;
  _12 = {_29, _363};
  ivtmp_55 = ivtmp_56 - 1;
  if (ivtmp_55 != 0)
    goto <bb 3>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 4> [local count: 10737416]:

after lowering from 512-bit to 256-bit vectors, and there is no pass that
would demote the 512-bit reduction value to two 256-bit ones.
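
At the source level, the form such a demoting pass would produce corresponds
roughly to the sketch below (illustrative only; the v8sf typedef, the
union-based splitting and the reduce_demoted name are not from the bug, they
just spell out the idea of keeping two 256-bit accumulators live across the
loop instead of re-extracting from and re-packing into a 512-bit value on
every iteration):

typedef float v8sf  __attribute__((vector_size(sizeof(float)*8)));  /* 256-bit */
typedef float v16sf __attribute__((vector_size(sizeof(float)*16))); /* 512-bit */

union u16sf { v16sf v; v8sf half[2]; }; /* view 512 bits as two 256-bit halves */

void reduce_demoted (union u16sf *r, const union u16sf *add, int n)
{
  /* Two independent 256-bit accumulators stay live across the loop.  */
  v8sf acc0 = r->half[0], acc1 = r->half[1];
  v8sf a0 = add->half[0], a1 = add->half[1];
  for (int i = 0; i < n; i++)
    {
      acc0 = acc0 + a0;
      acc1 = acc1 + a1;
    }
  /* Recombine into the 512-bit value only once, after the loop.  */
  r->half[0] = acc0;
  r->half[1] = acc1;
}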

There are also weird things going on in the target / on RTL.  A smaller
testcase illustrating the code-generation issue is:

typedef float v16sf __attribute__((vector_size(sizeof(float)*16))); /* 512-bit */

void foo (v16sf * __restrict r, v16sf *a, int n)
{
  for (int i = 0; i < n; ++i)
    *r = *r + *a;
}
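
For reference, something along these lines should show the lowering on the
small testcase (assuming gcc-13/gcc-14 binaries installed under those names;
plain -march=skylake lacks AVX512, so the 512-bit vector operations are
lowered to 256-bit ones, and -fdump-tree-optimized shows the lowered GIMPLE):

gcc-14 -O3 -march=skylake -fdump-tree-optimized -S t.c
gcc-13 -O3 -march=skylake -S t.c    # compare the generated assembly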

So, confirmed for the non-optimal code, but I don't see how it's a regression.
