[Bug tree-optimization/102750] 433.milc regressed by 10% on AMD zen2 at -Ofast -march=native -flto after r12-3893-g6390c5047adb75

rguenth at gcc dot gnu.org via Gcc-bugs Fri, 15 Oct 2021 05:49:59 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102750


--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 51609
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51609&action=edit
testcase

This testcase reproduces the vectorization difference with REV and REV^ when
using -Ofast -march=znver2 [-fno-early-inlining].  Trunk shows similar behavior
still.

The issue is likely code like

  vect__325.61_64 = MEM <vector(2) double> [(double *)_80];
  vect__325.62_60 = MEM <vector(2) double> [(double *)_80 + 16B];
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.62_60, 64, 0>;

where it would have been better to emit two scalar loads and combine them
with a CTOR (or in asm use movsd + movhpd).  But we end up with

.L5:
        vmovupd (%rcx), %xmm3
        vmovupd 16(%rcx), %xmm2
...
        addq    $48, %rcx
...
        vunpckhpd       %xmm2, %xmm3, %xmm7
...
        vfnmadd132sd    -48(%rcx), %xmm8, %xmm15

also scattered around (but that's not GIMPLE's fault).
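
For illustration, the two strategies can be sketched with SSE2 intrinsics.
This is only a sketch, not the testcase: the function names are made up, and
which instructions a compiler selects for these intrinsics is not guaranteed.
The first function mirrors the GIMPLE above (two vector loads plus a { 0, 2 }
permute), the second the suggested scalar loads combined into one vector.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, x86 only */

/* Hypothetical helper: two 16-byte loads plus a permute that selects
   lane 0 of each, i.e. VEC_PERM_EXPR <v0, v1, { 0, 2 }>.  */
static __m128d
gather_via_vector_loads (const double *p)
{
  __m128d v0 = _mm_loadu_pd (p);      /* { p[0], p[1] } */
  __m128d v1 = _mm_loadu_pd (p + 2);  /* { p[2], p[3] } */
  return _mm_unpacklo_pd (v0, v1);    /* { p[0], p[2] } */
}

/* Hypothetical helper: two 8-byte loads combined into one vector,
   the movsd + movhpd shape suggested above.  */
static __m128d
gather_via_scalar_loads (const double *p)
{
  __m128d lo = _mm_load_sd (p);       /* { p[0], 0.0 } */
  return _mm_loadh_pd (lo, p + 2);    /* { p[0], p[2] } */
}
```

Both return { p[0], p[2] }; the second reads 16 bytes instead of 32 and needs
no separate permute.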

The vectorizer generates

  vectp.60_65 = &_6->c[0].real;
  vect__325.61_64 = MEM <vector(2) double> [(double *)vectp.60_65];
  vectp.60_61 = vectp.60_65 + 16;
  vect__325.62_60 = MEM <vector(2) double> [(double *)vectp.60_61];
  vectp.60_57 = vectp.60_65 + 32;
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.64_53, 64, 64>;

which is of course entirely reasonable in some sense (that's a cost of
12 + 12 + 4 + 4 = 32; two scalar loads plus a CTOR would cost
12 + 12 + 8 = 32 as well, and we'd likely still generate and cost the scalar
extract on top).  Possibly it's better to pattern-match the permute and
demote the vector loads to scalar...

We also end up with

  vect__259.20_179 = .FNMA (_182, vect__326.17_185, _196);
  vect__368.19_180 = .FMA (_182, vect__326.17_185, _196);
  _178 = VEC_PERM_EXPR <vect__368.19_180, vect__259.20_179, { 0, 5, 2, 7 }>;

and unhandled

add_force_to_mom.c:72:60: note:   node 0x3ed37e0 (max_nunits=4, refcnt=2)
add_force_to_mom.c:72:60: note:   op template: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 0 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 1 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 2 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:         stmt 3 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:         load permutation { 2 2 4 4 }
....
add_force_to_mom.c:72:60: note:   ==> examining statement: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: missed:   BB vectorization with gaps at the end of a load is not supported
add_force_to_mom.c:53:16: missed:   not vectorized: relevant stmt not supported: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:   Building vector operands of 0x3ed37e0 from scalars instead
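
To read the { 0, 5, 2, 7 } permute above: a VEC_PERM_EXPR selector indexes
into the concatenation of its two inputs, so lanes 0 and 2 come from the .FMA
result and lanes 1 and 3 from the .FNMA result.  A minimal scalar model of
that selection rule follows; the function name and the use of plain arrays
are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical scalar model of VEC_PERM_EXPR <a, b, sel> on n-lane
   vectors: each selector is taken modulo 2*n; values below n pick
   from a, the rest pick from b.  */
static void
vec_perm_model (const double *a, const double *b,
                const unsigned *sel, size_t n, double *out)
{
  for (size_t i = 0; i < n; i++)
    {
      size_t s = sel[i] % (2 * n);
      out[i] = s < n ? a[s] : b[s - n];
    }
}
```

With n = 4 and sel = { 0, 5, 2, 7 } this yields { a[0], b[1], a[2], b[3] };
with n = 2 and sel = { 0, 2 } it yields { a[0], b[0] }, the load permute
discussed earlier.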
