[Bug target/117008] -march=native pessimization of 25% with bitset []

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 08 Oct 2024 00:57:14 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, GCC 14.2 no longer vectorizes this loop for me, likely because the use
of gather has been nerfed (for Intel).

With AVX2 we see

  <bb 3> [local count: 139586405]:
  # vect_total_21.15_24 = PHI <vect_total_11.24_41(3), { 0, 0, 0, 0, 0, 0, 0, 0 }(2)>
  # vect_vec_iv_.16_20 = PHI <_18(3), { 0, 1, 2, 3, 4, 5, 6, 7 }(2)>
  # ivtmp.27_27 = PHI <ivtmp.27_23(3), 0(2)>
  _18 = vect_vec_iv_.16_20 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  vect__17.17_5 = vect_vec_iv_.16_20 >> 5;
  vect__19.18_3 = vect_vec_iv_.16_20 & { 31, 31, 31, 31, 31, 31, 31, 31 };
  vect_30 = VIEW_CONVERT_EXPR<vector(8) int>(vect__17.17_5);
  vect_31 = __builtin_ia32_gathersiv8si ({ 0, 0, 0, 0, 0, 0, 0, 0 }, &values, vect_30, { -1, -1, -1, -1, -1, -1, -1, -1 }, 4);
  vect__13.19_32 = VIEW_CONVERT_EXPR<vector(8) long unsigned int>(vect_31);
  vect__14.20_34 = { 1, 1, 1, 1, 1, 1, 1, 1 } << vect__19.18_3;
  vect__15.21_35 = vect__13.19_32 & vect__14.20_34;
  mask__16.22_37 = vect__15.21_35 != { 0, 0, 0, 0, 0, 0, 0, 0 };
  _51 = VIEW_CONVERT_EXPR<vector(8) unsigned int>(mask__16.22_37);
  vect_total_11.24_41 = vect_total_21.15_24 - _51;
  ivtmp.27_23 = ivtmp.27_27 + 1;
  if (ivtmp.27_23 != 1600000)

so the first point is that we do not analyze the memory access pattern well:
the eight lanes hold consecutive indices, so after the shift by 5 they all
address the same word, yet we emit a gather. Then of course cost modeling
breaks down here as well.
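
To illustrate the access pattern, here is a small sketch (not taken from the
testcase; the function names are hypothetical) that computes the word indices
the eight lanes would feed to the gather in one vector iteration. It assumes,
as in the dump above, that the first lane holds a multiple of 8 and that the
word index is the bit index shifted right by 5.

```cpp
#include <array>

// Word indices that the eight lanes would feed to the gather in one vector
// iteration whose first lane holds `base` (a multiple of 8 in the dump).
std::array<unsigned, 8> gather_indices(unsigned base) {
    std::array<unsigned, 8> idx{};
    for (unsigned lane = 0; lane < 8; ++lane)
        idx[lane] = (base + lane) >> 5;  // same shift as vect__17.17_5
    return idx;
}

// True when every lane addresses the same word.
bool all_lanes_same_word(unsigned base) {
    const std::array<unsigned, 8> idx = gather_indices(base);
    for (unsigned lane = 1; lane < 8; ++lane)
        if (idx[lane] != idx[0])
            return false;
    return true;
}
```

For any base that is a multiple of 8 the eight indices agree, so under these
assumptions a single scalar load plus a broadcast would fetch the same data
as the gather.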

The scalar IL is

  <bb 3> [local count: 1063004409]:
  # total_21 = PHI <total_11(5), 0(2)>
  # index_23 = PHI <index_12(5), 0(2)>
  # ivtmp_27 = PHI <ivtmp_26(5), 12800000(2)>
  _17 = index_23 >> 5;
  _19 = index_23 & 31;
  _13 = MEM <struct _Base_bitset> [(_WordT *)&values]._M_w[_17];
  _14 = 1 << _19;
  _15 = _13 & _14;
  _16 = _15 != 0;
  _1 = (unsigned int) _16;
  total_11 = _1 + total_21;
  index_12 = index_23 + 1;
  ivtmp_26 = ivtmp_27 - 1;
  if (ivtmp_26 != 0)
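
For reference, a source loop of roughly the following shape would lower to
scalar IL like the above. This is a hypothetical reconstruction, since the
testcase is not quoted in this comment: the function name and the template
parameter are made up, and the 32-bit word size seen in the IL (shift by 5,
mask with 31) depends on the target.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical reconstruction: count set bits by testing one index at a
// time through operator[]. With N = 12800000 the trip count matches the IL.
template <std::size_t N>
unsigned count_bits_scalar(const std::bitset<N>& values) {
    unsigned total = 0;
    for (std::size_t index = 0; index < N; ++index) {
        if (values[index])  // word lookup plus bit test, as in _17 .. _16
            ++total;
    }
    return total;
}
```

A bitset of 12800000 bits takes about 1.6 MB, so a caller would want it to
have static storage duration or live on the heap rather than on the stack.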

I think for this kind of access pattern it might be nice to unroll the
loop as many times as the same memory location is accessed (1 << 5, aka
32 times).
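
A minimal sketch of what that blocking amounts to at the source level, written
against a plain array of 32-bit words as a stand-in for the bitset storage
(std::bitset does not expose its words portably). The function names are
hypothetical and the blocked variant assumes the bit count is a multiple
of 32.

```cpp
#include <cstddef>
#include <cstdint>

// Bit-at-a-time form, mirroring the scalar IL:
// word index = index >> 5, bit position = index & 31.
unsigned count_bitwise(const std::uint32_t* words, std::size_t nbits) {
    unsigned total = 0;
    for (std::size_t index = 0; index < nbits; ++index) {
        const std::uint32_t word = words[index >> 5];
        const std::uint32_t mask = std::uint32_t{1} << (index & 31);
        total += (word & mask) != 0;
    }
    return total;
}

// Blocked form: load each word once, then test its 32 bits in an inner
// loop with a constant trip count. Assumes nbits is a multiple of 32.
unsigned count_blocked(const std::uint32_t* words, std::size_t nbits) {
    unsigned total = 0;
    for (std::size_t w = 0; w < nbits / 32; ++w) {
        const std::uint32_t word = words[w];  // one load per 32 bit tests
        for (unsigned bit = 0; bit < 32; ++bit)
            total += (word >> bit) & 1u;
    }
    return total;
}
```

Both functions return the same count; the blocked one makes the reuse of each
loaded word explicit, which is the property the unrolling would expose.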

But as Andrew says - it's just a very bad testcase ;)
