https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, GCC 14.2 doesn't vectorize for me anymore, likely because the use of
gather has been nerfed (for Intel).  With AVX2 we see

  <bb 3> [local count: 139586405]:
  # vect_total_21.15_24 = PHI <vect_total_11.24_41(3), { 0, 0, 0, 0, 0, 0, 0, 0 }(2)>
  # vect_vec_iv_.16_20 = PHI <_18(3), { 0, 1, 2, 3, 4, 5, 6, 7 }(2)>
  # ivtmp.27_27 = PHI <ivtmp.27_23(3), 0(2)>
  _18 = vect_vec_iv_.16_20 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  vect__17.17_5 = vect_vec_iv_.16_20 >> 5;
  vect__19.18_3 = vect_vec_iv_.16_20 & { 31, 31, 31, 31, 31, 31, 31, 31 };
  vect_30 = VIEW_CONVERT_EXPR<vector(8) int>(vect__17.17_5);
  vect_31 = __builtin_ia32_gathersiv8si ({ 0, 0, 0, 0, 0, 0, 0, 0 }, &values, vect_30, { -1, -1, -1, -1, -1, -1, -1, -1 }, 4);
  vect__13.19_32 = VIEW_CONVERT_EXPR<vector(8) long unsigned int>(vect_31);
  vect__14.20_34 = { 1, 1, 1, 1, 1, 1, 1, 1 } << vect__19.18_3;
  vect__15.21_35 = vect__13.19_32 & vect__14.20_34;
  mask__16.22_37 = vect__15.21_35 != { 0, 0, 0, 0, 0, 0, 0, 0 };
  _51 = VIEW_CONVERT_EXPR<vector(8) unsigned int>(mask__16.22_37);
  vect_total_11.24_41 = vect_total_21.15_24 - _51;
  ivtmp.27_23 = ivtmp.27_27 + 1;
  if (ivtmp.27_23 != 1600000)

so the first point is we are not able to analyze the memory access pattern
in a very good way and then of course cost modeling breaks down here as well.

The scalar IL is

  <bb 3> [local count: 1063004409]:
  # total_21 = PHI <total_11(5), 0(2)>
  # index_23 = PHI <index_12(5), 0(2)>
  # ivtmp_27 = PHI <ivtmp_26(5), 12800000(2)>
  _17 = index_23 >> 5;
  _19 = index_23 & 31;
  _13 = MEM <struct _Base_bitset> [(_WordT *)&values]._M_w[_17];
  _14 = 1 << _19;
  _15 = _13 & _14;
  _16 = _15 != 0;
  _1 = (unsigned int) _16;
  total_11 = _1 + total_21;
  index_12 = index_23 + 1;
  ivtmp_26 = ivtmp_27 - 1;
  if (ivtmp_26 != 0)

I think for this kind of access pattern it might be nice to unroll the loop
as many times as the same memory location is accessed (1 << 5, aka 32 times).
But as Andrew says - it's just a very bad testcase ;)
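
For illustration only, the access pattern in the scalar IL and the "unroll by 32" idea might look roughly like the C++ sketch below. It is not the testcase from this PR (which is not quoted in this comment) and not what GCC emits. The function names `count_per_bit` and `count_per_word`, the use of `std::vector<std::uint32_t>` in place of the bitset storage, and the requirement that the bit count be a multiple of 32 are all assumptions made for the sketch. The 32-bit word size is taken from the `>> 5` and `& 31` in the dump.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Shape of the scalar loop in the dump: every iteration reloads the
// containing word (index >> 5) and tests a single bit (index & 31).
// Hypothetical helper, not code from the PR.
unsigned count_per_bit(const std::vector<std::uint32_t>& words,
                       std::size_t nbits) {
    unsigned total = 0;
    for (std::size_t index = 0; index < nbits; ++index) {
        std::uint32_t w = words[index >> 5];
        std::uint32_t m = std::uint32_t{1} << (index & 31);
        total += (w & m) != 0;
    }
    return total;
}

// Same result with the loop conceptually unrolled 32 times: each word is
// loaded once and its 32 bits are tested from a register.
// Assumes nbits is a multiple of 32 (an assumption of this sketch).
unsigned count_per_word(const std::vector<std::uint32_t>& words,
                        std::size_t nbits) {
    unsigned total = 0;
    for (std::size_t word = 0; word < nbits / 32; ++word) {
        std::uint32_t w = words[word];  // one load per 32 bits
        for (unsigned bit = 0; bit < 32; ++bit)
            total += (w >> bit) & 1u;
    }
    return total;
}
```

In the second form the memory access is a plain contiguous load per outer iteration, so no gather is needed to express it. Whether the vectorizer could derive that form on its own is exactly the open question in the comment above.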