https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91178
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The main issue is that loop vectorization creates a chain of increments

  # vectp_f.21_116 = PHI <vectp_f.21_117(7), vectp_f.22_115(9)>
  vect__16.23_118 = MEM <vector(4) int> [(int *)vectp_f.21_116];
  vectp_f.21_119 = vectp_f.21_116 + 16;
  vectp_f.21_121 = vectp_f.21_119 + 16;
  vectp_f.21_123 = vectp_f.21_121 + 16;
  vectp_f.21_125 = vectp_f.21_123 + 16;
...
  vectp_f.21_182363 = vectp_f.21_182361 + 16;
  vectp_f.21_182365 = vectp_f.21_182363 + 16;
  vectp_f.21_182367 = vectp_f.21_182365 + 16;
  vect__16.91149_182369 = VEC_PERM_EXPR <vect__16.23_118, vect__16.23_118, { 0, 0, 0, 0 }>;
  vect__16.91150_182370 = VEC_PERM_EXPR <vect__16.23_118, vect__16.23_118, { 0, 0, 0, 0 }>;
  vect__16.91151_182371 = VEC_PERM_EXPR <vect__16.23_118, vect__16.22804_45680, { 0, 6, 6, 6 }>;
  vect__16.91152_182372 = VEC_PERM_EXPR <vect__16.22804_45680, vect__16.22804_45680, { 2, 2, 2, 2 }>;
  vect__16.91153_182373 = VEC_PERM_EXPR <vect__16.22804_45680, vect__16.45586_91244, { 2, 2, 4, 4 }>;
  vect__16.91154_182374 = VEC_PERM_EXPR <vect__16.45586_91244, vect__16.45586_91244, { 0, 0, 0, 0 }>;
  vect__16.91155_182375 = VEC_PERM_EXPR <vect__16.45586_91244, vect__16.68367_136806, { 0, 0, 0, 6 }>;
  vect__16.91156_182376 = VEC_PERM_EXPR <vect__16.68367_136806, vect__16.68367_136806, { 2, 2, 2, 2 }>;
  vect__16.91157_182377 = VEC_PERM_EXPR <vect__16.68367_136806, vect__16.68367_136806, { 2, 2, 2, 2 }>;
  vect__73.91158_182378 = vect__73.20_106 - vect__16.91149_182369;
  vect__73.91158_182379 = vect__73.20_107 - vect__16.91150_182370;
  vect__73.91158_182380 = vect__73.20_108 - vect__16.91151_182371;
  vect__73.91158_182381 = vect__73.20_109 - vect__16.91152_182372;
  vect__73.91158_182382 = vect__73.20_110 - vect__16.91153_182373;
  vect__73.91158_182383 = vect__73.20_111 - vect__16.91154_182374;
  vect__73.91158_182384 = vect__73.20_112 - vect__16.91155_182375;
  vect__73.91158_182385 = vect__73.20_113 - vect__16.91156_182376;
  vect__73.91158_182386 = vect__73.20_114 - vect__16.91157_182377;
  vectp_f.21_117 = vectp_f.21_182367 + 16;
  ivtmp_182463 = ivtmp_182462 + 1;
  if (ivtmp_182463 < bnd.17_102)
    goto <bb 7>; [0.00%]
  else
    goto <bb 11>; [100.00%]

where it first generates one load for each of the increments and then
the permutation makes most of them dead.  For interleaving we have some
cut-off to avoid this kind of code-gen but for SLP we don't.

DR group size is 91126 here and gap 91125 (aka single element interleaving).
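
To make the "most of them dead" arithmetic concrete, here is a minimal sketch in C. It is not vectorizer code. It takes the group size (91126) and the vector(4) int type from the comment, and assumes a vectorization factor of 4, so that one vector iteration spans four groups. All names (GROUP_SIZE, NUNITS, VF, live_elt, vec_index, vec_lane, loads_per_iteration, print_live_loads) are hypothetical helpers made up for illustration.

```c
#include <stdio.h>

/* From the comment: DR group size 91126 (gap 91125), vector(4) int. */
#define GROUP_SIZE 91126u
#define NUNITS 4u
/* Assumption, not stated in the comment: vectorization factor is 4. */
#define VF 4u

/* With single element interleaving only the first element of each
   group is used, so group G contributes scalar element G * GROUP_SIZE. */
unsigned live_elt(unsigned group) { return group * GROUP_SIZE; }

/* Which vector load in the chain holds scalar element ELT, and where. */
unsigned vec_index(unsigned elt) { return elt / NUNITS; }
unsigned vec_lane(unsigned elt) { return elt % NUNITS; }

/* Vector loads needed to cover VF whole groups in one vector iteration. */
unsigned loads_per_iteration(void) { return GROUP_SIZE * VF / NUNITS; }

/* Print the few loads that stay live after the permutation. */
void print_live_loads(void)
{
  for (unsigned g = 0; g < VF; g++)
    {
      unsigned elt = live_elt(g);
      printf("group %u: load %u, lane %u\n", g, vec_index(elt), vec_lane(elt));
    }
  printf("%u loads per iteration, %u live\n", loads_per_iteration(), VF);
}
```

Calling print_live_loads() from a main() reports loads 0, 22781, 45563 and 68344 as live out of 91126 per vector iteration. Those offsets appear to line up with the dump above, where the loads feeding the VEC_PERM_EXPRs are named vect__16.23, vect__16.22804, vect__16.45586 and vect__16.68367 (23 plus 0, 22781, 45563 and 68344), and where permute indices 0/4 and 2/6 select lane 0 and lane 2.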