https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64909
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |jakub at gcc dot gnu.org,
                   |        |kyukhin at gcc dot gnu.org
             Blocks|        |53947

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I'm with H.J. here: I can't reproduce anything like the code you are showing; the loop is vectorized normally.

But what we can see is that for e.g. -O3 -mavx we choose a vectorization factor of 8. That choice follows from the mix of 16-bit and 32-bit types used in the loop: before AVX2 we can mostly use only V4SImode and V8HImode. clang, by contrast, apparently vectorizes it with a vectorization factor of 4 instead of 8, and as the loop has a constant 12 iterations, doing it that way is beneficial.

So perhaps the question is why SLP after cunroll hasn't vectorized the unrolled scalar tail loop with a vectorization factor of 4:

pr64909.c:8:11: note: not vectorized: not enough data-refs in basic block.

It is true that for HImode we indeed can't fill a V8HImode vector, but the loaded value is used only immediately in an extension, which normally looks like:

  vect__4.7_30 = MEM[(short unsigned int *)vectp_a.6_27];
  vect__5.8_31 = [vec_unpack_lo_expr] vect__4.7_30;
  vect__5.8_32 = [vec_unpack_hi_expr] vect__4.7_30;

so all we'd need is the ability to emit a V4HImode load followed solely by a vec_unpack_lo_expr from it, instead of both vec_unpack_{lo,hi}_expr.