https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91573
Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
In this case though, wouldn't the loop vectorizer also be able to handle it if
the permute was simpler? Re-rolling the loop or creating a minimum SLP tree
should be equivalent to

char src[512];
char dst[512];
#define WIDTH 8

void foo(int height, int a, int b, int c, int d, int dst_stride)
{
    char * ptr_src = src;
    char * ptr_dst = dst;
    for( int y = 0; y < height; y++ )
    {
        for( int x = 0; x < WIDTH; x++ )
        {
            int p1 = a + c;
            int p2 = b + d;
            char x1 = (p1 * ptr_src[x]  ) >> 6;
            char x2 = (p2 * ptr_src[x+1]) >> 6;
            ptr_dst[x] = x1 + x2;
        }
        ptr_dst += dst_stride;
        ptr_src += 32;
    }
}

Which does vectorize (using Andre's patch for the SUM reductions with
sign-change casts).

We've seen multiple other cases where doing so would (significantly) improve
vectorization and code generation. So perhaps we should try re-rolling the
loop or creating the smallest (in terms of height) possible SLP tree?
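For anyone who wants to experiment with this, here is a rough sketch that puts the rolled loop from the comment above next to a hand-unrolled variant and checks that both produce the same bytes. The unrolled function is written here only for illustration and is not the testcase from the bug report; the names foo_rolled, foo_unrolled, PIXEL and count_mismatches, and the input values, are likewise assumptions of this sketch. Inputs are kept small so that every intermediate fits in a char regardless of whether char is signed on the target.

```c
#include <assert.h>
#include <string.h>

#define WIDTH 8

static char src[512];
static char dst_rolled[512];
static char dst_unrolled[512];

/* Rolled form, following the loop in the comment above
   (p1/p2 hoisted out of the loops; they are loop-invariant).  */
static void foo_rolled(char *out, int height, int a, int b, int c, int d,
                       int dst_stride)
{
  const char *ptr_src = src;
  char *ptr_dst = out;
  int p1 = a + c;
  int p2 = b + d;
  for (int y = 0; y < height; y++)
    {
      for (int x = 0; x < WIDTH; x++)
        {
          char x1 = (p1 * ptr_src[x]) >> 6;
          char x2 = (p2 * ptr_src[x + 1]) >> 6;
          ptr_dst[x] = x1 + x2;
        }
      ptr_dst += dst_stride;
      ptr_src += 32;
    }
}

/* One output element, with the same narrowing steps as the rolled body.  */
#define PIXEL(i)                                                  \
  ptr_dst[i] = (char) ((char) ((p1 * ptr_src[i]) >> 6)            \
                       + (char) ((p2 * ptr_src[(i) + 1]) >> 6))

/* Hypothetical hand-unrolled form: the inner loop written out WIDTH times.  */
static void foo_unrolled(char *out, int height, int a, int b, int c, int d,
                         int dst_stride)
{
  const char *ptr_src = src;
  char *ptr_dst = out;
  int p1 = a + c;
  int p2 = b + d;
  for (int y = 0; y < height; y++)
    {
      PIXEL(0); PIXEL(1); PIXEL(2); PIXEL(3);
      PIXEL(4); PIXEL(5); PIXEL(6); PIXEL(7);
      ptr_dst += dst_stride;
      ptr_src += 32;
    }
}

/* Run both forms on the same input and count differing output bytes.  */
static int count_mismatches(int height, int a, int b, int c, int d,
                            int dst_stride)
{
  /* Keep every access inside the 512-byte arrays.  */
  assert(height >= 0 && height <= 16 && dst_stride >= 0);
  assert(height == 0 || (height - 1) * dst_stride + WIDTH <= 512);

  for (int i = 0; i < 512; i++)
    src[i] = (char) (i % 128);   /* values 0..127 fit in any char */
  memset(dst_rolled, 0, sizeof dst_rolled);
  memset(dst_unrolled, 0, sizeof dst_unrolled);

  foo_rolled(dst_rolled, height, a, b, c, d, dst_stride);
  foo_unrolled(dst_unrolled, height, a, b, c, d, dst_stride);

  int bad = 0;
  for (int i = 0; i < 512; i++)
    bad += dst_rolled[i] != dst_unrolled[i];
  return bad;
}
```

Calling count_mismatches(16, 20, 16, 12, 16, 32) from a small main should report no differences. Building with something like `gcc -O3 -fopt-info-vec` prints which loops or basic blocks GCC vectorized; whether either form is reported will depend on the GCC version, the target, and any patches applied.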