https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119181
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #10)
> But it still can't fix the issue with
>
> void
> foo (int* a, int* restrict b)
> {
>   b[0] = a[0] * a[8];
>   b[1] = a[1] * a[9];
>   b[2] = a[2] * a[10];
>   b[3] = a[11] * a[3];
>   b[4] = a[12] * a[4];
>   b[5] = a[5] * a[13];
>   b[6] = a[6] * a[14];
>   b[7] = a[7] * a[15];
> }
>
> -O2 -mavx2
>
> foo:
>         vmovdqu ymm0, YMMWORD PTR [rdi]
>         vmovdqu ymm2, YMMWORD PTR [rdi+32]
>         vpblendd ymm1, ymm2, ymm0, 231
>         vpblendd ymm0, ymm0, ymm2, 231
>         vpmulld ymm0, ymm1, ymm0
>         vmovdqu YMMWORD PTR [rsi], ymm0
>         vzeroupper
>         ret
>
> There's 2 redundant vpblendd here.

Yes, which is why I didn't try splitting groups - most practical cases
will not have a large constant gap.  Instead this asks for an optimization
phase on the SLP tree, possibly as part of permute optimizations.

For vector code as in comment #11 this could be optimized by either a
match.pd pattern or by forwprop.  Note that the pattern could sit deeper in
an expression tree, like permute * (x + permute), where eliding two permutes
in exchange for an additional permute on 'x' might pay off.  That case
shouldn't be handled with match.pd or simple pattern matching; it would ask
for some kind of propagation pass (like we do in SLP permute optimization).
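
To illustrate why the two vpblendd in the quoted assembly are redundant, here is a minimal scalar sketch in C.  It assumes the usual vpblendd lane selection (bit i of the immediate set means lane i is taken from the second source operand) and models vpmulld as a plain lane-wise int multiply.  The helper names (blend8, mul8, mul_with_blends, mul_without_blends, blends_are_redundant, check_example) are hypothetical and only for illustration; none of this is GCC code.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };

/* Scalar model of vpblendd: lane i comes from src2 when bit i of mask
   is set, otherwise from src1 (assumed semantics, see lead-in). */
void blend8(int *dst, const int *src1, const int *src2, unsigned mask)
{
    for (size_t i = 0; i < LANES; i++)
        dst[i] = ((mask >> i) & 1u) ? src2[i] : src1[i];
}

/* Lane-wise multiply standing in for vpmulld; inputs are assumed small
   enough that the products do not overflow int. */
void mul8(int *dst, const int *x, const int *y)
{
    for (size_t i = 0; i < LANES; i++)
        dst[i] = x[i] * y[i];
}

/* What the quoted code does: two complementary blends, then a multiply. */
void mul_with_blends(int *b, const int *a)
{
    const int *lo = a, *hi = a + LANES;
    int x[LANES], y[LANES];
    blend8(x, hi, lo, 231u);   /* vpblendd ymm1, ymm2, ymm0, 231 */
    blend8(y, lo, hi, 231u);   /* vpblendd ymm0, ymm0, ymm2, 231 */
    mul8(b, x, y);
}

/* The same product with both blends elided. */
void mul_without_blends(int *b, const int *a)
{
    mul8(b, a, a + LANES);
}

/* Returns 1 when both versions agree on every lane for a[0..15]. */
int blends_are_redundant(const int *a)
{
    int with[LANES], without[LANES];
    mul_with_blends(with, a);
    mul_without_blends(without, a);
    for (size_t i = 0; i < LANES; i++)
        if (with[i] != without[i])
            return 0;
    return 1;
}

/* Usage: aborts via assert if the two versions ever disagree. */
void check_example(void)
{
    int a[2 * LANES];
    for (int i = 0; i < 2 * LANES; i++)
        a[i] = i + 1;
    assert(blends_are_redundant(a));
}
```

The mask 231 is 0b11100111, so the two blends only exchange lanes 3 and 4 between the operands, which are exactly the lanes where the source swaps the multiplication operands.  Since the multiply is commutative in each lane, swapping both operands in the same lanes cannot change the result, and that is the property an SLP-level permute optimization could exploit.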