https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152
--- Comment #7 from Robin Dapp <rdapp at gcc dot gnu.org> ---
The vcompress code looks like a costing issue. The loop is now costed cheaper than it was in GCC 15, so GCC 16 chooses it where GCC 15 rejected it. I'll see if we can do something in the target here as a band-aid, such as making the permute more expensive when we know it takes two or more instructions.

Another improvement for (3) would be to let the vectorizer use strided loads even for appropriate contiguous access patterns.
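
For illustration only, a loop of roughly the following shape is one where a vectorizer can typically choose between contiguous loads followed by a permute (which a RISC-V target may expand to vcompress or a similar sequence) and a single strided load. This is a hypothetical sketch, not the reproducer from this PR; the function name `take_even` and the access pattern are assumptions.

```c
#include <stddef.h>

/* Hypothetical example, NOT the test case from the PR.
   Copies every second element of src into dst.  The reads from src
   touch a contiguous region with gaps, so the loads can be done either
   as full contiguous vector loads plus a permute that keeps the even
   elements, or as one strided load with a stride of 2 * sizeof (int).  */
void
take_even (int *restrict dst, const int *restrict src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[2 * i];
}
```

Which of the two strategies is chosen depends on how the target costs the permute relative to the strided load, which is the kind of trade-off the costing adjustment above is about.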
