https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115709
--- Comment #3 from mjr19 at cam dot ac.uk ---
Created attachment 58558
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58558&action=edit
Demo of effect of vperm rearrangement

I still believe that my code is correct. To make what I propose clearer, I attach a runnable demo, which checks itself.

Whether the optimisation is easy enough to be worthwhile, or whether it would generalise to other cases, is another matter.

On a Kaby Lake the optimised version is about 20% faster, but on a Haswell it is only about 7% faster.