https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115709

--- Comment #3 from mjr19 at cam dot ac.uk ---
Created attachment 58558
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58558&action=edit
Demo of effect of vperm rearrangement

I still believe that my code is correct. To make what I propose clearer, I
attach a runnable demo, which checks itself.

Whether the optimisation is easy enough to be worthwhile, or whether it would
generalise to other cases, is another matter. On a Kaby Lake the optimised
version is about 20% faster, but on a Haswell it is only about 7% faster.
