https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767
--- Comment #8 from mjr19 at cam dot ac.uk ---
If it is tricky to teach gfortran that it can trivially flip the signs of alternate elements in a vector with an xor, would a possible step towards an improvement be to teach it that vpermpd (as opposed to vpermilpd) is expensive on most Intel processors (3 cycle latency, one cycle throughput, just one functional unit)? The "optimisation" of using several vperms to save the odd vadd or vmul is then a step backwards, not forwards.

The cost model seems to be wrong: there are several cases where -ffast-math makes things slower on all Intel CPUs to which I have access, including when I set -march=native. In this particularly bad case, -ffast-math adds about 65% to the runtime on a Kaby Lake.
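
To make the xor idea concrete, here is a minimal scalar sketch in C (not the Fortran test case from this report, and not what gfortran currently emits). It assumes IEEE 754 binary64 doubles, where the top bit is the sign, so negating an element is a single xor with a constant mask. The function name flip_alternate_signs is hypothetical and only for illustration. A vectorised version would presumably be one vxorpd against a constant with the sign bit set in alternate lanes, though whether a given compiler produces that is not something this sketch shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate every odd-indexed element of v (n doubles)
   by XORing the sign bit, assuming IEEE 754 binary64 representation. */
void flip_alternate_signs(double *v, size_t n)
{
    const uint64_t sign_bit = UINT64_C(1) << 63;

    /* The bit trick only makes sense if double is 64 bits wide. */
    assert(sizeof(double) == sizeof(uint64_t));

    for (size_t i = 1; i < n; i += 2) {
        uint64_t bits;
        memcpy(&bits, &v[i], sizeof bits);   /* read the raw bit pattern */
        bits ^= sign_bit;                    /* flip only the sign bit */
        memcpy(&v[i], &bits, sizeof bits);   /* write it back */
    }
}
```

No permute, add or multiply is involved: the even-indexed elements are left untouched and the odd-indexed ones change sign exactly, including zeros, infinities and NaNs.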