https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115709
Bug ID: 115709
Summary: missed optimisation: vperms not reordered to eliminate
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: mjr19 at cam dot ac.uk
Target Milestone: ---

#include <complex.h>
void foo(double complex *a, double *b, int n){
  int i;
  for(i=0;i<n;i++)
    b[i]=creal(a[i])*creal(a[i])+cimag(a[i])*cimag(a[i]);
}

with "gcc-14 -mavx2 -mfma -Ofast" produces a loop which ends

        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm1, %ymm1
        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

However, if the two identical vperms were delayed until after the vmul and
vfmadd, then just one on ymm1 would be needed. I believe that

        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

is equivalent, given that the contents of ymm0 are not used again.

subroutine foo(a,b,n)
  complex(kind(1d0))::a(*)
  real(kind(1d0))::b(*)
  integer::i,n

  do i=1,n
     b(i)=real(a(i))*real(a(i))+aimag(a(i))*aimag(a(i))
  end do
end subroutine foo

has the same issue. The speed increase from eliminating one vperm is quite
measurable.
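
The equivalence claimed above relies on a lane permutation commuting with lane-wise arithmetic when both operands are permuted identically. The following is a minimal scalar sketch in C of that argument, not code from GCC or from the generated loop. It models "vpermpd $imm" on four double lanes as dst[i] = src[(imm >> 2*i) & 3], so $216 (0xD8) selects lanes 0, 2, 1, 3. The helper names permute4, sum_of_squares4 and permute_commutes are hypothetical and introduced only for illustration.

```c
#include <assert.h>

/* Scalar model of "vpermpd $imm, src, dst" on four double lanes:
   lane i of dst takes the source lane named by bits [2i+1:2i] of imm. */
void permute4(const double src[4], unsigned imm, double dst[4])
{
    assert(imm <= 0xFFu);               /* the immediate is 8 bits wide */
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm >> (2 * i)) & 3u];
}

/* Lane-wise a*a + b*b, standing in for the vmulpd / vfmadd132pd pair. */
void sum_of_squares4(const double a[4], const double b[4], double out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * a[i] + b[i] * b[i];
}

/* Returns 1 if permuting both inputs before the arithmetic yields the
   same lanes as permuting the result once afterwards, 0 otherwise. */
int permute_commutes(const double a[4], const double b[4], unsigned imm)
{
    double pa[4], pb[4], early[4], unpermuted[4], late[4];

    /* Order in the reported loop: two permutes, then arithmetic. */
    permute4(a, imm, pa);
    permute4(b, imm, pb);
    sum_of_squares4(pa, pb, early);

    /* Proposed order: arithmetic first, then a single permute. */
    sum_of_squares4(a, b, unpermuted);
    permute4(unpermuted, imm, late);

    for (int i = 0; i < 4; i++)
        if (early[i] != late[i])
            return 0;
    return 1;
}
```

Calling permute_commutes(a, b, 216u) on any two 4-element arrays should return 1, since each output lane depends only on the same-numbered lane of each input. The sketch says nothing about where in GCC the reordering would have to happen.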