https://llvm.org/bugs/show_bug.cgi?id=31866
Bug ID: 31866 Summary: Squaring a complex float gives inefficient code Product: new-bugs Version: trunk Hardware: PC OS: Linux Status: NEW Severity: normal Priority: P Component: new bugs Assignee: unassignedb...@nondot.org Reporter: drr...@gmail.com CC: llvm-bugs@lists.llvm.org Classification: Unclassified Consider: #include <complex.h> complex float f(complex float x) { return x*x; } clang trunk with -O3 -march=core-avx2 gives f: # @f vmovaps xmm2, xmm0 vmovshdup xmm1, xmm2 # xmm1 = xmm2[1,1,3,3] vmulss xmm0, xmm2, xmm2 vmulss xmm3, xmm1, xmm1 vmulss xmm4, xmm2, xmm1 vsubss xmm0, xmm0, xmm3 vaddss xmm3, xmm4, xmm4 vucomiss xmm0, xmm0 jnp .LBB0_3 vucomiss xmm3, xmm3 jp .LBB0_2 .LBB0_3: vinsertps xmm0, xmm0, xmm3, 16 # xmm0 = xmm0[0],xmm3[0],xmm0[2,3] ret .LBB0_2: push rax vmovaps xmm0, xmm2 vmovaps xmm3, xmm1 call __mulsc3 vmovshdup xmm3, xmm0 # xmm3 = xmm0[1,1,3,3] add rsp, 8 vinsertps xmm0, xmm0, xmm3, 16 # xmm0 = xmm0[0],xmm3[0],xmm0[2,3] ret The Intel C compiler with -fp-model strict gives f: vmovsldup xmm1, xmm0 #3.12 vmovshdup xmm2, xmm0 #3.12 vshufps xmm3, xmm0, xmm0, 177 #3.12 vmulps xmm4, xmm1, xmm0 #3.12 vmulps xmm5, xmm2, xmm3 #3.12 vaddsubps xmm0, xmm4, xmm5 #3.12 ret "strict" should be value safe, turn on floating point exception semantics and also disable fuse multiply add. "precise" is the setting to use if you just want it to be value safe. -fp-model precise -fp-model except gives f: vmovshdup xmm1, xmm0 #3.12 vshufps xmm2, xmm0, xmm0, 177 #3.12 vmulps xmm4, xmm1, xmm2 #3.12 vmovsldup xmm3, xmm0 #3.12 vfmaddsub213ps xmm0, xmm3, xmm4 #3.12 ret gcc 7 gives code that is shorter than clang does but still call __mulsc3 . f: vmovq QWORD PTR [rsp-16], xmm0 vmovss xmm3, DWORD PTR [rsp-12] vmovss xmm2, DWORD PTR [rsp-16] vmovaps xmm1, xmm3 vmovaps xmm0, xmm2 jmp __mulsc3 If you enable -ffast-math in clang it is much better although not quite optimal with: f: # @f vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3] vaddss xmm2, xmm0, xmm0 vmulss xmm2, xmm1, xmm2 vmulss xmm1, xmm1, xmm1 vfmsub231ss xmm1, xmm0, xmm0 vinsertps xmm0, xmm1, xmm2, 16 # xmm0 = xmm1[0],xmm2[0],xmm1[2,3] ret >From my non-expert eyes it seems there are two questions: 1) In the "no fast-math" case is ICC actually meeting the C99 specs? 2) In the "fast-math" case can clang/llvm be persuaded/changed to use one call to vmulps instead of two calls to vmulss? -- You are receiving this mail because: You are on the CC list for the bug.
_______________________________________________ llvm-bugs mailing list llvm-bugs@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs