http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57988
Bug ID: 57988
Summary: missed optimization vxorpd before vcvtsi2sdq
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: dushistov at mail dot ru

I tested this simple function on an i7-3740QM CPU @ 2.70GHz, with gcc 4.8.1
and gcc 4.9.0 20130725:

    double pi(unsigned int count)
    {
        unsigned int i;
        double p = 0;
        double z = 1;

        for (i = 1; i < count; i += 2) {
            p += z * 4 / i;
            z *= -1;
        }
        return p;
    }

gcc (-Ofast -march=native -std=c99) converts the loop into the following code:

    ...
    30: mov       %eax,%edx
        vmulsd    %xmm5,%xmm1,%xmm3
        add       $0x2,%eax
        vcvtsi2sd %rdx,%xmm2,%xmm2
        cmp       %eax,%edi
        vxorpd    %xmm4,%xmm1,%xmm1
        vdivsd    %xmm2,%xmm3,%xmm2
        vaddsd    %xmm2,%xmm0,%xmm0
        ja        30

The average time is 0.03 sec when called as pi(10000000).

If the line "vcvtsi2sd %rdx,%xmm2,%xmm2" is replaced with these two lines:

        vxorpd    %xmm2,%xmm2,%xmm2
        vcvtsi2sd %rdx,%xmm2,%xmm2

then the average time is 0.011-0.013 sec, nearly 3 times faster.

PS: icc generates the following loop:

    22: vxorpd    %xmm5,%xmm5,%xmm5
        vcvtsi2sd %rax,%xmm5,%xmm5
        vmulsd    %xmm2,%xmm1,%xmm4
        vsubsd    %xmm2,%xmm3,%xmm2
        vdivsd    %xmm5,%xmm4,%xmm6
        add       $0x2,%eax
        vaddsd    %xmm6,%xmm0,%xmm0
        cmp       %edi,%eax
        jb        22

and its average time is 0.013 sec.
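
For anyone who wants to reproduce the timings above, a minimal harness might look like the sketch below. The pi function is the one from the report. The time_pi helper and its runs parameter are hypothetical names introduced here for illustration and are not part of the original test. The sketch uses clock(), which measures CPU time, so the reporter's actual measurement method may have differed.

```c
#include <assert.h>
#include <time.h>

/* Function from the report, unchanged apart from formatting. */
double pi(unsigned int count)
{
    unsigned int i;
    double p = 0;
    double z = 1;

    for (i = 1; i < count; i += 2) {
        p += z * 4 / i;
        z *= -1;
    }
    return p;
}

/* Hypothetical helper (an assumption, not from the report):
 * returns the average CPU seconds per call over `runs` calls. */
double time_pi(unsigned int count, int runs)
{
    volatile double sink = 0;   /* volatile so every result is actually used */
    clock_t start, end;
    int r;

    assert(runs > 0);

    start = clock();
    for (r = 0; r < runs; r++)
        sink += pi(count);
    end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Calling time_pi(10000000, 100) from main and printing the result gives a figure comparable to the averages quoted in the report. For trustworthy numbers, pi would ideally be compiled in a separate translation unit, so that the compiler cannot inline it and hoist the computation out of the timing loop.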