https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83952
            Bug ID: 83952
           Summary: [missed optimization] difference calculation for floats vs ints in a loop
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: eyalroz at technion dot ac.il
  Target Milestone: ---

Created attachment 43195
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43195&action=edit
Code exemplifying the issue

Consider the following code:

  template <typename T>
  void foo(T* __restrict__ a)
  {
      int i; T val = 0;
      for (i = 0; i < 100; i++) {
          val = 2 * i;
          a[i] = val;
      }
  }

  template void foo<int>(int* __restrict__ a);
  template void foo<float>(float* __restrict__ a);

(This is based on example 7.26 in Agner Fog's Optimizing Software in C++; but
the use of C++ here is immaterial).

The int version compiles, with -O2, into:

  void foo<int>(int*):
          xor     eax, eax
  .L2:
          mov     DWORD PTR [rdi], eax
          add     eax, 2
          add     rdi, 4
          cmp     eax, 200
          jne     .L2
          rep ret

One would expect the float version to compile into something similar, except
that the stored value would be kept in a floating-point register (instead of
eax), initialized to 0 and incremented by float 2.0 with each iteration.
Instead, we get:

  void foo<float>(float*):
          xor     eax, eax
  .L6:
          pxor    xmm0, xmm0
          add     rdi, 4
          cvtsi2ss        xmm0, eax
          add     eax, 2
          movss   DWORD PTR [rdi-4], xmm0
          cmp     eax, 200
          jne     .L6
          rep ret

which seems to be much slower, since it converts the integer to float
(cvtsi2ss) in every iteration.

Checked here: https://godbolt.org/g/t8Hvyn
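
For illustration, the transformation the report expects can be written out by
hand. This is only a sketch of the expected result, not something GCC is known
to emit: the function names foo_float_reduced and foo_float_reference are
hypothetical and do not appear in the attachment. The second function mirrors
foo<float> from the report so that the two can be compared.

```cpp
// Hypothetical hand-written variant of foo<float>: the per-iteration
// int-to-float conversion is replaced by a float accumulator.
void foo_float_reduced(float* __restrict__ a)
{
    float val = 0.0f;                  // float copy of 2 * i
    for (int i = 0; i < 100; i++) {
        a[i] = val;
        val += 2.0f;                   // exact here: every value up to 200
                                       // is representable as a float
    }
}

// Reference: the loop from the report, specialized for float by hand.
void foo_float_reference(float* __restrict__ a)
{
    for (int i = 0; i < 100; i++) {
        a[i] = static_cast<float>(2 * i);   // conversion in every iteration
    }
}
```

Both functions store the same values for this loop, because the even integers
0..198 and the constant 2.0f are exactly representable in single precision.
For larger trip counts or other step values, repeated float addition could
round differently from a direct conversion.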