https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83951
Bug ID: 83951
Summary: [missed optimization] difference calculation for floats vs ints in a loop
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: eyalroz at technion dot ac.il
Target Milestone: ---

Consider the following code:

    template <typename T>
    int foo(T* __restrict__ a)
    {
        int i; T val = 0;
        for (i = 0; i < 100; i++) {
            val = 2 * i;
            a[i] = val;
        }
    }

    template int foo<int>(int* __restrict__ a);
    template int foo<float>(float* __restrict__ a);

(This is based on example 7.26 in Agner Fog's Optimizing Software in C++; the use of C++ here is immaterial.)

The int version compiles, with -O2, into:

    foo(int*):
            xor     eax, eax
    .L2:
            mov     DWORD PTR [rdi], eax
            add     eax, 2
            add     rdi, 4
            cmp     eax, 200
            jne     .L2
            rep ret

One would expect the float version to compile into something similar, except that instead of eax we would have a floating-point register, initialized to 0 and incremented by float 2.0 with each iteration. Instead, we get:

    int foo<float>(float*):
            xor     eax, eax
    .L6:
            pxor    xmm0, xmm0
            add     rdi, 4
            cvtsi2ss        xmm0, eax
            add     eax, 2
            movss   DWORD PTR [rdi-4], xmm0
            cmp     eax, 200
            jne     .L6
            rep ret

which seems to be much slower: it keeps the integer counter and performs an int-to-float conversion (cvtsi2ss) on every iteration.

Checked here: https://godbolt.org/g/RVBNyY
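For illustration, the expected transformation can be written by hand at the source level. This is only a sketch of the idea, not a claim about how GCC would or should implement it. The function names foo_float_strength_reduced and foo_float_reference are hypothetical and do not appear in the original test case. The rewrite preserves the stored values in this particular loop because every value involved (0, 2, ..., 200) is a small integer that float represents exactly, so the repeated addition of 2.0f never rounds.

```cpp
// Hypothetical hand-written variant of foo<float>: the per-iteration
// int-to-float conversion is replaced by a float induction variable.
// __restrict__ is the GCC/Clang extension used in the original report.
void foo_float_strength_reduced(float* __restrict__ a)
{
    float val = 0.0f;          // plays the role eax has in the int version
    for (int i = 0; i < 100; i++) {
        a[i] = val;
        val += 2.0f;           // exact here: all values up to 200 fit in a float
    }
}

// Reference variant that converts on every iteration, mirroring the
// loop body of the reported test case (hypothetical name).
void foo_float_reference(float* __restrict__ a)
{
    for (int i = 0; i < 100; i++) {
        a[i] = static_cast<float>(2 * i);
    }
}
```

Both functions store the same 100 values. In general a compiler can only substitute a floating-point induction variable when it can show that the accumulated additions are exact, as they are for this bounded loop.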