http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47010
           Summary: Missed optimization: x86-64 prologue not deleted
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: schnet...@gmail.com

Created attachment 22818
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22818
pre-processed bzipped source code

The following code is generated by g++ 4.5.1 on an x86-64 architecture
(Mac OS 10.6). This is a static function where g++ may even have modified
the argument list. I believe the three instructions "pushq", "movq", and
"leave" are not necessary. This routine is called in a compute-intensive
inner loop that has problems fitting into the level 1 instruction cache.

The disassembled routine is:

__ZL20PDstandardNth11_implPKdll.clone.1:
0000000000000140    pushq   %rbp
0000000000000141    movupd  0x10(%rdi),%xmm3
0000000000000146    movupd  0xf0(%rdi),%xmm0
000000000000014b    movupd  0x08(%rdi),%xmm2
0000000000000150    addpd   %xmm3,%xmm0
0000000000000154    movupd  0xf8(%rdi),%xmm1
0000000000000159    movq    %rsp,%rbp
000000000000015c    addpd   %xmm2,%xmm1
0000000000000160    mulpd   0x000a0578(%rip),%xmm1
0000000000000168    addpd   %xmm0,%xmm1
000000000000016c    movupd  (%rdi),%xmm0
0000000000000170    mulpd   0x000a0578(%rip),%xmm0
0000000000000178    leave
0000000000000179    addpd   %xmm1,%xmm0
000000000000017d    ret

The original function is defined as:

static CCTK_REAL_VEC PDstandardNth11_impl(CCTK_REAL const* restrict const u,
                                          ptrdiff_t const dj,
                                          ptrdiff_t const dk)
  __attribute__((pure)) __attribute__((noinline)) __attribute__((unused));

static CCTK_REAL_VEC PDstandardNth11_impl(CCTK_REAL const* restrict const u,
                                          ptrdiff_t const dj,
                                          ptrdiff_t const dk)
{
  return kmadd(ToReal(30),vec_loadu_maybe3(0,0,0,(u)[(0)+dj*(0)+dk*(0)]),
         kmadd(ToReal(-16),
               kadd(vec_loadu_maybe3(-1,0,0,(u)[(-1)+dj*(0)+dk*(0)]),
                    vec_loadu_maybe3(1,0,0,(u)[(1)+dj*(0)+dk*(0)])),
               kadd(vec_loadu_maybe3(-2,0,0,(u)[(-2)+dj*(0)+dk*(0)]),
                    vec_loadu_maybe3(2,0,0,(u)[(2)+dj*(0)+dk*(0)]))));
}

where CCTK_REAL is double, and CCTK_REAL_VEC is __m128d, the SSE2 vector of
doubles. The function body contains macros that translate directly to Intel
SSE2 vector instructions.

The code was compiled with gcc 4.5.1 with the options

g++-mp-4.5 -g3 -m128bit-long-double -march=native -std=gnu++0x -O3
  -funsafe-loop-optimizations -fsee -ftree-loop-linear -ftree-loop-im
  -fivopts -fvect-cost-model -funroll-loops -funroll-all-loops
  -fvariable-expansion-in-unroller -fprefetch-loop-arrays -ffast-math
  -fassociative-math -freciprocal-math -fno-trapping-math
  -fexcess-precision=fast -fopenmp -Wall -Wshadow -Wpointer-arith
  -Wcast-qual -Wcast-align -Woverloaded-virtual

I attach the complete pre-processed and bzipped source code. The source
code itself is auto-generated.
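
For readers who do not want to unpack the attachment, the following is a
minimal standalone sketch of the same stencil. It is not taken from the
attached source: the macro bodies for ToReal, kadd and kmadd and the
simplified vec_loadu (standing in for vec_loadu_maybe3) are assumptions
chosen to match the disassembly above (unaligned loads, a*b + c), and the
names stencil11 and stencil11_lane are hypothetical. Whether this reduced
form reproduces the extra prologue has not been verified.

```cpp
// Reduced sketch of the reported routine. The macro definitions below are
// assumptions, not the originals from the attachment.
#include <cassert>
#include <stddef.h>
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)

typedef double  CCTK_REAL;
typedef __m128d CCTK_REAL_VEC;

// Assumed expansions: broadcast a scalar, vector add, multiply-add a*b + c.
#define ToReal(x)      (_mm_set1_pd(x))
#define kadd(a, b)     (_mm_add_pd((a), (b)))
#define kmadd(a, b, c) (_mm_add_pd(_mm_mul_pd((a), (b)), (c)))
// Assumed expansion: unaligned load of two doubles starting at element p.
#define vec_loadu(p)   (_mm_loadu_pd(&(p)))

static CCTK_REAL_VEC stencil11(CCTK_REAL const* __restrict__ const u,
                               ptrdiff_t const dj, ptrdiff_t const dk)
    __attribute__((pure)) __attribute__((noinline));

// Computes 30*u[0] - 16*(u[-1] + u[1]) + (u[-2] + u[2]) on two lanes.
// dj and dk are multiplied by zero, as in the original index expressions.
static CCTK_REAL_VEC stencil11(CCTK_REAL const* __restrict__ const u,
                               ptrdiff_t const dj, ptrdiff_t const dk) {
  return kmadd(ToReal(30), vec_loadu(u[0 + dj * 0 + dk * 0]),
         kmadd(ToReal(-16),
               kadd(vec_loadu(u[-1 + dj * 0 + dk * 0]),
                    vec_loadu(u[1 + dj * 0 + dk * 0])),
               kadd(vec_loadu(u[-2 + dj * 0 + dk * 0]),
                    vec_loadu(u[2 + dj * 0 + dk * 0]))));
}

// Hypothetical helper: returns one lane (0 or 1) of the result so the
// arithmetic can be checked without vector types at the call site.
double stencil11_lane(const double* u, int lane) {
  assert(lane == 0 || lane == 1);
  double out[2];
  _mm_storeu_pd(out, stencil11(u, 0, 0));
  return out[lane];
}
```

Compiling this with "g++ -O3 -S" and looking at the code emitted for
stencil11 should show whether the pushq/movq/leave sequence appears.
Comparing against a build with -fomit-frame-pointer may help tell whether
the frame setup is kept only because of the target's default.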