http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57954
Bug ID: 57954
Summary: AVX missing vxorps (zeroing) before vcvtsi2ss %edx, slow down AVX code
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch

In the following benchmark, performance without vectorization is poor relative to expectations.
I found that this is because a register is not zeroed before it is used.

c++ -O2 -S polyAVX.cpp -mavx
as -v --64 -o polyAVX.o polyAVX.s
GNU assembler version 2.23.1 (x86_64-redhat-linux-gnu) using BFD version (GNU Binutils) 2.23.1
c++ -O2 polyAVX.o -march=corei7-avx ; time ./a.out
53896530759
15.418u 0.000s 0:15.43 99.8% 0+0k 0+0io 1pf+0w

After inserting a vxorps into the generated assembly:

patch polyAVX.s
49a50
> vxorps %xmm0,%xmm0,%xmm0
patching file polyAVX.s
as -v --64 -o polyAVX.o polyAVX.s
GNU assembler version 2.23.1 (x86_64-redhat-linux-gnu) using BFD version (GNU Binutils) 2.23.1
c++ -O2 polyAVX.o -march=corei7-avx ; time ./a.out
10340756863
2.958u 0.000s 0:02.96 99.6% 0+0k 0+0io 1pf+0w

I am sure there are many other cases like this.

gcc version 4.9.0 20130718 (experimental) [trunk revision 201034] (GCC)

cat polyAVX.cpp

//template<typename T>
typedef float T;

// Degree-6 polynomial evaluated with Horner's scheme.
inline T polyHorner(T y) {
  return T(0x2.p0) + y * (T(0x2.p0) + y * (T(0x1.p0) + y * (T(0x5.55523p-4) + y * (T(0x1.5554dcp-4) + y * (T(0x4.48f41p-8) + y * T(0xb.6ad4p-12))))));
}

#include <x86intrin.h>
#include <iostream>

// Read the time-stamp counter.
volatile unsigned long long rdtsc() {
  unsigned int taux=0;
  return __rdtscp(&taux);
}

int main() {
  long long t=0;   // accumulated cycle count
  bool ret=true;
  float s=0;       // running sum, keeps the loop from being optimized away
  for (int k=0; k!=100; ++k) {
    float c = 1.f/10000000.f;
    t -= rdtsc();
    // float(i) and float(k) are int-to-float conversions (vcvtsi2ss)
    for (int i=1; i<10000001; ++i) s += polyHorner((float(i)+float(k))*c);
    t += rdtsc();
  }
  ret &= s!=0;
  std::cout << t << std::endl;
  return ret ? 0 : -1;
}
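
For illustration only, here is a minimal sketch of a possible source-level workaround. It rests on the understanding that vcvtsi2ss writes only the low 32 bits of its destination and merges the upper bits from a source register, so the conversion has to wait for whatever last wrote that register. The helper name int_to_float_zeroed is hypothetical and not part of the testcase; the intrinsics are the standard ones from <immintrin.h>. Whether the compiler really emits a vxorps immediately before the conversion depends on the compiler version and flags, so the generated assembly should be inspected.

```cpp
#include <cassert>
#include <immintrin.h>

// Hypothetical helper (not from the report): convert an int to float
// through an explicitly zeroed vector, so that the conversion merges its
// upper lanes from a fresh zero instead of a stale register value.
inline float int_to_float_zeroed(int i) {
    __m128 zero = _mm_setzero_ps();                  // usually xorps/vxorps
    return _mm_cvtss_f32(_mm_cvtsi32_ss(zero, i));   // cvtsi2ss/vcvtsi2ss
}

// Sanity check: the result must agree with an ordinary cast.
inline void check_int_to_float_zeroed() {
    assert(int_to_float_zeroed(0) == 0.0f);
    assert(int_to_float_zeroed(42) == static_cast<float>(42));
    assert(int_to_float_zeroed(-7) == static_cast<float>(-7));
}
```

In the benchmark loop this would stand in for float(i) and float(k), for example polyHorner((int_to_float_zeroed(i)+int_to_float_zeroed(k))*c). Compiling with c++ -O2 -mavx -S and looking for a vxorps ahead of each vcvtsi2ss shows whether the dependency is actually broken.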