Hello all,

I have a question about performance on POWER8 (little-endian, GCC 4.9.1), specifically with load/store instructions. I am evaluating different ways of evaluating a polynomial; to keep the thread simple, I provide a basic meta-programming example of polynomial evaluation with Horner's method. I have:
template<int n> struct coeff{};

template<> struct coeff<0>{ const static inline double coefficient() { return 1.00000; } };
template<> struct coeff<1>{ const static inline double coefficient() { return 9.99999; } };
template<> struct coeff<2>{ const static inline double coefficient() { return 5.00000; } };
template<> struct coeff<3>{ const static inline double coefficient() { return 1.66666; } };
template<> struct coeff<4>{ const static inline double coefficient() { return 4.16667; } };

template< template<int n> class C, int n>
struct horner_helper{
    static const inline double h(double x){
        return C<4-n>::coefficient() + horner_helper<C,n-1>::h(x)*x;
    }
};

template< template<int n> class C>
struct horner_helper<C,0>{
    static const inline double h(double){
        return C<4>::coefficient();
    }
};

inline double horner(double x){
    return horner_helper<coeff,4>::h(x);
}

double poly(double x){
    double y = horner(x);
    return y;
}

If I look at the assembly generated by GCC compared with XLC, I get:

MAKEFILE

all:
	make gcc xlc

gcc:
	gcc -O3 -c horner.cpp -o horner.o
	ar rcs libhorner_gcc.a horner.o
	rm horner.o

xlc:
	xlc -O3 -c horner.cpp -o horner.o
	ar rcs libhorner_xlc.a horner.o
	rm horner.o

clean:
	rm -f horner.o libhorner_gcc.a libhorner_xlc.a

GCC

0000000000000000 <_Z4polyd>:
   0:   00 00 4c 3c     addis   r2,r12,0
   4:   00 00 42 38     addi    r2,r2,0
   8:   00 00 22 3d     addis   r9,r2,0
   c:   00 00 89 c9     lfd     f12,0(r9)
  10:   00 00 22 3d     addis   r9,r2,0
  14:   00 00 29 c9     lfd     f9,0(r9)
  18:   00 00 22 3d     addis   r9,r2,0
  1c:   00 00 09 c0     lfs     f0,0(r9)
  20:   00 00 22 3d     addis   r9,r2,0
  24:   00 00 49 c9     lfd     f10,0(r9)
  28:   00 00 22 3d     addis   r9,r2,0
  2c:   3a 4b 81 fd     fmadd   f12,f1,f12,f9
  30:   00 00 69 c1     lfs     f11,0(r9)
  34:   3a 03 01 fc     fmadd   f0,f1,f12,f0
  38:   3a 50 01 fc     fmadd   f0,f1,f0,f10
  3c:   3a 58 21 fc     fmadd   f1,f1,f0,f11
  40:   20 00 80 4e     blr
  44:   00 00 00 00     .long 0x0
  48:   00 09 00 00     .long 0x900
  4c:   00 00 00 00     .long 0x0

XLC

   0:   00 00 4c 3c     addis   r2,r12,0
   4:   00 00 42 38     addi    r2,r2,0
   8:   00 00 62 3c     addis   r3,r2,0
   c:   00 00 63 e8     ld      r3,0(r3)
  10:   8c 03 05 10     vspltisw v0,5
  14:   8c 03 21 10     vspltisw v1,1
  18:   00 00 03 c8     lfd     f0,0(r3)
  1c:   08 00 43 c8     lfd     f2,8(r3)
  20:   e2 03 60 f0     xvcvsxwdp vs3,vs32
  24:   08 09 02 f0     xsmaddadp vs0,vs2,vs1
  28:   10 00 43 c8     lfd     f2,16(r3)
  2c:   08 01 61 f0     xsmaddadp vs3,vs1,vs0
  30:   e2 0b 00 f0     xvcvsxwdp vs0,vs33
  34:   08 19 41 f0     xsmaddadp vs2,vs1,vs3
  38:   48 01 22 f0     xsmaddmdp vs1,vs2,vs0
  3c:   20 00 80 4e     blr
  40:   00 00 00 00     .long 0x0
  44:   00 09 22 00     .long 0x220900
  48:   00 00 00 00     .long 0x0
  4c:   40 00 00 00     .long 0x40

The XLC code is more compact because the coefficients are loaded directly through an offset from a base address, whereas GCC computes each address with a separate addis. Moreover, XLC favours VMX/VSX instructions, but I think that should not change anything here. For this kind of computation (I measure the latency with an external program), XLC is roughly 30% faster, and I see similar results on other, similar test cases. Does this address computation cost extra time (I would say yes, since it introduces a data hazard in the pipeline), or does the processor use the instruction fusion described in « IBM POWER8 processor core microarchitecture » at run time and merge addis + ld into ld X(r3)?

Best,
Tim
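P.S. For reference, after the template recursion is inlined, the expression the compiler should see is the plain Horner form below (this is just my own hand expansion of the code above, with the same coefficients); its four fused multiply-adds correspond to the four fmadd / xsmaddadp instructions in both listings:

double poly_expanded(double x){
    // c0 + x*(c1 + x*(c2 + x*(c3 + x*c4))) with coeff<0>..coeff<4>
    return 1.00000 + x*(9.99999 + x*(5.00000 + x*(1.66666 + x*4.16667)));
}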
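And in case it helps to reproduce the numbers, below is only a sketch of the kind of external timing program I use, not the exact one (the file name, iteration count, starting value, and use of std::chrono are illustrative); it feeds each result back into the next call so that latency rather than throughput is measured:

#include <chrono>
#include <cstdio>

double poly(double);   // provided by libhorner_gcc.a or libhorner_xlc.a

int main(){
    const long iters = 100000000;      // illustrative iteration count
    double x = 0.5, acc = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for(long i = 0; i < iters; ++i){
        x = poly(x) * 1e-2;            // serial dependency: each call waits for the previous result
        acc += x;
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("%.2f ns/call (acc = %g)\n", ns / iters, acc);
    return 0;
}

It can be linked against each library in turn, for example g++ -std=c++11 -O2 timing.cpp -L. -lhorner_gcc, and the per-call numbers compared.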