Hello All,

I have a question about performance on POWER8 (little-endian, GCC 4.9.1), 
especially regarding load/store.
I am evaluating all the possible ways to evaluate a polynomial; to 
simplify the thread I provide a basic example of metaprogramming for 
polynomial evaluation with Horner's method. I have:

// Compile-time coefficient table: coeff<n>::coefficient() returns the
// coefficient of x^n.
template<int n>
struct coeff{ };

template<>
struct coeff<0>{
    const static inline double coefficient() {return 1.00000;}
};

template<>
struct coeff<1>{
    const static inline double coefficient() {return 9.99999;}
};

template<>
struct coeff<2>{
    const static inline double coefficient() {return 5.00000;}
};

template<>
struct coeff<3>{
    const static inline double coefficient() {return 1.66666;}
};

template<>
struct coeff<4>{
    const static inline double coefficient() {return 4.16667;}
};

// Recursive Horner step: h_n(x) = C<4-n> + x * h_{n-1}(x), unrolled
// entirely at compile time (the degree, 4, is hard-coded in C<4-n>).
template< template<int n> class C, int n>
struct horner_helper{
    static const inline double h(double x){
        return C<4-n>::coefficient() + horner_helper<C,n-1>::h(x)*x;
    }
};

// The recursion terminates with the highest-order coefficient.
template<template<int n> class C>
struct horner_helper<C,0>{
    static const inline double h(double){
        return C<4>::coefficient();
    }
};

inline double horner(double x){
    return horner_helper<coeff,4>::h(x);
}

// Non-inline entry point, so the fully inlined code can be inspected in
// the object file.
double poly(double x){
    double y = horner(x);
    return y;
}
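
For reference, after full inlining the recursion collapses to the plain 
Horner form below (a hand-written equivalent I give only for illustration; 
it should map to the chain of four dependent fmadds visible in the GCC 
output):

double poly_ref(double x){
    // same as poly(): c0 + x*(c1 + x*(c2 + x*(c3 + x*c4)))
    return 1.00000 + x*(9.99999 + x*(5.00000 + x*(1.66666 + x*4.16667)));
}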

If I look at the assembly generated by GCC compared with XLC, I get:

MAKEFILE

all:
        make gcc xlc

gcc:
        gcc -O3 -c horner.cpp -o horner.o
        ar rcs libhorner_gcc.a horner.o
        rm horner.o
xlc:
        xlc -O3 -c horner.cpp -o horner.o
        ar rcs libhorner_xlc.a horner.o
        rm horner.o
clean:
        rm -f horner.o libhorner_gcc.a libhorner_xlc.a

GCC 

0000000000000000 <_Z4polyd>:
   0:   00 00 4c 3c     addis   r2,r12,0
   4:   00 00 42 38     addi    r2,r2,0
   8:   00 00 22 3d     addis   r9,r2,0
   c:   00 00 89 c9     lfd     f12,0(r9)
  10:   00 00 22 3d     addis   r9,r2,0
  14:   00 00 29 c9     lfd     f9,0(r9)
  18:   00 00 22 3d     addis   r9,r2,0
  1c:   00 00 09 c0     lfs     f0,0(r9)
  20:   00 00 22 3d     addis   r9,r2,0
  24:   00 00 49 c9     lfd     f10,0(r9)
  28:   00 00 22 3d     addis   r9,r2,0
  2c:   3a 4b 81 fd     fmadd   f12,f1,f12,f9
  30:   00 00 69 c1     lfs     f11,0(r9)
  34:   3a 03 01 fc     fmadd   f0,f1,f12,f0
  38:   3a 50 01 fc     fmadd   f0,f1,f0,f10
  3c:   3a 58 21 fc     fmadd   f1,f1,f0,f11
  40:   20 00 80 4e     blr
  44:   00 00 00 00     .long 0x0
  48:   00 09 00 00     .long 0x900
  4c:   00 00 00 00     .long 0x0

XLC 

   0:   00 00 4c 3c     addis   r2,r12,0
   4:   00 00 42 38     addi    r2,r2,0
   8:   00 00 62 3c     addis   r3,r2,0
   c:   00 00 63 e8     ld      r3,0(r3)
  10:   8c 03 05 10     vspltisw v0,5
  14:   8c 03 21 10     vspltisw v1,1
  18:   00 00 03 c8     lfd     f0,0(r3)
  1c:   08 00 43 c8     lfd     f2,8(r3)
  20:   e2 03 60 f0     xvcvsxwdp vs3,vs32
  24:   08 09 02 f0     xsmaddadp vs0,vs2,vs1
  28:   10 00 43 c8     lfd     f2,16(r3)
  2c:   08 01 61 f0     xsmaddadp vs3,vs1,vs0
  30:   e2 0b 00 f0     xvcvsxwdp vs0,vs33
  34:   08 19 41 f0     xsmaddadp vs2,vs1,vs3
  38:   48 01 22 f0     xsmaddmdp vs1,vs2,vs0
  3c:   20 00 80 4e     blr
  40:   00 00 00 00     .long 0x0
  44:   00 09 22 00     .long 0x220900
  48:   00 00 00 00     .long 0x0
  4c:   40 00 00 00     .long 0x40


The XLC code is more compact thanks to direct loads (the offset is folded 
into the load instruction), whereas GCC first computes the address with an 
addis. Moreover, XLC favors VMX/VSX instructions for part of the 
computation, but I think that should change nothing here. For this kind of 
computation (I measure the latency with an external program), XLC is 
roughly 30% faster on other, similar test cases.
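
To give an idea of the measurement, here is a simplified sketch of such an 
external driver (the file names, iteration count and reset threshold are 
illustrative, not the exact program I use):

#include <chrono>
#include <cstdio>

double poly(double);  // from libhorner_gcc.a or libhorner_xlc.a

int main(){
    const long iters = 100000000L;
    double x = 1.000001;
    auto t0 = std::chrono::steady_clock::now();
    for(long i = 0; i < iters; ++i){
        x = poly(x);                 // output feeds input: this measures
                                     // latency, not throughput
        if(x > 1e30) x = 1.000001;   // keep the value finite; the branch
                                     // is cheap and well predicted
    }
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double> dt = t1 - t0;
    std::printf("%.3f ns per call (x=%g)\n", 1e9 * dt.count() / iters, x);
    return 0;
}

// build, e.g.: g++ -std=c++11 -O2 timing.cpp -L. -lhorner_gcc -o time_gcc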

Does this address computation cost extra time (I would say yes, since it 
introduces a data hazard in the pipeline), or does the processor apply the 
instruction-fusion mechanism described in "IBM POWER8 processor core 
microarchitecture" at run time and merge addis + ld into a single 
ld X(r3)?

Best,

Tim
