On Mon, Nov 02, 2015 at 02:59:37PM +0000, Ewart Timothée wrote:
> Hello All,
>
> I have a question about performance on Power8 (little-endian, GCC 4.9.1),
> especially with load/store.
> I am evaluating all the ways to evaluate a polynomial; to simplify the
> thread, I provide a basic example of meta-programming for polynomial
> evaluation with Horner's method. I have:
...
> The code from XLC is more compact because of the direct load (with the
> offset of the address), contrary to GCC, where the address is computed
> with addis. Moreover, XLC favors VMX computation, but I think that should
> change nothing. For this kind of computation (I measure the latency with
> an external program), XLC is faster by +- 30% on other (similar) test
> cases.
>
> Does this address computation cost extra time (I would say yes, and it
> introduces a data hazard in the pipeline), or does it use the instruction
> fusion process described in « IBM POWER8 processor core microarchitecture »
> at run time, and so merge addis + ld into ld X(r3)?

The power8 machine fusion does not cover fusing addis with the lfd/lfs
instructions. Presently, it has 2 cases where instructions are fused:

   1) If you have an addis instruction that sets a register, followed by a
      zero-extending load to the same general purpose register;

   2) If you have an or immediate instruction that loads up an integer
      constant, followed by a vector load instruction that uses the constant
      as an index register.

Future machines may expand upon the list of fusable instructions.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
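
For readers following the thread without the elided code: below is a minimal
sketch of the kind of template-based Horner evaluation the question describes.
It is not the original poster's code. The names kCoeffs, kDegree, Horner and
poly, and the coefficient values, are all assumptions made for illustration.
Each step of the recursion reads one double-precision coefficient, and loads
of that kind are where GCC may emit an addis followed by lfd on 64-bit
PowerPC, depending on compiler version and code model.

```cpp
#include <cstddef>

// Hypothetical coefficient table (not from the original post):
// p(x) = 1 + 2x + 3x^2 + 4x^3
constexpr double kCoeffs[] = {1.0, 2.0, 3.0, 4.0};
constexpr std::size_t kDegree = sizeof(kCoeffs) / sizeof(kCoeffs[0]) - 1;

// Recursive case of Horner's rule:
//   Horner<I>(x) = c[I] + x * Horner<I + 1>(x)
template <std::size_t I>
struct Horner {
    static constexpr double eval(double x) {
        return kCoeffs[I] + x * Horner<I + 1>::eval(x);
    }
};

// Base case: the leading coefficient ends the recursion.
template <>
struct Horner<kDegree> {
    static constexpr double eval(double) { return kCoeffs[kDegree]; }
};

// Entry point: the recursion is unrolled by the compiler at instantiation.
constexpr double poly(double x) { return Horner<0>::eval(x); }

// Compile-time check: 1 + 2*2 + 3*4 + 4*8 = 49
static_assert(poly(2.0) == 49.0, "Horner evaluation mismatch");
```

Compiling a caller of poly() with a run-time argument and inspecting the
assembly (for example with -O2 -S) is one way to see which load sequence a
given compiler chooses for the coefficients.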