I will post the actual patch in the next post. This part gives the justification for the patch adding vector-pair.h.
The patch, as a followup to this post, adds a new include file (vector-pair.h) so that users writing high performance libraries can change their code to allow generation of the vector pair load and store instructions on power10.  The intention is that library authors who need to write special loops over arrays can use the functions provided to take advantage of the higher bandwidth of the load vector pair and store vector pair instructions.

This particular patch just adds a new include file (vector-pair.h) that provides a set of functions which on a power10 system would use a vector pair load operation, 2 floating point operations, and a vector pair store.  It does not add any new types, modes, or built-in functions.  I have additional patches that add built-in functions which the functions in vector-pair.h could utilize so that the compiler can optimize and combine operations.  I may submit those patches in the future, but I would like to provide this patch now to allow library writers to optimize their code.

I've measured the performance of these new functions on a power10.  For default unrolling, the performance of the 3 methods relative to the normal vector loop method is:

	116%	vector-pair.h function, default unroll
	 93%	vector pair split built-in & 2 vector stores, default unroll
	 86%	vector pair split & combine built-ins, default unroll

Using explicit 2 way unrolling the numbers are:

	114%	vector-pair.h function, unroll 2
	106%	vector pair split built-in & 2 vector stores, unroll 2
	 98%	vector pair split & combine built-ins, unroll 2

The new functions provided in vector-pair.h use the vector pair load/store instructions and don't generate extra vector moves.  Using the existing vector pair disassemble and assemble built-ins generates extra vector moves, which can hinder performance.
If I compile the loop code for power9, there is a minor speed up for default unrolling, and more of an improvement using the framework provided in vector-pair.h with explicit unrolling by 2:

	101%	vector-pair.h function, default unroll for power9
	107%	vector-pair.h function, unroll 2 for power9

Of course, this is a synthetic benchmark run on a quiet power10 system.  Results would vary for real code on real systems.  However, I feel adding these functions can allow the writers of high performance libraries to better optimize their code.

As an example, if the library wants to code a simple fused multiply-add loop, they might write the code as follows:

#include <altivec.h>
#include <math.h>
#include <stddef.h>

void
fma_vector (double * __restrict__ r,
	    const double * __restrict__ a,
	    const double * __restrict__ b,
	    size_t n)
{
  vector double * __restrict__ vr = (vector double * __restrict__)r;
  const vector double * __restrict__ va = (const vector double * __restrict__)a;
  const vector double * __restrict__ vb = (const vector double * __restrict__)b;
  size_t num_elements = sizeof (vector double) / sizeof (double);
  size_t nv = n / num_elements;
  size_t i;

  for (i = 0; i < nv; i++)
    vr[i] = __builtin_vsx_xvmadddp (va[i], vb[i], vr[i]);

  for (i = nv * num_elements; i < n; i++)
    r[i] = fma (a[i], b[i], r[i]);
}

The inner loop would look like:

.L3:
	lxvx 0,3,9
	lxvx 12,4,9
	addi 10,9,16
	addi 2,2,-2
	lxvx 11,5,9
	xvmaddadp 0,12,11
	lxvx 12,4,10
	lxvx 11,5,10
	stxvx 0,3,9
	lxvx 0,3,10
	addi 9,9,32
	xvmaddadp 0,12,11
	stxvx 0,3,10
	bdnz .L3

Now if you code the loop to use __builtin_vsx_disassemble_pair to do a vector pair load, but then do 2 vector stores:

#include <altivec.h>
#include <math.h>
#include <stddef.h>

void
fma_mma_ld (double * __restrict__ r,
	    const double * __restrict__ a,
	    const double * __restrict__ b,
	    size_t n)
{
  __vector_pair * __restrict__ vp_r = (__vector_pair * __restrict__)r;
  const __vector_pair * __restrict__ vp_a = (const __vector_pair * __restrict__)a;
  const __vector_pair * __restrict__ vp_b = (const __vector_pair * __restrict__)b;
  vector double * __restrict__ v_r = (vector double * __restrict__)r;
  size_t num_elements = (sizeof (__vector_pair) / sizeof (double));
  size_t n_vp = n / num_elements;
  size_t i, j;
  vector double a_hi_lo[2];
  vector double b_hi_lo[2];
  vector double r_hi_lo[2];
  vector double result_hi, result_lo;

  j = 0;
  for (i = 0; i < n_vp; i++)
    {
      __builtin_vsx_disassemble_pair (&a_hi_lo[0], &vp_a[i]);
      __builtin_vsx_disassemble_pair (&b_hi_lo[0], &vp_b[i]);
      __builtin_vsx_disassemble_pair (&r_hi_lo[0], &vp_r[i]);
      result_hi = __builtin_vsx_xvmadddp (a_hi_lo[0], b_hi_lo[0], r_hi_lo[0]);
      result_lo = __builtin_vsx_xvmadddp (a_hi_lo[1], b_hi_lo[1], r_hi_lo[1]);
      v_r[j+0] = result_hi;
      v_r[j+1] = result_lo;
      j += 2;
    }

  for (i = n_vp * num_elements; i < n; i++)
    r[i] = fma (a[i], b[i], r[i]);
}

And the inner loop would look like:

.L72:
	lxvpx 10,4,2
	lxvpx 0,5,2
	lxvpx 12,3,2
	xxlor 8,11,11
	xxlor 11,1,1
	xvmaddmdp 0,10,12
	xvmaddmdp 11,8,13
	stxvx 11,3,2
	stxvx 0,9,2
	addi 2,2,32
	bdnz .L72

I.e. it does 3 vector pair loads, but it adds 2 extra vector moves in the loop.  Also, normal unrolling does not unroll this loop.  But you can use #pragma GCC unroll 2 to explicitly unroll the loop, and it would generate:

.L97:
	lxvpx 6,3,2
	addi 9,2,32
	lxvpx 12,4,2
	lxvpx 4,5,2
	lxvpx 8,5,9
	lxvpx 10,3,9
	lxvpx 0,4,9
	xxlor 32,13,13
	xxlor 13,7,7
	xvmaddmdp 12,4,6
	xxlor 7,9,9
	xxlor 9,13,13
	xvmaddmdp 0,8,10
	xvmaddadp 9,5,32
	xvmaddadp 11,7,1
	stxvx 9,3,2
	stxvx 12,10,2
	addi 2,2,64
	stxvx 11,3,9
	stxvx 0,10,9
	bdnz .L97

I.e. it now adds 4 extra vector moves instead of 2.

If you try to do vector pair loads, split the vector pairs into separate vectors, do the fma, and then combine the two vector results back into a vector pair, the code might look like:

#include <altivec.h>
#include <math.h>
#include <stddef.h>

void
fma_mma_ld_st (double * __restrict__ r,
	       const double * __restrict__ a,
	       const double * __restrict__ b,
	       size_t n)
{
  __vector_pair * __restrict__ vp_r = (__vector_pair * __restrict__)r;
  const __vector_pair * __restrict__ vp_a = (const __vector_pair * __restrict__)a;
  const __vector_pair * __restrict__ vp_b = (const __vector_pair * __restrict__)b;
  size_t num_elements = (sizeof (__vector_pair) / sizeof (double));
  size_t n_vp = n / num_elements;
  size_t i;
  union vec_alias
  {
    vector double vd;
    vector unsigned char vuc;
  };
  vector double a_hi_lo[2];
  vector double b_hi_lo[2];
  vector double r_hi_lo[2];
  union vec_alias result_hi, result_lo;

  for (i = 0; i < n_vp; i++)
    {
      __builtin_vsx_disassemble_pair (&a_hi_lo[0], &vp_a[i]);
      __builtin_vsx_disassemble_pair (&b_hi_lo[0], &vp_b[i]);
      __builtin_vsx_disassemble_pair (&r_hi_lo[0], &vp_r[i]);
      result_hi.vd = __builtin_vsx_xvmadddp (a_hi_lo[0], b_hi_lo[0], r_hi_lo[0]);
      result_lo.vd = __builtin_vsx_xvmadddp (a_hi_lo[1], b_hi_lo[1], r_hi_lo[1]);
      __builtin_vsx_build_pair (&vp_r[i], result_hi.vuc, result_lo.vuc);
    }

  for (i = n_vp * num_elements; i < n; i++)
    r[i] = fma (a[i], b[i], r[i]);
}

The inner loop would look like:

.L128:
	lxvpx 10,4,2
	lxvpx 0,5,2
	lxvpx 12,3,2
	xxlor 9,10,10
	xxlor 10,11,11
	xxlor 11,1,1
	xvmaddmdp 0,9,12
	xvmaddmdp 11,10,13
	xxlor 12,0,0
	xxlor 13,11,11
	stxvpx 12,3,2
	addi 2,2,32
	bdnz .L128

I.e. there are now 3 extra vector moves after the load vector pair instructions, and 2 vector moves to combine the vectors back into a vector pair.
If you use an explicit #pragma GCC unroll 2, the code generated would be:

.L153:
	lxvpx 10,3,2
	addi 9,2,32
	lxvpx 6,4,2
	lxvpx 8,5,2
	lxvpx 12,5,9
	lxvpx 0,4,9
	xxlor 3,11,11
	xxlor 5,6,6
	xxlor 6,7,7
	xxlor 7,9,9
	xxlor 11,12,12
	xxlor 12,3,3
	xvmaddadp 10,5,8
	xxlor 9,13,13
	xvmaddadp 12,7,6
	xxlor 6,10,10
	xxlor 7,12,12
	stxvpx 6,3,2
	addi 2,2,64
	lxvpx 12,3,9
	xxlor 10,12,12
	xxlor 12,13,13
	xvmaddmdp 0,11,10
	xvmaddadp 12,9,1
	xxlor 10,0,0
	xxlor 11,12,12
	stxvpx 10,3,9
	bdnz .L153

Finally, if you recode the loop to use the vpair_f64_fma function in this patch, the code would look like:

#include <altivec.h>
#include <math.h>
#include <vector-pair.h>
#include <stddef.h>

void
fma_vpair (double * __restrict__ r,
	   const double * __restrict__ a,
	   const double * __restrict__ b,
	   size_t n)
{
  vector_pair_f64_t * __restrict__ vp_r = (vector_pair_f64_t * __restrict__)r;
  const vector_pair_f64_t * __restrict__ vp_a = (const vector_pair_f64_t * __restrict__)a;
  const vector_pair_f64_t * __restrict__ vp_b = (const vector_pair_f64_t * __restrict__)b;
  size_t num_elements = (sizeof (vector_pair_f64_t) / sizeof (double));
  size_t n_vp = n / num_elements;
  size_t i;

  for (i = 0; i < n_vp; i++)
    vpair_f64_fma (&vp_r[i], &vp_a[i], &vp_b[i], &vp_r[i]);

  for (i = n_vp * num_elements; i < n; i++)
    r[i] = fma (a[i], b[i], r[i]);
}

The inner loop would generate:

.L184:
	addi 9,2,32
	lxvpx 0,3,2
	lxvpx 8,4,2
	lxvpx 6,5,2
	lxvpx 12,4,9
	lxvpx 10,5,9
#APP
 # 437 "./include/vector-pair.h" 1
	xvmaddadp 0,8,6
	xvmaddadp 0+1,8+1,6+1
 # 0 "" 2
#NO_APP
	stxvpx 0,3,2
	addi 2,2,64
	lxvpx 0,3,9
#APP
 # 437 "./include/vector-pair.h" 1
	xvmaddadp 0,12,10
	xvmaddadp 0+1,12+1,10+1
 # 0 "" 2
#NO_APP
	stxvpx 0,3,9
	bdnz .L184

I.e. there are no extra vector moves in this loop, and normal unrolling does duplicate this loop.

The vector-pair.h include file provides support if the code is compiled on previous VSX systems that don't have the vector pair load/store instructions.
This allows the library writer to use the same code on both power9 and power10 systems, without having to use #ifdef operations.  On a power9, the code generated would be:

.L66:
	lxvx 0,3,9
	lxvx 12,4,9
	lxvx 11,5,9
	xvmaddadp 0,12,11
	lxvx 12,7,9
	lxvx 11,8,9
	stxvx 0,3,9
	lxvx 0,10,9
	xvmaddadp 0,12,11
	stxvx 0,10,9
	addi 9,9,32
	bdnz .L66

With an explicit #pragma GCC unroll 2, the code generated would be:

.L93:
	lxvx 0,3,9
	lxvx 12,4,9
	addi 10,9,32
	lxvx 11,5,9
	xvmaddadp 0,12,11
	lxvx 12,7,9
	lxvx 11,11,9
	stxvx 0,3,9
	lxvx 0,8,9
	xvmaddadp 0,12,11
	lxvx 12,4,10
	lxvx 11,5,10
	stxvx 0,8,9
	addi 9,9,64
	lxvx 0,3,10
	xvmaddadp 0,12,11
	lxvx 12,7,10
	lxvx 11,11,10
	stxvx 0,3,10
	lxvx 0,8,10
	xvmaddadp 0,12,11
	stxvx 0,8,10
	bdnz .L93

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com