I will post the actual patch in the next post.  This part gives the
justification for the patch adding vector-pair.h.

The follow-up patch adds a new include file (vector-pair.h) so that users
writing high performance libraries can change their code to allow the
generation of the vector pair load and store instructions on power10.

The intention is that library authors who need to write special loops that go
over arrays can modify those loops to use the functions provided here, and
so take advantage of the higher bandwidth of the load vector pair and store
vector pair instructions.

This particular patch just adds a new include file (vector-pair.h) that
provides a bunch of functions that on a power10 system would use the vector
pair load operation, 2 floating point operations, and a vector pair store.  It
does not add any new types, modes, or built-in functions.

I have additional patches that can add built-in functions that the functions in
vector-pair.h could utilize so that the compiler can optimize and combine
operations.  I may submit those patches in the future, but I would like to
provide this patch to allow the library writer to optimize their code.

I've measured the performance of these new functions on a power10.  The numbers
below give the performance of each of the 3 methods relative to the normal
vector loop method, where more than 100% is a speed up.  For default unrolling:

        116%    Vector-pair.h function, default unroll
         93%    Vector pair split built-in & 2 vector stores, default unroll
         86%    Vector pair split & combine built-ins, default unroll

Using explicit 2 way unrolling the numbers are:

        114%    Vector-pair.h function, unroll 2
        106%    Vector pair split built-in & 2 vector stores, unroll 2
         98%    Vector pair split & combine built-ins, unroll 2

These new functions provided in vector-pair.h use the vector pair load/store
instructions, and don't generate extra vector moves.  Using the existing
vector pair disassemble and assemble built-ins generates extra vector moves,
which can hinder performance.

If I compile the loop code for power9, there is a minor speed up for default
unrolling, and more of an improvement when the framework provided in
vector-pair.h is combined with explicit unrolling by 2:

        101%    Vector-pair.h function, default unroll for power9
        107%    Vector-pair.h function, unroll 2 for power9

Of course this is a synthetic benchmark run on a quiet power10 system.  Results
would vary for real code on real systems.  However, I feel adding these
functions can allow the writers of high performance libraries to better
optimize their code.
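
The benchmark source is not part of this post.  As a rough illustration of how
percentages like the ones above could be produced, here is a minimal sketch.
It assumes the figure is the baseline time divided by the candidate time (so
that more than 100% is a speed up), and the names kernel_fn, time_kernel and
relative_percent are hypothetical, not something from the patch:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature matching the fma loops in this post.  */
typedef void (*kernel_fn) (double *, const double *, const double *, size_t);

/* Time REPS calls of KERNEL using the C library's CPU clock and return
   the elapsed CPU time in seconds.  */
double
time_kernel (kernel_fn kernel, double *r, const double *a, const double *b,
             size_t n, unsigned reps)
{
  clock_t start = clock ();
  unsigned k;

  for (k = 0; k < reps; k++)
    kernel (r, a, b, n);

  return (double) (clock () - start) / CLOCKS_PER_SEC;
}

/* Assumed convention: 100% means equal time, and more than 100% means
   the candidate is faster than the baseline.  */
double
relative_percent (double baseline_seconds, double candidate_seconds)
{
  return 100.0 * baseline_seconds / candidate_seconds;
}
```

Under this convention a candidate that takes half the baseline time would be
reported as 200%.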

As an example, if the library wants to code a simple fused multiply-add loop,
they might write the code as follows:

        #include <altivec.h>
        #include <math.h>
        #include <stddef.h>

        void
        fma_vector (double * __restrict__ r,
                    const double * __restrict__ a,
                    const double * __restrict__ b,
                    size_t n)
        {
          vector double * __restrict__ vr = (vector double * __restrict__)r;
          const vector double * __restrict__ va
            = (const vector double * __restrict__)a;
          const vector double * __restrict__ vb
            = (const vector double * __restrict__)b;
          size_t num_elements = sizeof (vector double) / sizeof (double);
          size_t nv = n / num_elements;
          size_t i;

          for (i = 0; i < nv; i++)
            vr[i] = __builtin_vsx_xvmadddp (va[i], vb[i], vr[i]);

          for (i = nv * num_elements; i < n; i++)
            r[i] = fma (a[i], b[i], r[i]);
        }

The inner loop would look like:

        .L3:
                lxvx 0,3,9
                lxvx 12,4,9
                addi 10,9,16
                addi 2,2,-2
                lxvx 11,5,9
                xvmaddadp 0,12,11
                lxvx 12,4,10
                lxvx 11,5,10
                stxvx 0,3,9
                lxvx 0,3,10
                addi 9,9,32
                xvmaddadp 0,12,11
                stxvx 0,3,10
                bdnz .L3

Now suppose you code the loop to use __builtin_vsx_disassemble_pair to do a
vector pair load, but then do 2 vector stores:


        #include <altivec.h>
        #include <math.h>
        #include <stddef.h>

        void
        fma_mma_ld (double * __restrict__ r,
                    const double * __restrict__ a,
                    const double * __restrict__ b,
                    size_t n)
        {
          __vector_pair * __restrict__ vp_r = (__vector_pair * __restrict__)r;
          const __vector_pair * __restrict__ vp_a
            = (const __vector_pair * __restrict__)a;
          const __vector_pair * __restrict__ vp_b
            = (const __vector_pair * __restrict__)b;
          vector double * __restrict__ v_r = (vector double * __restrict__)r;
          size_t num_elements = (sizeof (__vector_pair) / sizeof (double));
          size_t n_vp = n / num_elements;
          size_t i, j;
          vector double a_hi_lo[2];
          vector double b_hi_lo[2];
          vector double r_hi_lo[2];
          vector double result_hi, result_lo;

          j = 0;
          for (i = 0; i < n_vp; i++)
            {
              __builtin_vsx_disassemble_pair (&a_hi_lo[0], &vp_a[i]);
              __builtin_vsx_disassemble_pair (&b_hi_lo[0], &vp_b[i]);
              __builtin_vsx_disassemble_pair (&r_hi_lo[0], &vp_r[i]);

              result_hi = __builtin_vsx_xvmadddp (a_hi_lo[0], b_hi_lo[0],
                                                  r_hi_lo[0]);
              result_lo = __builtin_vsx_xvmadddp (a_hi_lo[1], b_hi_lo[1],
                                                  r_hi_lo[1]);

              v_r[ j+0 ] = result_hi;
              v_r[ j+1 ] = result_lo;
              j += 2;
            }

          for (i = n_vp * num_elements; i < n; i++)
            r[i] = fma (a[i], b[i], r[i]);
        }

And the inner loop would look like:

        .L72:
                lxvpx 10,4,2
                lxvpx 0,5,2
                lxvpx 12,3,2
                xxlor 8,11,11
                xxlor 11,1,1
                xvmaddmdp 0,10,12
                xvmaddmdp 11,8,13
                stxvx 11,3,2
                stxvx 0,9,2
                addi 2,2,32
                bdnz .L72

I.e. it does 3 vector pair loads, but it adds 2 extra vector moves in the loop.
Also, normal unrolling does not unroll this loop.  But you can use #pragma GCC
unroll 2 to explicitly unroll the loop, and it would generate:

        .L97:
                lxvpx 6,3,2
                addi 9,2,32
                lxvpx 12,4,2
                lxvpx 4,5,2
                lxvpx 8,5,9
                lxvpx 10,3,9
                lxvpx 0,4,9
                xxlor 32,13,13
                xxlor 13,7,7
                xvmaddmdp 12,4,6
                xxlor 7,9,9
                xxlor 9,13,13
                xvmaddmdp 0,8,10
                xvmaddadp 9,5,32
                xvmaddadp 11,7,1
                stxvx 9,3,2
                stxvx 12,10,2
                addi 2,2,64
                stxvx 11,3,9
                stxvx 0,10,9
                bdnz .L97

I.e. it now adds 4 extra vector moves instead of 2.
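
For readers who have not used it, the pragma goes directly in front of the loop
it applies to.  The following is a minimal sketch using a plain scalar
stand-in loop (madd_unroll2 is a hypothetical name, and this is not one of the
loops measured above):

```c
#include <stddef.h>

/* Hypothetical stand-in kernel: a plain scalar multiply-add, used only to
   show where the pragma goes.  */
void
madd_unroll2 (double * __restrict__ r,
              const double * __restrict__ a,
              const double * __restrict__ b,
              size_t n)
{
  size_t i;

  /* The pragma applies to the loop statement that immediately follows it
     and asks GCC to unroll that loop by a factor of 2.  */
#pragma GCC unroll 2
  for (i = 0; i < n; i++)
    r[i] = a[i] * b[i] + r[i];
}
```

The pragma only changes how the loop is compiled; the values computed are the
same with or without it.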

If you try to do vector pair loads, split the vector pairs into separate
vectors, do the fma, and then combine the two vector results back into a
vector pair, the code might look like:

        #include <altivec.h>
        #include <math.h>
        #include <stddef.h>

        void
        fma_mma_ld_st (double * __restrict__ r,
                       const double * __restrict__ a,
                       const double * __restrict__ b,
                       size_t n)
        {
          __vector_pair * __restrict__ vp_r = (__vector_pair * __restrict__)r;
          const __vector_pair * __restrict__ vp_a
            = (const __vector_pair * __restrict__)a;
          const __vector_pair * __restrict__ vp_b
            = (const __vector_pair * __restrict__)b;
          size_t num_elements = (sizeof (__vector_pair) / sizeof (double));
          size_t n_vp = n / num_elements;
          size_t i;
          union vec_alias {
            vector double                vd;
            vector unsigned char         vuc;
          };
          vector double a_hi_lo[2];
          vector double b_hi_lo[2];
          vector double r_hi_lo[2];
          union vec_alias result_hi, result_lo;

          for (i = 0; i < n_vp; i++)
            {
              __builtin_vsx_disassemble_pair (&a_hi_lo[0], &vp_a[i]);
              __builtin_vsx_disassemble_pair (&b_hi_lo[0], &vp_b[i]);
              __builtin_vsx_disassemble_pair (&r_hi_lo[0], &vp_r[i]);

              result_hi.vd = __builtin_vsx_xvmadddp (a_hi_lo[0], b_hi_lo[0],
                                                     r_hi_lo[0]);
              result_lo.vd = __builtin_vsx_xvmadddp (a_hi_lo[1], b_hi_lo[1],
                                                     r_hi_lo[1]);

              __builtin_vsx_build_pair (&vp_r[i], result_hi.vuc, result_lo.vuc);
            }

          for (i = n_vp * num_elements; i < n; i++)
            r[i] = fma (a[i], b[i], r[i]);
        }

The inner loop would look like:

        .L128:
                lxvpx 10,4,2
                lxvpx 0,5,2
                lxvpx 12,3,2
                xxlor 9,10,10
                xxlor 10,11,11
                xxlor 11,1,1
                xvmaddmdp 0,9,12
                xvmaddmdp 11,10,13
                xxlor 12,0,0
                xxlor 13,11,11
                stxvpx 12,3,2
                addi 2,2,32
                bdnz .L128

I.e. there are now 3 extra vector moves after the load vector pair
instructions, and 2 vector moves to combine the vectors back into a vector
pair.

If you use an explicit #pragma GCC unroll 2, the code generated would be:

        .L153:
                lxvpx 10,3,2
                addi 9,2,32
                lxvpx 6,4,2
                lxvpx 8,5,2
                lxvpx 12,5,9
                lxvpx 0,4,9
                xxlor 3,11,11
                xxlor 5,6,6
                xxlor 6,7,7
                xxlor 7,9,9
                xxlor 11,12,12
                xxlor 12,3,3
                xvmaddadp 10,5,8
                xxlor 9,13,13
                xvmaddadp 12,7,6
                xxlor 6,10,10
                xxlor 7,12,12
                stxvpx 6,3,2
                addi 2,2,64
                lxvpx 12,3,9
                xxlor 10,12,12
                xxlor 12,13,13
                xvmaddmdp 0,11,10
                xvmaddadp 12,9,1
                xxlor 10,0,0
                xxlor 11,12,12
                stxvpx 10,3,9
                bdnz .L153

Finally, if you recode the loop to use the vpair_f64_fma function in this
patch, the code would look like:

        #include <altivec.h>
        #include <math.h>
        #include <vector-pair.h>
        #include <stddef.h>

        void
        fma_vpair (double * __restrict__ r,
                   const double * __restrict__ a,
                   const double * __restrict__ b,
                   size_t n)
        {
          vector_pair_f64_t * __restrict__ vp_r
            = (vector_pair_f64_t * __restrict__)r;
          const vector_pair_f64_t * __restrict__ vp_a
            = (const vector_pair_f64_t * __restrict__)a;
          const vector_pair_f64_t * __restrict__ vp_b
            = (const vector_pair_f64_t * __restrict__)b;
          size_t num_elements = (sizeof (vector_pair_f64_t) / sizeof (double));
          size_t n_vp = n / num_elements;
          size_t i;

          for (i = 0; i < n_vp; i++)
            vpair_f64_fma (&vp_r[i], &vp_a[i], &vp_b[i], &vp_r[i]);

          for (i = n_vp * num_elements; i < n; i++)
            r[i] = fma (a[i], b[i], r[i]);
        }

The inner loop would generate:

        .L184:
                addi 9,2,32
                lxvpx 0,3,2
                lxvpx 8,4,2
                lxvpx 6,5,2
                lxvpx 12,4,9
                lxvpx 10,5,9
        #APP
         # 437 "./include/vector-pair.h" 1
                xvmaddadp 0,8,6
                xvmaddadp 0+1,8+1,6+1
         # 0 "" 2
        #NO_APP
                stxvpx 0,3,2
                addi 2,2,64
                lxvpx 0,3,9
        #APP
         # 437 "./include/vector-pair.h" 1
                xvmaddadp 0,12,10
                xvmaddadp 0+1,12+1,10+1
         # 0 "" 2
        #NO_APP
                stxvpx 0,3,9
                bdnz .L184

I.e. there are no extra vector moves in this loop, and default unrolling does
unroll this loop.

The vector-pair.h include file also works when the code is compiled for
previous VSX systems that don't have the vector pair load/store instructions.
This allows the library writer to use the same code on both power9 and power10
systems, without having to use #ifdef.  On a power9, the code generated would
be:

        .L66:
                lxvx 0,3,9
                lxvx 12,4,9
                lxvx 11,5,9
                xvmaddadp 0,12,11
                lxvx 12,7,9
                lxvx 11,8,9
                stxvx 0,3,9
                lxvx 0,10,9
                xvmaddadp 0,12,11
                stxvx 0,10,9
                addi 9,9,32
                bdnz .L66

With an explicit #pragma GCC unroll 2, the code generated would be:

        .L93:
                lxvx 0,3,9
                lxvx 12,4,9
                addi 10,9,32
                lxvx 11,5,9
                xvmaddadp 0,12,11
                lxvx 12,7,9
                lxvx 11,11,9
                stxvx 0,3,9
                lxvx 0,8,9
                xvmaddadp 0,12,11
                lxvx 12,4,10
                lxvx 11,5,10
                stxvx 0,8,9
                addi 9,9,64
                lxvx 0,3,10
                xvmaddadp 0,12,11
                lxvx 12,7,10
                lxvx 11,11,10
                stxvx 0,3,10
                lxvx 0,8,10
                xvmaddadp 0,12,11
                stxvx 0,8,10
                bdnz .L93
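
To make the "same source on both systems" idea concrete, here is a minimal
portable sketch of a pair-sized operation with a plain C body, driven by a loop
of the same shape as fma_vpair.  Everything named sketch_* and madd_sketch is
hypothetical; it is not the contents of vector-pair.h, and it uses a separate
multiply and add rather than a guaranteed fused operation:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only; the types and functions in the
   real vector-pair.h may be defined differently.  */
#define SKETCH_PAIR_ELEMENTS 4          /* 32 bytes of doubles */

typedef struct { double d[SKETCH_PAIR_ELEMENTS]; } sketch_pair_f64_t;

/* Element-wise r = (a * b) + c.  The multiply and add are written
   separately here, so this is not guaranteed to be a fused operation.  */
static inline void
sketch_pair_f64_madd (sketch_pair_f64_t *r,
                      const sketch_pair_f64_t *a,
                      const sketch_pair_f64_t *b,
                      const sketch_pair_f64_t *c)
{
  size_t k;

  for (k = 0; k < SKETCH_PAIR_ELEMENTS; k++)
    r->d[k] = a->d[k] * b->d[k] + c->d[k];
}

/* Same shape as fma_vpair: whole pairs first, then a scalar tail.  */
void
madd_sketch (double *r, const double *a, const double *b, size_t n)
{
  size_t n_vp = n / SKETCH_PAIR_ELEMENTS;
  size_t i;

  for (i = 0; i < n_vp; i++)
    {
      sketch_pair_f64_t pa, pb, pr;

      /* memcpy keeps this portable sketch free of pointer casts.  */
      memcpy (&pa, a + i * SKETCH_PAIR_ELEMENTS, sizeof (pa));
      memcpy (&pb, b + i * SKETCH_PAIR_ELEMENTS, sizeof (pb));
      memcpy (&pr, r + i * SKETCH_PAIR_ELEMENTS, sizeof (pr));
      sketch_pair_f64_madd (&pr, &pa, &pb, &pr);
      memcpy (r + i * SKETCH_PAIR_ELEMENTS, &pr, sizeof (pr));
    }

  for (i = n_vp * SKETCH_PAIR_ELEMENTS; i < n; i++)
    r[i] = a[i] * b[i] + r[i];
}
```

With n = 6, the first 4 elements go through the pair-sized function and the
last 2 go through the scalar tail loop.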

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com
