How to extend SLP to support this case

Kewen.Lin Mon, 09 Mar 2020 23:53:17 -0700

Hi all,

I'm investigating whether GCC can vectorize the below case on ppc64le.


  extern void test(unsigned int t[4][4]);

  void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
  {
    unsigned int tmp[4][4];
    unsigned int a0, a1, a2, a3;

    for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
      a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
      a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
      a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
      a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

      int t0 = a0 + a1;
      int t1 = a0 - a1;
      int t2 = a2 + a3;
      int t3 = a2 - a3;

      tmp[i][0] = t0 + t2;
      tmp[i][2] = t0 - t2;
      tmp[i][1] = t1 + t3;
      tmp[i][3] = t1 - t3;
    }
    test(tmp);
  }

With unlimited costs, I saw loop aware SLP can vectorize it but with very
inefficient codes.  It builds the SLP instance from store group {tmp[i][0]
tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0}, 
{a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands
for tmp* and t*.  It means it's unable to make the isomorphic group for
a0, a1, a2, a3, although they appears isomorphic to merge.  Even if it can
recognize over_widening pattern and do some parallel for two a0 from two
iterations, but it's still inefficient (high cost).

In this context, it looks better to build <a0, a1, a2, a3> first by
leveraging isomorphic computation trees constructing them, eg:
  w1_0123 = load_word(p1)
  V1_0123 = construct_vec(w1_0123)
  w1_4567 = load_word(p1 + 4)
  V1_4567 = construct_vec(w1_4567)
  w2_0123 = load_word(p2)
  V2_0123 = construct_vec(w2_0123)
  w2_4567 = load_word(p2 + 4)
  V2_4567 = construct_vec(w2_4567)
  V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16

But how to teach it to be aware of this? Currently the processing starts
from bottom to up (from stores), can we do some analysis on the SLP 
instance, detect some pattern and update the whole instance?

Besides, I also tried whether the standalone SLP pass can handle it,
it failed to. It stops at a* due to incompatible vector types.

The optimal vectorized SLP codes can be:
  // p1 byte 0 to byte 7 
  d1_0_7 = load_dword(p1)
  // p1+i1 b0 to b7, rename it as 8 to 15       
  d1_8_15 = load_dword(p1 + i1)
  d1_16_23 = load_dword(p1 + 2*i1) 
  d1_24_31 = load_dword(p1 + 3*i1)

  V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
  V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
  V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15, 
                      {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
  V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15, 
                      {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})

  // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all

  // Do the subtraction together (all 4x4 bytes)
  V_sub1 = V_d1_0_3_all - V_d2_0_3_all
  V_sub2 = V_d1_4_7_all - V_d2_4_7_all
  
  // Do some unpack and get the promoted vector int
  V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
  V_a0_1 = V_a0_tmp << 16
  V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
  // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
  V_a0 = V_a0_0 + V_a0_1
  
  // Get the similar for V_a1, V_a2, V_a3

  // Compute t0/t1/t2/t3
  // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
  V_t0 = V_a0 + V_a1  
  V_t1 = V_a0 - V_a1
  V_t2 = V_a2 + V_a3
  V_t3 = V_a2 - V_a3

  // Compute tmps
  // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
  V_tmp0 = V_t0 + V_t2
  V_tmp2 = V_t0 - V_t2
  V_tmp1 = V_t1 + V_t3
  V_tmp3 = V_t1 - V_t3
    
  // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
  // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3

>From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
  tmp[i][0] i=/0,1,2,3/ (one group)
  tmp[i][1] i=/0,1,2,3/ (one group)
  tmp[i][2] i=/0,1,2,3/ (one group)
  tmp[i][3] i=/0,1,2,3/ (one group)

which tmp[i][j] group have the same isomorphic computations.  But currently
SLP is unable to divide group like this way. (call it as A-way for now)

It's understandable since it has better adjacent store groups like,
  tmp[0][i] i=/0,1,2,3/ (one group)
  tmp[1][i] i=/0,1,2,3/ (one group)
  tmp[2][i] i=/0,1,2,3/ (one group)
  tmp[3][i] i=/0,1,2,3/ (one group)

A-way requires some additional vector permutations.  However, I thought
if the existing scheme can't get any SLP chances, it looks reasonable to
extend it to consider this A-way grouping.  Does it make sense?

Another question is that even if we can go with A-way grouping, it can
only handle packing one byte (four iteration -> 4), what place is
reasonable to extend it to pack more?  How about scaning all leaf
nodes and consider/transform them together?  too hacky?

Any comments are highly appreciated.  Thanks in advance!


BR,
Kewen

How to extend SLP to support this case

Reply via email to