Re: How to extend SLP to support this case

Richard Biener Tue, 10 Mar 2020 04:15:04 -0700

On Tue, Mar 10, 2020 at 12:12 PM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <li...@linux.ibm.com> wrote:
> >
> > Hi all,
> >
> > I'm investigating whether GCC can vectorize the below case on ppc64le.
> >
> >   extern void test(unsigned int t[4][4]);
> >
> >   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> >   {
> >     unsigned int tmp[4][4];
> >     unsigned int a0, a1, a2, a3;
> >
> >     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> >       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> >       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> >       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> >       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> >
> >       int t0 = a0 + a1;
> >       int t1 = a0 - a1;
> >       int t2 = a2 + a3;
> >       int t3 = a2 - a3;
> >
> >       tmp[i][0] = t0 + t2;
> >       tmp[i][2] = t0 - t2;
> >       tmp[i][1] = t1 + t3;
> >       tmp[i][3] = t1 - t3;
> >     }
> >     test(tmp);
> >   }
> >
> > With unlimited costs, I saw loop aware SLP can vectorize it but with very
> > inefficient codes.  It builds the SLP instance from store group {tmp[i][0]
> > tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0},
> > {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands
> > for tmp* and t*.  It means it's unable to make the isomorphic group for
> > a0, a1, a2, a3, although they appears isomorphic to merge.  Even if it can
> > recognize over_widening pattern and do some parallel for two a0 from two
> > iterations, but it's still inefficient (high cost).
> >
> > In this context, it looks better to build <a0, a1, a2, a3> first by
> > leveraging isomorphic computation trees constructing them, eg:
> >   w1_0123 = load_word(p1)
> >   V1_0123 = construct_vec(w1_0123)
> >   w1_4567 = load_word(p1 + 4)
> >   V1_4567 = construct_vec(w1_4567)
> >   w2_0123 = load_word(p2)
> >   V2_0123 = construct_vec(w2_0123)
> >   w2_4567 = load_word(p2 + 4)
> >   V2_4567 = construct_vec(w2_4567)
> >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> >
> > But how to teach it to be aware of this? Currently the processing starts
> > from bottom to up (from stores), can we do some analysis on the SLP
> > instance, detect some pattern and update the whole instance?
>
> In theory yes (Tamar had something like that for AARCH64 complex
> rotations IIRC).  And yes, the issue boils down to how we handle
> SLP discovery.  I'd like to improve SLP discovery but it's on my list
> only after I managed to get rid of the non-SLP code paths.  I have
> played with some ideas (even produced hackish patches) to find
> "seeds" to form SLP groups from using multi-level hashing of stmts.
> My plan is to rewrite SLP discovery completely, starting from a
> SLP graph that 1:1 reflects the SSA use-def graph (no groups
> formed yet) and then form groups from seeds and insert
> "connection" nodes (merging two subgraphs with less lanes
> into one with more lanes, splitting lanes, permuting lanes, etc.).
>
> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).
>
> > Besides, I also tried whether the standalone SLP pass can handle it,
> > it failed to. It stops at a* due to incompatible vector types.
> >
> > The optimal vectorized SLP codes can be:
> >   // p1 byte 0 to byte 7
> >   d1_0_7 = load_dword(p1)
> >   // p1+i1 b0 to b7, rename it as 8 to 15
> >   d1_8_15 = load_dword(p1 + i1)
> >   d1_16_23 = load_dword(p1 + 2*i1)
> >   d1_24_31 = load_dword(p1 + 3*i1)
> >
> >   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
> >   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
> >   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15,
> >                       {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
> >   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15,
> >                       {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
> >
> >   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
> >
> >   // Do the subtraction together (all 4x4 bytes)
> >   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
> >   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
> >
> >   // Do some unpack and get the promoted vector int
> >   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
> >   V_a0_1 = V_a0_tmp << 16
> >   V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
> >   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
> >   V_a0 = V_a0_0 + V_a0_1
> >
> >   // Get the similar for V_a1, V_a2, V_a3
> >
> >   // Compute t0/t1/t2/t3
> >   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
> >   V_t0 = V_a0 + V_a1
> >   V_t1 = V_a0 - V_a1
> >   V_t2 = V_a2 + V_a3
> >   V_t3 = V_a2 - V_a3
> >
> >   // Compute tmps
> >   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
> >   V_tmp0 = V_t0 + V_t2
> >   V_tmp2 = V_t0 - V_t2
> >   V_tmp1 = V_t1 + V_t3
> >   V_tmp3 = V_t1 - V_t3
> >
> >   // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
> >   // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3
> >
> > From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, 
> > eg:
> >   tmp[i][0] i=/0,1,2,3/ (one group)
> >   tmp[i][1] i=/0,1,2,3/ (one group)
> >   tmp[i][2] i=/0,1,2,3/ (one group)
> >   tmp[i][3] i=/0,1,2,3/ (one group)
> >
> > which tmp[i][j] group have the same isomorphic computations.  But currently
> > SLP is unable to divide group like this way. (call it as A-way for now)
> >
> > It's understandable since it has better adjacent store groups like,
> >   tmp[0][i] i=/0,1,2,3/ (one group)
> >   tmp[1][i] i=/0,1,2,3/ (one group)
> >   tmp[2][i] i=/0,1,2,3/ (one group)
> >   tmp[3][i] i=/0,1,2,3/ (one group)


Note this is how the non-SLP path will (try to) vectorize the loop.

> > A-way requires some additional vector permutations.  However, I thought
> > if the existing scheme can't get any SLP chances, it looks reasonable to
> > extend it to consider this A-way grouping.  Does it make sense?
> >
> > Another question is that even if we can go with A-way grouping, it can
> > only handle packing one byte (four iteration -> 4), what place is
> > reasonable to extend it to pack more?  How about scaning all leaf
> > nodes and consider/transform them together?  too hacky?
>
> Well, not sure - in the end it will all be heuristics since otherwise
> the exploration space is too big.  But surely concentrating on
> load/store groups is good.
>
> The SLP discovery code is already quite complicated btw., I'd
> hate to add "unstructured" hacks ontop of it right now without
> future design goals in mind.
>
> Richard.
>
> > Any comments are highly appreciated.  Thanks in advance!
> >
> >
> > BR,
> > Kewen
> >

Re: How to extend SLP to support this case

Reply via email to