On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi all,
>
> I'm investigating whether GCC can vectorize the below case on ppc64le.
>
>   extern void test(unsigned int t[4][4]);
>
>   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
>   {
>     unsigned int tmp[4][4];
>     unsigned int a0, a1, a2, a3;
>
>     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>
>       int t0 = a0 + a1;
>       int t1 = a0 - a1;
>       int t2 = a2 + a3;
>       int t3 = a2 - a3;
>
>       tmp[i][0] = t0 + t2;
>       tmp[i][2] = t0 - t2;
>       tmp[i][1] = t1 + t3;
>       tmp[i][3] = t1 - t3;
>     }
>     test(tmp);
>   }
>
> With unlimited costs, I saw that loop-aware SLP can vectorize it, but with
> very inefficient code.  It builds the SLP instance from the store group
> {tmp[i][0] tmp[i][1] tmp[i][2] tmp[i][3]}, and builds nodes {a0, a0, a0, a0},
> {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing the operands
> for tmp* and t*.  That means it's unable to build an isomorphic group for
> a0, a1, a2, a3, although they appear isomorphic enough to merge.  Even if it
> could recognize the over_widening pattern and process two a0 values from two
> iterations in parallel, that would still be inefficient (high cost).
>
> In this context, it looks better to build <a0, a1, a2, a3> first, by
> leveraging the isomorphic computation trees that construct them, e.g.:
>   w1_0123 = load_word(p1)
>   V1_0123 = construct_vec(w1_0123)
>   w1_4567 = load_word(p1 + 4)
>   V1_4567 = construct_vec(w1_4567)
>   w2_0123 = load_word(p2)
>   V2_0123 = construct_vec(w2_0123)
>   w2_4567 = load_word(p2 + 4)
>   V2_4567 = construct_vec(w2_4567)
>   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
>
> But how can we teach it to be aware of this?  Currently the processing runs
> bottom-up (from the stores); can we do some analysis on the SLP
> instance, detect such a pattern and update the whole instance?
In theory yes (Tamar had something like that for AArch64 complex
rotations IIRC).  And yes, the issue boils down to how we handle
SLP discovery.  I'd like to improve SLP discovery, but it's on my list
only after I manage to get rid of the non-SLP code paths.  I have
played with some ideas (even produced hackish patches) to find
"seeds" to form SLP groups from, using multi-level hashing of stmts.
My plan is to rewrite SLP discovery completely, starting from an
SLP graph that 1:1 reflects the SSA use-def graph (no groups
formed yet) and then forming groups from seeds and inserting
"connection" nodes (merging two subgraphs with fewer lanes
into one with more lanes, splitting lanes, permuting lanes, etc.).

Currently I'm working on doing exactly this but only for SLP loads
(because that's where it's most difficult...).

> Besides, I also tried whether the standalone SLP pass can handle it;
> it failed to.  It stops at a* due to incompatible vector types.
>
> The optimal vectorized SLP code could be:
>   // p1 byte 0 to byte 7
>   d1_0_7 = load_dword(p1)
>   // p1+i1 b0 to b7, rename it as 8 to 15
>   d1_8_15 = load_dword(p1 + i1)
>   d1_16_23 = load_dword(p1 + 2*i1)
>   d1_24_31 = load_dword(p1 + 3*i1)
>
>   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
>   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
>   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
>                       {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
>   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
>                       {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
>
>   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
>
>   // Do the subtraction together (all 4x4 bytes)
>   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
>   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
>
>   // Do some unpack and get the promoted vector int
>   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
>   V_a0_1 = V_a0_tmp << 16
>   V_a0_0 = vec_promote(V_sub1, {0 1 2 3}) // vector int {b0 b8 b16 b24}
>   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
>   V_a0 = V_a0_0 + V_a0_1
>
>   // Get the similar for V_a1, V_a2, V_a3
>
>   // Compute t0/t1/t2/t3
>   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
>   V_t0 = V_a0 + V_a1
>   V_t1 = V_a0 - V_a1
>   V_t2 = V_a2 + V_a3
>   V_t3 = V_a2 - V_a3
>
>   // Compute tmps
>   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
>   V_tmp0 = V_t0 + V_t2
>   V_tmp2 = V_t0 - V_t2
>   V_tmp1 = V_t1 + V_t3
>   V_tmp3 = V_t1 - V_t3
>
>   // Finally construct {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
>   // with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3
>
> From the above, the key point is to group tmp[i][j], i=/0,1,2,3/, together, e.g.:
>   tmp[i][0] i=/0,1,2,3/ (one group)
>   tmp[i][1] i=/0,1,2,3/ (one group)
>   tmp[i][2] i=/0,1,2,3/ (one group)
>   tmp[i][3] i=/0,1,2,3/ (one group)
>
> where each tmp[i][j] group has the same isomorphic computations.  But currently
> SLP is unable to divide the group this way (call it the A-way for now).
>
> It's understandable, since it has better, adjacent store groups like:
>   tmp[0][i] i=/0,1,2,3/ (one group)
>   tmp[1][i] i=/0,1,2,3/ (one group)
>   tmp[2][i] i=/0,1,2,3/ (one group)
>   tmp[3][i] i=/0,1,2,3/ (one group)
>
> The A-way grouping requires some additional vector permutations.  However, I
> thought that if the existing scheme can't find any SLP chances, it looks
> reasonable to extend it to consider this A-way grouping.  Does that make sense?
>
> Another question: even if we can go with the A-way grouping, it can
> only handle packing one byte (four iterations -> 4 lanes).  What would be a
> reasonable place to extend it to pack more?  How about scanning all leaf
> nodes and considering/transforming them together?  Too hacky?

Well, not sure - in the end it will all be heuristics, since otherwise
the exploration space is too big.  But surely concentrating on
load/store groups is good.

The SLP discovery code is already quite complicated, btw; I'd
hate to add "unstructured" hacks on top of it right now without
future design goals in mind.

Richard.

> Any comments are highly appreciated.  Thanks in advance!
>
>
> BR,
> Kewen
>
