On Tue, Mar 10, 2020 at 12:12 PM Richard Biener <richard.guent...@gmail.com> wrote: > > On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <li...@linux.ibm.com> wrote: > > > > Hi all, > > > > I'm investigating whether GCC can vectorize the below case on ppc64le. > > > > extern void test(unsigned int t[4][4]); > > > > void foo(unsigned char *p1, int i1, unsigned char *p2, int i2) > > { > > unsigned int tmp[4][4]; > > unsigned int a0, a1, a2, a3; > > > > for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) { > > a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16); > > a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16); > > a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16); > > a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16); > > > > int t0 = a0 + a1; > > int t1 = a0 - a1; > > int t2 = a2 + a3; > > int t3 = a2 - a3; > > > > tmp[i][0] = t0 + t2; > > tmp[i][2] = t0 - t2; > > tmp[i][1] = t1 + t3; > > tmp[i][3] = t1 - t3; > > } > > test(tmp); > > } > > > > With unlimited costs, I saw loop aware SLP can vectorize it but with very > > inefficient codes. It builds the SLP instance from store group {tmp[i][0] > > tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0}, > > {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands > > for tmp* and t*. It means it's unable to make the isomorphic group for > > a0, a1, a2, a3, although they appears isomorphic to merge. Even if it can > > recognize over_widening pattern and do some parallel for two a0 from two > > iterations, but it's still inefficient (high cost). > > > > In this context, it looks better to build <a0, a1, a2, a3> first by > > leveraging isomorphic computation trees constructing them, eg: > > w1_0123 = load_word(p1) > > V1_0123 = construct_vec(w1_0123) > > w1_4567 = load_word(p1 + 4) > > V1_4567 = construct_vec(w1_4567) > > w2_0123 = load_word(p2) > > V2_0123 = construct_vec(w2_0123) > > w2_4567 = load_word(p2 + 4) > > V2_4567 = construct_vec(w2_4567) > > V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16 > > > > But how to teach it to be aware of this? Currently the processing starts > > from bottom to up (from stores), can we do some analysis on the SLP > > instance, detect some pattern and update the whole instance? > > In theory yes (Tamar had something like that for AARCH64 complex > rotations IIRC). And yes, the issue boils down to how we handle > SLP discovery. I'd like to improve SLP discovery but it's on my list > only after I managed to get rid of the non-SLP code paths. I have > played with some ideas (even produced hackish patches) to find > "seeds" to form SLP groups from using multi-level hashing of stmts. > My plan is to rewrite SLP discovery completely, starting from a > SLP graph that 1:1 reflects the SSA use-def graph (no groups > formed yet) and then form groups from seeds and insert > "connection" nodes (merging two subgraphs with less lanes > into one with more lanes, splitting lanes, permuting lanes, etc.). > > Currently I'm working on doing exactly this but only for SLP loads > (because that's where it's most difficult...). > > > Besides, I also tried whether the standalone SLP pass can handle it, > > it failed to. It stops at a* due to incompatible vector types. > > > > The optimal vectorized SLP codes can be: > > // p1 byte 0 to byte 7 > > d1_0_7 = load_dword(p1) > > // p1+i1 b0 to b7, rename it as 8 to 15 > > d1_8_15 = load_dword(p1 + i1) > > d1_16_23 = load_dword(p1 + 2*i1) > > d1_24_31 = load_dword(p1 + 3*i1) > > > > V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char > > V_d1_16_31 = construct_vec(d1_16_23,d1_24_31) > > V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15, > > {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27}) > > V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15, > > {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31}) > > > > // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all > > > > // Do the subtraction together (all 4x4 bytes) > > V_sub1 = V_d1_0_3_all - V_d2_0_3_all > > V_sub2 = V_d1_4_7_all - V_d2_4_7_all > > > > // Do some unpack and get the promoted vector int > > V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28} > > V_a0_1 = V_a0_tmp << 16 > > V_a0_0 = vec_promote(V_sub1, {0 1 2 3}). // vector int {b0 b8 b16 b24} > > // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3} > > V_a0 = V_a0_0 + V_a0_1 > > > > // Get the similar for V_a1, V_a2, V_a3 > > > > // Compute t0/t1/t2/t3 > > // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3} > > V_t0 = V_a0 + V_a1 > > V_t1 = V_a0 - V_a1 > > V_t2 = V_a2 + V_a3 > > V_t3 = V_a2 - V_a3 > > > > // Compute tmps > > // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]} > > V_tmp0 = V_t0 + V_t2 > > V_tmp2 = V_t0 - V_t2 > > V_tmp1 = V_t1 + V_t3 > > V_tmp3 = V_t1 - V_t3 > > > > // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ... > > // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3 > > > > From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, > > eg: > > tmp[i][0] i=/0,1,2,3/ (one group) > > tmp[i][1] i=/0,1,2,3/ (one group) > > tmp[i][2] i=/0,1,2,3/ (one group) > > tmp[i][3] i=/0,1,2,3/ (one group) > > > > which tmp[i][j] group have the same isomorphic computations. But currently > > SLP is unable to divide group like this way. (call it as A-way for now) > > > > It's understandable since it has better adjacent store groups like, > > tmp[0][i] i=/0,1,2,3/ (one group) > > tmp[1][i] i=/0,1,2,3/ (one group) > > tmp[2][i] i=/0,1,2,3/ (one group) > > tmp[3][i] i=/0,1,2,3/ (one group)
Note this is how the non-SLP path will (try to) vectorize the loop. > > A-way requires some additional vector permutations. However, I thought > > if the existing scheme can't get any SLP chances, it looks reasonable to > > extend it to consider this A-way grouping. Does it make sense? > > > > Another question is that even if we can go with A-way grouping, it can > > only handle packing one byte (four iteration -> 4), what place is > > reasonable to extend it to pack more? How about scaning all leaf > > nodes and consider/transform them together? too hacky? > > Well, not sure - in the end it will all be heuristics since otherwise > the exploration space is too big. But surely concentrating on > load/store groups is good. > > The SLP discovery code is already quite complicated btw., I'd > hate to add "unstructured" hacks ontop of it right now without > future design goals in mind. > > Richard. > > > Any comments are highly appreciated. Thanks in advance! > > > > > > BR, > > Kewen > >