Hi all, I'm investigating whether GCC can vectorize the below case on ppc64le.
extern void test(unsigned int t[4][4]); void foo(unsigned char *p1, int i1, unsigned char *p2, int i2) { unsigned int tmp[4][4]; unsigned int a0, a1, a2, a3; for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) { a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16); a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16); a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16); a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16); int t0 = a0 + a1; int t1 = a0 - a1; int t2 = a2 + a3; int t3 = a2 - a3; tmp[i][0] = t0 + t2; tmp[i][2] = t0 - t2; tmp[i][1] = t1 + t3; tmp[i][3] = t1 - t3; } test(tmp); } With unlimited costs, I saw loop aware SLP can vectorize it but with very inefficient codes. It builds the SLP instance from store group {tmp[i][0] tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0}, {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands for tmp* and t*. It means it's unable to make the isomorphic group for a0, a1, a2, a3, although they appears isomorphic to merge. Even if it can recognize over_widening pattern and do some parallel for two a0 from two iterations, but it's still inefficient (high cost). In this context, it looks better to build <a0, a1, a2, a3> first by leveraging isomorphic computation trees constructing them, eg: w1_0123 = load_word(p1) V1_0123 = construct_vec(w1_0123) w1_4567 = load_word(p1 + 4) V1_4567 = construct_vec(w1_4567) w2_0123 = load_word(p2) V2_0123 = construct_vec(w2_0123) w2_4567 = load_word(p2 + 4) V2_4567 = construct_vec(w2_4567) V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16 But how to teach it to be aware of this? Currently the processing starts from bottom to up (from stores), can we do some analysis on the SLP instance, detect some pattern and update the whole instance? Besides, I also tried whether the standalone SLP pass can handle it, it failed to. It stops at a* due to incompatible vector types. The optimal vectorized SLP codes can be: // p1 byte 0 to byte 7 d1_0_7 = load_dword(p1) // p1+i1 b0 to b7, rename it as 8 to 15 d1_8_15 = load_dword(p1 + i1) d1_16_23 = load_dword(p1 + 2*i1) d1_24_31 = load_dword(p1 + 3*i1) V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char V_d1_16_31 = construct_vec(d1_16_23,d1_24_31) V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15, {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27}) V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15, {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31}) // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all // Do the subtraction together (all 4x4 bytes) V_sub1 = V_d1_0_3_all - V_d2_0_3_all V_sub2 = V_d1_4_7_all - V_d2_4_7_all // Do some unpack and get the promoted vector int V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28} V_a0_1 = V_a0_tmp << 16 V_a0_0 = vec_promote(V_sub1, {0 1 2 3}). // vector int {b0 b8 b16 b24} // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3} V_a0 = V_a0_0 + V_a0_1 // Get the similar for V_a1, V_a2, V_a3 // Compute t0/t1/t2/t3 // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3} V_t0 = V_a0 + V_a1 V_t1 = V_a0 - V_a1 V_t2 = V_a2 + V_a3 V_t3 = V_a2 - V_a3 // Compute tmps // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]} V_tmp0 = V_t0 + V_t2 V_tmp2 = V_t0 - V_t2 V_tmp1 = V_t1 + V_t3 V_tmp3 = V_t1 - V_t3 // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ... // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3 >From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg: tmp[i][0] i=/0,1,2,3/ (one group) tmp[i][1] i=/0,1,2,3/ (one group) tmp[i][2] i=/0,1,2,3/ (one group) tmp[i][3] i=/0,1,2,3/ (one group) which tmp[i][j] group have the same isomorphic computations. But currently SLP is unable to divide group like this way. (call it as A-way for now) It's understandable since it has better adjacent store groups like, tmp[0][i] i=/0,1,2,3/ (one group) tmp[1][i] i=/0,1,2,3/ (one group) tmp[2][i] i=/0,1,2,3/ (one group) tmp[3][i] i=/0,1,2,3/ (one group) A-way requires some additional vector permutations. However, I thought if the existing scheme can't get any SLP chances, it looks reasonable to extend it to consider this A-way grouping. Does it make sense? Another question is that even if we can go with A-way grouping, it can only handle packing one byte (four iteration -> 4), what place is reasonable to extend it to pack more? How about scaning all leaf nodes and consider/transform them together? too hacky? Any comments are highly appreciated. Thanks in advance! BR, Kewen