This patch was updated with some new changes.

When transforming multiple lane-reducing operations in a loop reduction
chain, originally, the corresponding vectorized statements are generated
into def-use cycles starting from index 0. A def-use cycle with a smaller
index would contain more statements, which means more instruction
dependencies. For example:
  int sum = 0;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
      sum += w[i];               // widen-sum <vector(16) char>
      sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
    }

Original transformation result:

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy

      sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy

      sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
      sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

For higher instruction parallelism in the final vectorized loop, an optimal
approach is to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM
and the SADs are generated into separate cycles, so the instruction
dependency is eliminated.

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy

      sum_v0 = sum_v0;  // copy
      sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy

      sum_v0 = sum_v0;  // copy
      sum_v1 = sum_v1;  // copy
      sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
      sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
    }

2024-03-22  Feng Xue  <f...@os.amperecomputing.com>

gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate
	lane-reducing statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 43 +++++++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 5a27a2c3d9c..adee54350d4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8821,9 +8821,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
        sum_v2 = sum_v2;  // copy
        sum_v3 = sum_v3;  // copy
 
-       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-       sum_v2 = sum_v2;  // copy
+       sum_v0 = sum_v0;  // copy
+       sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+       sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
        sum_v3 = sum_v3;  // copy
 
        sum_v0 += n_v0[i: 0 ~ 3 ];
@@ -8831,7 +8831,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
        sum_v2 += n_v2[i: 8 ~ 11];
        sum_v3 += n_v3[i: 12 ~ 15];
      }
-   */
+
+     Moreover, for a higher instruction parallelism in final vectorized
+     loop, it is considered to make those effective vectorized lane-
+     reducing statements be distributed evenly among all def-use cycles.
+     In the above example, SADs are generated into other cycles rather
+     than that of DOT_PROD.  */
 
   tree phi_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
   unsigned all_ncopies = vect_get_num_copies (loop_vinfo, phi_vectype_in);
   unsigned use_ncopies = vec_oprnds[0].length ();
@@ -8855,6 +8860,36 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	  gcc_assert (vec_oprnds[i].length () == use_ncopies);
 	  vec_oprnds[i].safe_grow_cleared (all_ncopies);
 	}
+
+      /* Find suitable def-use cycles to generate vectorized statements
+	 into, and reorder operands based on the selection.  */
+      unsigned curr_pos = reduc_info->reduc_result_pos;
+      unsigned next_pos = (curr_pos + use_ncopies) % all_ncopies;
+
+      gcc_assert (curr_pos < all_ncopies);
+      reduc_info->reduc_result_pos = next_pos;
+
+      if (curr_pos)
+	{
+	  unsigned count = all_ncopies - use_ncopies;
+	  unsigned start = curr_pos - count;
+
+	  if ((int) start < 0)
+	    {
+	      count = curr_pos;
+	      start = 0;
+	    }
+
+	  for (unsigned i = 0; i < op.num_ops - 1; i++)
+	    {
+	      for (unsigned j = use_ncopies; j > start; j--)
+		{
+		  unsigned k = j - 1;
+		  std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+		  gcc_assert (!vec_oprnds[i][k]);
+		}
+	    }
+	}
     }
 }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results; this field specifies the result index starting from which
+     any following lane-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing N scalar
      reductions in parallel, this variable gives the initial scalar values of
      those N reductions.  */
--
2.17.1

________________________________________
From: Feng Xue OS <f...@os.amperecomputing.com>
Sent: Sunday, June 16, 2024 3:32 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles

When transforming multiple lane-reducing operations in a loop reduction
chain, originally, corresponding vectorized statements are generated into
def-use cycles starting from 0. The def-use cycle with smaller index,
would contain more statements, which means more instruction dependency.
For example:

  int sum = 0;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
      sum += w[i];               // widen-sum <vector(16) char>
      sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
    }

Original transformation result:

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy

      sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy

      sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
      sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

For higher instruction parallelism in the final vectorized loop, an optimal
approach is to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM
and the SADs are generated into separate cycles, so the instruction
dependency is eliminated.

Thanks,
Feng

---
gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate
	lane-reducing statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 39 +++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d91665a341..c7e13d655d8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
        sum_v2 = sum_v2;  // copy
        sum_v3 = sum_v3;  // copy
 
-       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-       sum_v2 = sum_v2;  // copy
+       sum_v0 = sum_v0;  // copy
+       sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+       sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
        sum_v3 = sum_v3;  // copy
 
        sum_v0 += n_v0[i: 0 ~ 3 ];
@@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
        sum_v2 += n_v2[i: 8 ~ 11];
        sum_v3 += n_v3[i: 12 ~ 15];
      }
-   */
+
+     Moreover, for a higher instruction parallelism in final vectorized
+     loop, it is considered to make those effective vectorized lane-
+     reducing statements be distributed evenly among all def-use cycles.
+     In the above example, SADs are generated into other cycles rather
+     than that of DOT_PROD.  */
 
   unsigned using_ncopies = vec_oprnds[0].length ();
   unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+  unsigned result_pos = reduc_info->reduc_result_pos;
+
+  reduc_info->reduc_result_pos
+    = (result_pos + using_ncopies) % reduc_ncopies;
+  gcc_assert (result_pos < reduc_ncopies);
 
   for (unsigned i = 0; i < op.num_ops - 1; i++)
     {
       gcc_assert (vec_oprnds[i].length () == using_ncopies);
       vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+
+      /* Find suitable def-use cycles to generate vectorized statements
+	 into, and reorder operands based on the selection.  */
+      if (result_pos)
+	{
+	  unsigned count = reduc_ncopies - using_ncopies;
+	  unsigned start = result_pos - count;
+
+	  if ((int) start < 0)
+	    {
+	      count = result_pos;
+	      start = 0;
+	    }
+
+	  for (unsigned j = using_ncopies; j > start; j--)
+	    {
+	      unsigned k = j - 1;
+	      std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+	      gcc_assert (!vec_oprnds[i][k]);
+	    }
+	}
     }
 }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results, this field specifies result index starting from which any
+     following lane-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing N scalar
      reductions in parallel, this variable gives the initial scalar values of
      those N reductions.  */
--
2.17.1
From 9435bf24fd085796d5a801a7f8f22a2374d27458 Mon Sep 17 00:00:00 2001
From: Feng Xue <f...@os.amperecomputing.com>
Date: Wed, 29 May 2024 17:28:14 +0800
Subject: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop
 def-use cycles