Hi,

Gentle ping: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598601.html

BR,
Kewen

>
> on 2022/7/20 17:30, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Commit r12-6679-g7ca1582ca60dc8 made the vectorizer accept an
>> unroll factor to be applied to the vectorization factor when
>> vectorizing the main loop; the target suggests it during
>> costing.
>>
>> This patch introduces the function determine_suggested_unroll_factor
>> for the rs6000 port, so that it can suggest an unroll factor
>> for a given loop being vectorized.  Following the aarch64 port
>> and based on analysis of SPEC2017 performance evaluation
>> results, it mainly considers these aspects:
>>   1) unroll option and pragma which can disable unrolling for the
>>      given loop;
>>   2) simple hardware resource model with issued non memory access
>>      vector insn per cycle;
>>   3) aggressive heuristics when iteration count is unknown:
>>      - reduction case to break cross iteration dependency;
>>      - emulated gather load;
>>   4) estimated iteration count when iteration count is unknown;
>>
>> With this patch, SPEC2017 performance evaluation results on
>> Power8/9/10 are listed below (speedup pct.):
>>
>>   * Power10
>>     - O2: all are neutral (excluding some noise);
>>     - Ofast: 510.parest_r +6.67%, the others are neutral
>>       (written as "..." in the entries below);
>>     - Ofast + unroll: 510.parest_r +5.91%, ...
>>     - Ofast + LTO + PGO: 510.parest_r +3.00%, ...
>>     - Ofast + cheap vect cost: 510.parest_r +6.23%, ...
>>     - Ofast + very-cheap vect cost: all are neutral;
>>
>>   * Power9
>>     - Ofast: 510.parest_r +8.73%, 538.imagick_r +11.18%
>>       (likely noise), 500.perlbench_r +1.84%, ...
>>
>>   * Power8
>>     - Ofast: 510.parest_r +5.43%, ...;
>>
>> This patch also introduces one documented parameter
>> rs6000-vect-unroll-limit=, similar to what aarch64 proposes.
>> Evaluating on P8/P9/P10, the default value 4 is slightly
>> better than the other choices like 2 and 8.
>>
>> It also parameterizes two other values as undocumented
>> parameters for future tweaking.
>> One parameter is rs6000-vect-unroll-issue, which simply models
>> hardware resources for non memory access vector instructions to
>> avoid excessive unrolling.  Initially I tried to use the value in
>> the hook rs6000_issue_rate, but the evaluation showed it's
>> bad, so I evaluated different values 2/4/6/8 on P8/P9/P10 at
>> Ofast; the results showed the default value 4 is good enough
>> on these different architectures.  For the record, choice 8
>> could make 510.parest_r's gain become smaller or gone on
>> P8/P9/P10; choice 6 could make 503.bwaves_r degrade by more
>> than 1% on P8/P10; and choice 2 could make 538.imagick_r
>> degrade by 3.8%.  The other parameter is
>> rs6000-vect-unroll-reduc-threshold.  It's mainly inspired by
>> 510.parest_r and tuned for it; evaluating with different
>> values 0/1/2/3 for the threshold showed value 1 is the
>> best choice.  For the record, choice 0 could make 525.x264_r
>> degrade by 2% and 527.cam4_r degrade by 2.95% on P10,
>> 548.exchange2_r degrade by 1.41% and 527.cam4_r degrade by
>> 2.54% on P8; choice 2 and bigger values could make
>> 510.parest_r's gain become smaller.
>>
>> Bootstrapped and regtested on powerpc64-linux-gnu P7 and P8,
>> and powerpc64le-linux-gnu P9.  Bootstrapped on
>> powerpc64le-linux-gnu P10, but one failure was exposed during
>> regression testing there; it's identified as a missed
>> optimization and can be reproduced without this support,
>> PR106365 was opened for further tracking.
>>
>> Is it OK for trunk?
>>
>> BR,
>> Kewen
>> ------
>> gcc/ChangeLog:
>>
>>         * config/rs6000/rs6000.cc (class rs6000_cost_data): Add new members
>>         m_nstores, m_reduc_factor, m_gather_load and member function
>>         determine_suggested_unroll_factor.
>>         (rs6000_cost_data::update_target_cost_per_stmt): Update for m_nstores,
>>         m_reduc_factor and m_gather_load.
>>         (rs6000_cost_data::determine_suggested_unroll_factor): New function.
>>         (rs6000_cost_data::finish_cost): Use determine_suggested_unroll_factor.
>>         * config/rs6000/rs6000.opt (rs6000-vect-unroll-limit): New parameter.
>>         (rs6000-vect-unroll-issue): Likewise.
>>         (rs6000-vect-unroll-reduc-threshold): Likewise.
>>         * doc/invoke.texi (rs6000-vect-unroll-limit): Document new parameter.
>>
>> ---
>>  gcc/config/rs6000/rs6000.cc  | 125 ++++++++++++++++++++++++++++++++++-
>>  gcc/config/rs6000/rs6000.opt |  18 +++++
>>  gcc/doc/invoke.texi          |   7 ++
>>  3 files changed, 147 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index 3ff16b8ae04..d0f107d70a8 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -5208,16 +5208,23 @@ protected:
>>                             vect_cost_model_location, unsigned int);
>>    void density_test (loop_vec_info);
>>    void adjust_vect_cost_per_loop (loop_vec_info);
>> +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
>>
>>    /* Total number of vectorized stmts (loop only).  */
>>    unsigned m_nstmts = 0;
>>    /* Total number of loads (loop only).  */
>>    unsigned m_nloads = 0;
>> +  /* Total number of stores (loop only).  */
>> +  unsigned m_nstores = 0;
>> +  /* Reduction factor for suggesting unroll factor (loop only).  */
>> +  unsigned m_reduc_factor = 0;
>>    /* Possible extra penalized cost on vector construction (loop only).  */
>>    unsigned m_extra_ctor_cost = 0;
>>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>>       instruction is needed by the vectorization.  */
>>    bool m_vect_nonmem = false;
>> +  /* If this loop gets vectorized with emulated gather load.  */
>> +  bool m_gather_load = false;
>>  };
>>
>>  /* Test for likely overcommitment of vector hardware resources.  If a
>> @@ -5368,9 +5375,34 @@ rs6000_cost_data::update_target_cost_per_stmt
>> (vect_cost_for_stmt kind,
>>      {
>>        m_nstmts += orig_count;
>>
>> -      if (kind == scalar_load || kind == vector_load
>> -          || kind == unaligned_load || kind == vector_gather_load)
>> -        m_nloads += orig_count;
>> +      if (kind == scalar_load
>> +          || kind == vector_load
>> +          || kind == unaligned_load
>> +          || kind == vector_gather_load)
>> +        {
>> +          m_nloads += orig_count;
>> +          if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +            m_gather_load = true;
>> +        }
>> +      else if (kind == scalar_store
>> +               || kind == vector_store
>> +               || kind == unaligned_store
>> +               || kind == vector_scatter_store)
>> +        m_nstores += orig_count;
>> +      else if ((kind == scalar_stmt
>> +                || kind == vector_stmt
>> +                || kind == vec_to_scalar)
>> +               && stmt_info
>> +               && vect_is_reduction (stmt_info))
>> +        {
>> +          /* Loop body contains normal int or fp operations and epilogue
>> +             contains vector reduction.  For simplicity, we assume int
>> +             operation takes one cycle and fp operation takes one more.  */
>> +          tree lhs = gimple_get_lhs (stmt_info->stmt);
>> +          bool is_float = FLOAT_TYPE_P (TREE_TYPE (lhs));
>> +          unsigned int basic_cost = is_float ? 2 : 1;
>> +          m_reduc_factor = MAX (basic_cost * orig_count, m_reduc_factor);
>> +        }
>>
>>        /* Power processors do not currently have instructions for strided
>>           and elementwise loads, and instead we must generate multiple
>> @@ -5462,6 +5494,90 @@ rs6000_cost_data::adjust_vect_cost_per_loop
>> (loop_vec_info loop_vinfo)
>>      }
>>  }
>>
>> +/* Determine suggested unroll factor by considering some below factors:
>> +
>> +    - unroll option/pragma which can disable unrolling for this loop;
>> +    - simple hardware resource model for non memory vector insns;
>> +    - aggressive heuristics when iteration count is unknown:
>> +      - reduction case to break cross iteration dependency;
>> +      - emulated gather load;
>> +    - estimated iteration count when iteration count is unknown;
>> +*/
>> +
>> +
>> +unsigned int
>> +rs6000_cost_data::determine_suggested_unroll_factor (loop_vec_info
>> loop_vinfo)
>> +{
>> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +
>> +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
>> +  if (loop->unroll == 1
>> +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
>> +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
>> +    return 1;
>> +
>> +  unsigned int nstmts_nonldst = m_nstmts - m_nloads - m_nstores;
>> +  /* Don't unroll if no vector instructions excepting for memory access.  */
>> +  if (nstmts_nonldst == 0)
>> +    return 1;
>> +
>> +  /* Consider breaking cross iteration dependency for reduction.  */
>> +  unsigned int reduc_factor = m_reduc_factor > 1 ? m_reduc_factor : 1;
>> +
>> +  /* Use this simple hardware resource model that how many non ld/st
>> +     vector instructions can be issued per cycle.  */
>> +  unsigned int issue_width = rs6000_vect_unroll_issue;
>> +  unsigned int uf = CEIL (reduc_factor * issue_width, nstmts_nonldst);
>> +  uf = MIN ((unsigned int) rs6000_vect_unroll_limit, uf);
>> +  /* Make sure it is power of 2.  */
>> +  uf = 1 << ceil_log2 (uf);
>> +
>> +  /* If the iteration count is known, the costing would be exact enough,
>> +     don't worry it could be worse.  */
>> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>> +    return uf;
>> +
>> +  /* Inspired by SPEC2017 parest_r, we want to aggressively unroll the
>> +     loop if either condition is satisfied:
>> +       - reduction factor exceeds the threshold;
>> +       - emulated gather load adopted.  */
>> +  if (reduc_factor > (unsigned int) rs6000_vect_unroll_reduc_threshold
>> +      || m_gather_load)
>> +    return uf;
>> +
>> +  /* Check if we can conclude it's good to unroll from the estimated
>> +     iteration count.  */
>> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
>> +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
>> +  unsigned int unrolled_vf = vf * uf;
>> +  if (est_niter == -1 || est_niter < unrolled_vf)
>> +    /* When the estimated iteration of this loop is unknown, it's possible
>> +       that we are able to vectorize this loop with the original VF but fail
>> +       to vectorize it with the unrolled VF any more if the actual iteration
>> +       count is in between.  */
>> +    return 1;
>> +  else
>> +    {
>> +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
>> +      unsigned int epil_niter = est_niter % vf;
>> +      /* Even if we have partial vector support, it can be still inefficent
>> +         to calculate the length when the iteration count is unknown, so
>> +         only expect it's good to unroll when the epilogue iteration count
>> +         is not bigger than VF (only one time length calculation).  */
>> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> +          && epil_niter_unr <= vf)
>> +        return uf;
>> +      /* Without partial vector support, conservatively unroll this when
>> +         the epilogue iteration count is less than the original one
>> +         (epilogue execution time wouldn't be longer than before).  */
>> +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> +               && epil_niter_unr <= epil_niter)
>> +        return uf;
>> +    }
>> +
>> +  return 1;
>> +}
>> +
>>  void
>>  rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>>  {
>> @@ -5478,6 +5594,9 @@ rs6000_cost_data::finish_cost (const vector_costs
>> *scalar_costs)
>>            && LOOP_VINFO_VECT_FACTOR (loop_vinfo) == 2
>>            && LOOP_REQUIRES_VERSIONING (loop_vinfo))
>>          m_costs[vect_body] += 10000;
>> +
>> +      m_suggested_unroll_factor
>> +        = determine_suggested_unroll_factor (loop_vinfo);
>>      }
>>
>>    vector_costs::finish_cost (scalar_costs);
>> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
>> index 4931d781c4e..80c2c61a9de 100644
>> --- a/gcc/config/rs6000/rs6000.opt
>> +++ b/gcc/config/rs6000/rs6000.opt
>> @@ -624,6 +624,14 @@ mieee128-constant
>>  Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
>>  Generate (do not generate) code that uses the LXVKQ instruction.
>>
>> +; Documented parameters
>> +
>> +-param=rs6000-vect-unroll-limit=
>> +Target Joined UInteger Var(rs6000_vect_unroll_limit) Init(4)
>> IntegerRange(1, 64) Param
>> +Used to limit unroll factor which indicates how much the autovectorizer may
>> +unroll a loop.  The default value is 4.
>> +
>> +; Undocumented parameters
>>  -param=rs6000-density-pct-threshold=
>>  Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold)
>> Init(85) IntegerRange(0, 100) Param
>>  When costing for loop vectorization, we probably need to penalize the loop
>> body
>> @@ -661,3 +669,13 @@ Like parameter rs6000-density-load-pct-threshold, we
>> also check if the total
>>  number of load statements exceeds the threshold specified by this parameter,
>>  and penalize only if it's satisfied.  The default value is 20.
>>
>> +-param=rs6000-vect-unroll-issue=
>> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_issue) Init(4)
>> IntegerRange(1, 128) Param
>> +Indicate how many non memory access vector instructions can be issued per
>> +cycle, it's used in unroll factor determination for autovectorizer.  The
>> +default value is 4.
>> +
>> +-param=rs6000-vect-unroll-reduc-threshold=
>> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_reduc_threshold)
>> Init(1) Param
>> +When reduction factor computed for a loop exceeds the threshold specified by
>> +this parameter, prefer to unroll this loop.  The default value is 1.
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 84d6f0f9860..097ab1d5563 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -29658,6 +29658,13 @@ Generate (do not generate) code that will run in
>> privileged state.
>>  @opindex no-block-ops-unaligned-vsx
>>  Generate (do not generate) unaligned vsx loads and stores for
>>  inline expansion of @code{memcpy} and @code{memmove}.
>> +
>> +@item --param rs6000-vect-unroll-limit=
>> +The vectorizer will check with target information to determine whether it
>> +would be beneficial to unroll the main vectorized loop and by how much.
>> This
>> +parameter sets the upper bound of how much the vectorizer will unroll the
>> main
>> +loop.  The default value is four.
>> +
>>  @end table
>>
>>  @node RX Options
>> --
>> 2.27.0
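
For anyone who wants to poke at the heuristic without building GCC, here is a rough standalone sketch of the decision flow in determine_suggested_unroll_factor from the patch above.  It is not GCC code: the LoopSummary and UnrollParams structs, their field names and suggest_unroll_factor are hypothetical stand-ins for the rs6000_cost_data members, the loop_vec_info queries and the three --param values, and the explicit unroll option/pragma checks are folded into a single flag.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for the three --param values (defaults from the patch).
struct UnrollParams {
  unsigned issue = 4;            // rs6000-vect-unroll-issue
  unsigned limit = 4;            // rs6000-vect-unroll-limit
  unsigned reduc_threshold = 1;  // rs6000-vect-unroll-reduc-threshold
};

// Hypothetical summary of what the cost data has collected for one loop.
struct LoopSummary {
  unsigned nstmts = 0;           // all vectorized stmts
  unsigned nloads = 0;
  unsigned nstores = 0;
  unsigned reduc_factor = 0;     // 0 when the loop has no reduction
  bool gather_load = false;      // emulated gather load seen
  bool unroll_disabled = false;  // option/pragma forbids unrolling
  bool niters_known = false;     // iteration count known at compile time
  long long est_niter = -1;      // estimated iteration count, -1 if unknown
  unsigned vf = 1;               // vectorization factor
  bool partial_vectors = false;  // loop can use partial vectors
};

// Smallest power of two that is >= x (x >= 1).
static unsigned ceil_pow2(unsigned x) {
  unsigned p = 1;
  while (p < x) p <<= 1;
  return p;
}

unsigned suggest_unroll_factor(const LoopSummary &l,
                               const UnrollParams &p = UnrollParams()) {
  if (l.unroll_disabled) return 1;

  unsigned nonldst = l.nstmts - l.nloads - l.nstores;
  if (nonldst == 0) return 1;  // only memory accesses, nothing to overlap

  unsigned reduc = std::max(l.reduc_factor, 1u);
  unsigned uf = (reduc * p.issue + nonldst - 1) / nonldst;  // ceiling division
  uf = ceil_pow2(std::min(p.limit, uf));

  if (l.niters_known) return uf;
  if (reduc > p.reduc_threshold || l.gather_load) return uf;  // aggressive cases

  unsigned unrolled_vf = l.vf * uf;
  if (l.est_niter == -1 || l.est_niter < (long long) unrolled_vf) return 1;

  unsigned epil_unr = (unsigned) (l.est_niter % unrolled_vf);
  unsigned epil = (unsigned) (l.est_niter % l.vf);
  if (l.partial_vectors) return epil_unr <= l.vf ? uf : 1;
  return epil_unr <= epil ? uf : 1;
}
```

With the default values, a loop that has 5 vector stmts, 2 of them loads, and one fp reduction (reduction factor 2) gives CEIL (2 * 4, 3) = 3, which is rounded up to 4 and returned through the reduction threshold check even when the iteration count is unknown.  A loop without a reduction falls through to the estimated iteration count checks instead.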