"Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes: > Hi, > > Addressed all of your comments bar the pred ops one. > > Is this OK? > > > gcc/ChangeLog: > > * config/aarch64/aarch64.cc (aarch64_vector_costs): Define > determine_suggested_unroll_factor and m_nosve_pattern. > (determine_suggested_unroll_factor): New function. > (aarch64_vector_costs::add_stmt_cost): Check for a qualifying > pattern > to set m_nosve_pattern. > (aarch64_vector_costs::finish_costs): Use > determine_suggested_unroll_factor. > * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New. > > On 16/03/2022 18:01, Richard Sandiford wrote: >> "Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes: >>> Hi, >>> >>> This patch implements the costing function >>> determine_suggested_unroll_factor for aarch64. >>> It determines the unrolling factor by dividing the number of X >>> operations we can do per cycle by the number of X operations in the loop >>> body, taking this information from the vec_ops analysis during vector >>> costing and the available issue_info information. >>> We multiply the dividend by a potential reduction_latency, to improve >>> our pipeline utilization if we are stalled waiting on a particular >>> reduction operation. >>> >>> Right now we also have a work around for vectorization choices where the >>> main loop uses a NEON mode and predication is available, such that if >>> the main loop makes use of a NEON pattern that is not directly supported >>> by SVE we do not unroll, as that might cause performance regressions in >>> cases where we would enter the original main loop's VF. As an example if >>> you have a loop where you could use AVG_CEIL with a V8HI mode, you would >>> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated >>> epilogue, using other instructions. Whereas with the unrolling you would >>> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping >>> the original 8x NEON. In the future, we could handle this differently, >>> by either using a different costing model for epilogues, or potentially >>> vectorizing more than one single epilogue. >>> >>> gcc/ChangeLog: >>> >>> * config/aarch64/aarch64.cc (aarch64_vector_costs): Define >>> determine_suggested_unroll_factor. >>> (determine_suggested_unroll_factor): New function. >>> (aarch64_vector_costs::finish_costs): Use >>> determine_suggested_unroll_factor. >>> >>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >>> index >>> b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 >>> 100644 >>> --- a/gcc/config/aarch64/aarch64.cc >>> +++ b/gcc/config/aarch64/aarch64.cc >>> @@ -15680,6 +15680,7 @@ private: >>> unsigned int adjust_body_cost (loop_vec_info, const >>> aarch64_vector_costs *, >>> unsigned int); >>> bool prefer_unrolled_loop () const; >>> + unsigned int determine_suggested_unroll_factor (); >>> >>> /* True if we have performed one-time initialization based on the >>> vec_info. */ >>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count >>> *ops, >>> return sve_cycles_per_iter; >>> } >>> >>> +unsigned int >>> +aarch64_vector_costs::determine_suggested_unroll_factor () >>> +{ >>> + auto *issue_info = aarch64_tune_params.vec_costs->issue_info; >>> + if (!issue_info) >>> + return 1; >>> + bool sve = false; >>> + if (aarch64_sve_mode_p (m_vinfo->vector_mode)) >> Other code uses m_vec_flags & VEC_ANY_SVE for this. 
>>
>>> +    {
>>> +      if (!issue_info->sve)
>>> +        return 1;
>>> +      sve = true;
>>> +    }
>>> +  else
>>> +    {
>>> +      if (!issue_info->advsimd)
>>> +        return 1;
>> The issue info should instead be taken from vec_ops.simd_issue_info ()
>> in the loop below.  It can vary for each entry.
>>
>>> +      /* If we are trying to unroll a NEON main loop that contains patterns
>> s/a NEON/an Advanced SIMD/
>>
>>> +         that we do not support with SVE and we might use a predicated
>>> +         epilogue, we need to be conservative and block unrolling as this might
>>> +         lead to a less optimal loop for the first and only epilogue using the
>>> +         original loop's vectorization factor.
>>> +         TODO: Remove this constraint when we add support for multiple epilogue
>>> +         vectorization.  */
>>> +      if (partial_vectors_supported_p ()
>>> +          && param_vect_partial_vector_usage != 0
>>> +          && !TARGET_SVE2)
>>> +        {
>>> +          unsigned int i;
>>> +          stmt_vec_info stmt_vinfo;
>>> +          FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
>>> +            {
>>> +              if (is_pattern_stmt_p (stmt_vinfo))
>>> +                {
>>> +                  gimple *stmt = stmt_vinfo->stmt;
>>> +                  if (is_gimple_call (stmt)
>>> +                      && gimple_call_internal_p (stmt))
>>> +                    {
>>> +                      enum internal_fn ifn
>>> +                        = gimple_call_internal_fn (stmt);
>>> +                      switch (ifn)
>>> +                        {
>>> +                        case IFN_AVG_FLOOR:
>>> +                        case IFN_AVG_CEIL:
>>> +                          return 1;
>>> +                        default:
>>> +                          break;
>>> +                        }
>>> +                    }
>>> +                }
>>> +            }
>>> +        }
>> I think we should instead record this during add_stmt_cost, like we do
>> for the other data that this function uses.
>>
>> We need to drop the partial_vectors_supported_p () condition though:
>> enabling SVE must not make Advanced SIMD code worse.  So for now,
>> Advanced SIMD-only code will have to suffer the missed unrolling
>> opportunity too.
>>
>>> +    }
>>> +
>>> +  unsigned int max_unroll_factor = 1;
>>> +  aarch64_simd_vec_issue_info const *vec_issue
>>> +    = sve ? issue_info->sve : issue_info->advsimd;
>>> +  for (auto vec_ops : m_ops)
>>> +    {
>>> +      /* Limit unroll factor to 4 for now.  */
>>> +      unsigned int unroll_factor = 4;
>> Would be good to make this an aarch64-specific --param in case people
>> want to play with different values.
>>
>>> +      unsigned int factor
>>> +        = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>>> +      unsigned int temp;
>>> +
>>> +      /* Sanity check, this should never happen.  */
>>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>>> +        return 1;
>>> +
>>> +      /* Check stores.  */
>>> +      if (vec_ops.stores > 0)
>>> +        {
>>> +          temp = CEIL (factor * vec_issue->stores_per_cycle,
>>> +                       vec_ops.stores);
>>> +          unroll_factor = MIN (unroll_factor, temp);
>>> +        }
>>> +
>>> +      /* Check loads.  */
>>> +      if (vec_ops.loads > 0)
>>> +        {
>>> +          temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>>> +                       vec_ops.loads);
>> This should use loads + stores rather than just loads.
>>
>>> +          unroll_factor = MIN (unroll_factor, temp);
>>> +        }
>>> +
>>> +      /* Check general ops.  */
>>> +      if (vec_ops.general_ops > 0)
>>> +        {
>>> +          temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>>> +                       vec_ops.general_ops);
>>> +          unroll_factor = MIN (unroll_factor, temp);
>>> +        }
>> In principle we should also take pred_ops into account when
>> sve_issue_info () is nonnull, while ignoring the pred_ops associated
>> with loop predication (added by analyze_loop_vinfo).  I guess we
>> should maintain two pred_op counts.
>>
>> That's just a suggestion for a possible future improvement though.
>> It doesn't need to be done as part of this patch.
>>
>> Thanks,
>> Richard
>>
>>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>>> +    }
>>> +
>>> +  /* Make sure unroll factor is power of 2.  */
>>> +  return 1 << ceil_log2 (max_unroll_factor);
>>> +}
>>> +
>>>  /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>>     and return the new cost.  */
>>>  unsigned int
>>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>    if (loop_vinfo
>>>        && m_vec_flags
>>>        && aarch64_use_new_vector_costs_p ())
>>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>> -                                           m_costs[vect_body]);
>>> +    {
>>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>> +                                             m_costs[vect_body]);
>>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>>> +    }
>>>
>>>    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>>       the scalar code in the event of a tie, since there is more chance
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15637,6 +15637,7 @@ private:
>    unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>                                   unsigned int);
>    bool prefer_unrolled_loop () const;
> +  unsigned int determine_suggested_unroll_factor ();
>
>    /* True if we have performed one-time initialization based on the
>       vec_info.  */
> @@ -15699,6 +15700,10 @@ private:
>       or vector loop.  There is one entry for each tuning option of
>       interest.  */
>    auto_vec<aarch64_vec_op_count, 2> m_ops;
> +
> +  /* This loop uses a pattern that is not supported by SVE, but is
> +     supported by Advanced SIMD and SVE2.  */
> +  bool m_nosve_pattern = false;
This is true for other things besides AVG_FLOOR/CEIL.  I think we should
just accept that this is an x264 hack ;-) and name it "m_has_avg" instead.

Could you put it with the other bool field?  That would lead to a nicer
layout, although it hardly matters much here.

>  };
>
>  aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
> @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>          as one iteration of the SVE loop.  */
>        if (where == vect_body && m_unrolled_advsimd_niters)
>          m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
> +
> +      if (is_pattern_stmt_p (stmt_info))
> +        {

Not sure this is necessary.  We'd want the same thing if (for whatever
reason) we had a pre-existing IFN_AVG_FLOOR/CEIL in future.

> +          gimple *stmt = stmt_info->stmt;
> +          if (is_gimple_call (stmt)
> +              && gimple_call_internal_p (stmt))
> +            {
> +              enum internal_fn ifn
> +                = gimple_call_internal_fn (stmt);
> +              switch (ifn)

Very minor nit, but might as well switch on gimple_call_internal_fn (stmt)
directly, to avoid the awkward line break.

> +                {
> +                case IFN_AVG_FLOOR:
> +                case IFN_AVG_CEIL:
> +                  m_nosve_pattern = true;
> +                default:
> +                  break;
> +                }
> +            }
> +        }
>      }
>    return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>  }
> @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>    return sve_cycles_per_iter;
>  }
>
> +unsigned int
> +aarch64_vector_costs::determine_suggested_unroll_factor ()
> +{
> +  bool sve = m_vec_flags & VEC_ANY_SVE;
> +  /* If we are trying to unroll an Advanced SIMD main loop that contains
> +     patterns that we do not support with SVE and we might use a predicated
> +     epilogue, we need to be conservative and block unrolling as this might
> +     lead to a less optimal loop for the first and only epilogue using the
> +     original loop's vectorization factor.
> +     TODO: Remove this constraint when we add support for multiple epilogue
> +     vectorization.  */
> +  if (!sve && TARGET_SVE2 && m_nosve_pattern)
> +    return 1;

Shouldn't this be !TARGET_SVE2?

> +
> +  unsigned int max_unroll_factor = 1;
> +  for (auto vec_ops : m_ops)
> +    {
> +      aarch64_simd_vec_issue_info const *vec_issue
> +        = vec_ops.simd_issue_info ();
> +      if (!vec_issue)
> +        return 1;
> +      /* Limit unroll factor to a value adjustable by the user, the default
> +         value is 4.  */
> +      unsigned int unroll_factor = aarch64_vect_unroll_limit;
> +      unsigned int factor
> +        = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
> +      unsigned int temp;
> +
> +      /* Sanity check, this should never happen.  */
> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
> +        return 1;
> +
> +      /* Check stores.  */
> +      if (vec_ops.stores > 0)
> +        {
> +          temp = CEIL (factor * vec_issue->stores_per_cycle,
> +                       vec_ops.stores);
> +          unroll_factor = MIN (unroll_factor, temp);
> +        }
> +
> +      /* Check loads + stores.  */
> +      if (vec_ops.loads > 0)
> +        {
> +          temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
> +                       vec_ops.loads + vec_ops.stores);
> +          unroll_factor = MIN (unroll_factor, temp);
> +        }
> +
> +      /* Check general ops.  */
> +      if (vec_ops.general_ops > 0)
> +        {
> +          temp = CEIL (factor * vec_issue->general_ops_per_cycle,
> +                       vec_ops.general_ops);
> +          unroll_factor = MIN (unroll_factor, temp);
> +        }
> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
> +    }
> +
> +  /* Make sure unroll factor is power of 2.  */
> +  return 1 << ceil_log2 (max_unroll_factor);
> +}
> +
>  /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>     and return the new cost.  */
>  unsigned int
> @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>    if (loop_vinfo
>        && m_vec_flags
>        && aarch64_use_new_vector_costs_p ())
> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> -                                           m_costs[vect_body]);
> +    {
> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> +                                             m_costs[vect_body]);
> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +    }
>
>    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>       the scalar code in the event of a tie, since there is more chance
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
>  -param=aarch64-mops-memset-size-threshold=
>  Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
>  Constant memset size in bytes from which to start using MOPS sequence.
> +
> +-param=aarch64-vect-unroll-limit=
> +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> +Limit how much the autovectorizer may unroll a loop.

This needs to be documented in invoke.texi, alongside the existing
AArch64-specific params.

OK with those changes, thanks.

Richard
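
[Editorial note: the sketch below is not part of the original thread.  It is
a rough, standalone illustration of the heuristic reviewed above: per-resource
throughput divided by per-iteration operation counts, scaled by the reduction
latency, capped by the new --param and rounded to a power of two.  The type
and function names (issue_rates, op_counts, suggested_unroll_factor) are
hypothetical stand-ins, not the real GCC interfaces, and the numbers in the
trailing example are made up.]

#include <algorithm>

/* Per-cycle throughput of the target (stand-in for the aarch64 issue info).  */
struct issue_rates
{
  unsigned stores_per_cycle;
  unsigned loads_stores_per_cycle;
  unsigned general_ops_per_cycle;
};

/* Per-iteration operation counts for one loop body (stand-in for vec_ops).  */
struct op_counts
{
  unsigned loads, stores, general_ops;
  unsigned reduction_latency;
};

/* Round X up to the next power of 2, for X >= 1.  */
static unsigned
ceil_pow2 (unsigned x)
{
  unsigned r = 1;
  while (r < x)
    r <<= 1;
  return r;
}

/* Suggested unroll factor: for each resource class, divide what the pipeline
   can issue per cycle (scaled by the reduction latency that would otherwise
   stall the loop) by what one iteration needs, take the minimum across the
   classes, cap it at LIMIT and round to a power of 2.  */
unsigned
suggested_unroll_factor (const issue_rates &rate, const op_counts &ops,
                         unsigned limit)
{
  if (ops.loads + ops.stores + ops.general_ops == 0)
    return 1;

  auto ceil_div = [] (unsigned a, unsigned b) { return (a + b - 1) / b; };

  unsigned factor = std::max (ops.reduction_latency, 1u);
  unsigned unroll = limit;

  if (ops.stores)
    unroll = std::min (unroll,
                       ceil_div (factor * rate.stores_per_cycle, ops.stores));
  if (ops.loads)
    unroll = std::min (unroll,
                       ceil_div (factor * rate.loads_stores_per_cycle,
                                 ops.loads + ops.stores));
  if (ops.general_ops)
    unroll = std::min (unroll,
                       ceil_div (factor * rate.general_ops_per_cycle,
                                 ops.general_ops));

  return ceil_pow2 (std::max (unroll, 1u));
}

/* Example with made-up numbers: a body with 2 loads, 1 store and 4 general
   ops, no reduction, on a core issuing 2 stores, 4 loads+stores and 4
   general ops per cycle gives min (4, 2, 2, 1) = 1, i.e. no unrolling;
   halving the general-op count to 2 would give min (4, 2, 2, 2) = 2.  */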