"Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes:
> Hi,
>
> This patch implements the costing function
> determine_suggested_unroll_factor for aarch64.
> It determines the unrolling factor by dividing the number of X
> operations we can do per cycle by the number of X operations in the loop
> body, taking this information from the vec_ops analysis during vector
> costing and the available issue_info information.
> We multiply the dividend by a potential reduction_latency, to improve
> our pipeline utilization if we are stalled waiting on a particular
> reduction operation.
>
> Right now we also have a workaround for vectorization choices where the
> main loop uses a NEON mode and predication is available, such that if
> the main loop makes use of a NEON pattern that is not directly supported
> by SVE we do not unroll, as that might cause performance regressions in
> cases where we would enter the original main loop's VF.  As an example, if
> you have a loop where you could use AVG_CEIL with a V8HI mode, you would
> originally get 8x NEON using AVG_CEIL followed by an 8x SVE predicated
> epilogue, using other instructions.  Whereas with the unrolling you would
> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping
> the original 8x NEON.  In the future, we could handle this differently,
> by either using a different costing model for epilogues, or potentially
> vectorizing more than one epilogue.
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (aarch64_vector_costs): Define
> 	determine_suggested_unroll_factor.
> 	(determine_suggested_unroll_factor): New function.
> 	(aarch64_vector_costs::finish_cost): Use
> 	determine_suggested_unroll_factor.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15680,6 +15680,7 @@ private:
>    unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>                                   unsigned int);
>    bool prefer_unrolled_loop () const;
> +  unsigned int determine_suggested_unroll_factor ();
>
>    /* True if we have performed one-time initialization based on the
>       vec_info.  */
> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>    return sve_cycles_per_iter;
> }
>
> +unsigned int
> +aarch64_vector_costs::determine_suggested_unroll_factor ()
> +{
> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
> +  if (!issue_info)
> +    return 1;
> +  bool sve = false;
> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))

Other code uses m_vec_flags & VEC_ANY_SVE for this.

> +    {
> +      if (!issue_info->sve)
> +        return 1;
> +      sve = true;
> +    }
> +  else
> +    {
> +      if (!issue_info->advsimd)
> +        return 1;

The issue info should instead be taken from vec_ops.simd_issue_info ()
in the loop below.  It can vary for each entry.

> +      /* If we are trying to unroll a NEON main loop that contains patterns

s/a NEON/an Advanced SIMD/

> +         that we do not support with SVE and we might use a predicated
> +         epilogue, we need to be conservative and block unrolling as this might
> +         lead to a less optimal loop for the first and only epilogue using the
> +         original loop's vectorization factor.
> +         TODO: Remove this constraint when we add support for multiple epilogue
> +         vectorization.  */
> +      if (partial_vectors_supported_p ()
> +          && param_vect_partial_vector_usage != 0
> +          && !TARGET_SVE2)
> +        {
> +          unsigned int i;
> +          stmt_vec_info stmt_vinfo;
> +          FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
> +            {
> +              if (is_pattern_stmt_p (stmt_vinfo))
> +                {
> +                  gimple *stmt = stmt_vinfo->stmt;
> +                  if (is_gimple_call (stmt)
> +                      && gimple_call_internal_p (stmt))
> +                    {
> +                      enum internal_fn ifn
> +                        = gimple_call_internal_fn (stmt);
> +                      switch (ifn)
> +                        {
> +                        case IFN_AVG_FLOOR:
> +                        case IFN_AVG_CEIL:
> +                          return 1;
> +                        default:
> +                          break;
> +                        }
> +                    }
> +                }
> +            }
> +        }

I think we should instead record this during add_stmt_cost, like we do
for the other data that this function uses.

We need to drop the partial_vectors_supported_p () condition though:
enabling SVE must not make Advanced SIMD code worse.  So for now,
Advanced SIMD-only code will have to suffer the missed unrolling
opportunity too.

> +    }
> +
> +  unsigned int max_unroll_factor = 1;
> +  aarch64_simd_vec_issue_info const *vec_issue
> +    = sve ? issue_info->sve : issue_info->advsimd;
> +  for (auto vec_ops : m_ops)
> +    {
> +      /* Limit unroll factor to 4 for now.  */
> +      unsigned int unroll_factor = 4;

Would be good to make this an aarch64-specific --param in case people
want to play with different values.

> +      unsigned int factor
> +        = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
> +      unsigned int temp;
> +
> +      /* Sanity check, this should never happen.  */
> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
> +        return 1;
> +
> +      /* Check stores.  */
> +      if (vec_ops.stores > 0)
> +        {
> +          temp = CEIL (factor * vec_issue->stores_per_cycle,
> +                       vec_ops.stores);
> +          unroll_factor = MIN (unroll_factor, temp);
> +        }
> +
> +      /* Check loads.  */
> +      if (vec_ops.loads > 0)
> +        {
> +          temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
> +                       vec_ops.loads);

This should use loads + stores rather than just loads.

> +          unroll_factor = MIN (unroll_factor, temp);
> +        }
> +
> +      /* Check general ops.  */
> +      if (vec_ops.general_ops > 0)
> +        {
> +          temp = CEIL (factor * vec_issue->general_ops_per_cycle,
> +                       vec_ops.general_ops);
> +          unroll_factor = MIN (unroll_factor, temp);
> +        }

In principle we should also take pred_ops into account when
sve_issue_info () is nonnull, while ignoring the pred_ops associated
with loop predication (added by analyze_loop_vinfo).  I guess we should
maintain two pred_op counts.

That's just a suggestion for a possible future improvement though.
It doesn't need to be done as part of this patch.

Thanks,
Richard

> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
> +    }
> +
> +  /* Make sure unroll factor is power of 2.  */
> +  return 1 << ceil_log2 (max_unroll_factor);
> +}
> +
> /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>    and return the new cost.  */
> unsigned int
> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>    if (loop_vinfo
>        && m_vec_flags
>        && aarch64_use_new_vector_costs_p ())
> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> -                                           m_costs[vect_body]);
> +    {
> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> +                                             m_costs[vect_body]);
> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +    }
>
>    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>       the scalar code in the event of a tie, since there is more chance