"Andre Vieira (lists)" <[email protected]> writes:
> Hi,
>
> This patch implements the costing function
> determine_suggested_unroll_factor for aarch64.
> It determines the unrolling factor by dividing the number of X
> operations we can issue per cycle by the number of X operations in the
> loop body, taking this information from the vec_ops analysis during
> vector costing and from the available issue_info.
> We multiply the dividend by the reduction latency, if any, to improve
> pipeline utilization when we would otherwise stall waiting on a
> particular reduction operation.
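To make the arithmetic concrete (illustrative numbers only, not taken
from any particular tuning): if the issue info allows 2 general ops per
cycle, the loop body contains 4 general ops and the reduction latency
is 4, the candidate factor for that resource is CEIL (4 * 2, 4) = 2.
The suggested factor is then the minimum of the per-resource
candidates, capped at 4 and rounded up to a power of 2.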
>
> Right now we also have a workaround for vectorization choices where the
> main loop uses a NEON mode and predication is available: if the main
> loop makes use of a NEON pattern that is not directly supported by SVE,
> we do not unroll, as that might cause performance regressions in cases
> where we would otherwise have entered a loop using the original main
> loop's VF. As an example, if you have a loop where you could use
> AVG_CEIL with a V8HI mode, you would originally get an 8x NEON main
> loop using AVG_CEIL followed by an 8x SVE predicated epilogue using
> other instructions. With the unrolling you would instead end up with a
> 16x AVG_CEIL NEON main loop + an 8x SVE predicated epilogue, thus
> skipping the original 8x NEON loop. In the future, we could handle
> this differently, by either using a different costing model for
> epilogues, or potentially vectorizing more than one epilogue.
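For reference, a loop of roughly this shape (an illustrative testcase,
not part of the patch) is what the vectorizer turns into IFN_AVG_CEIL;
on Advanced SIMD it maps to URHADD, which has no SVE equivalent before
SVE2:

  void
  avg_ceil (unsigned short *restrict res, unsigned short *restrict a,
            unsigned short *restrict b, int n)
  {
    /* Rounding average: recognised by the vectorizer's average
       pattern and lowered to IFN_AVG_CEIL (URHADD on V8HI).  */
    for (int i = 0; i < n; i++)
      res[i] = (a[i] + b[i] + 1) >> 1;
  }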
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (aarch64_vector_costs): Declare
> 	determine_suggested_unroll_factor.
> 	(determine_suggested_unroll_factor): New function.
> 	(aarch64_vector_costs::finish_cost): Use
> 	determine_suggested_unroll_factor.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15680,6 +15680,7 @@ private:
> unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
> unsigned int);
> bool prefer_unrolled_loop () const;
> + unsigned int determine_suggested_unroll_factor ();
>
> /* True if we have performed one-time initialization based on the
> vec_info. */
> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
> return sve_cycles_per_iter;
> }
>
> +unsigned int
> +aarch64_vector_costs::determine_suggested_unroll_factor ()
> +{
> + auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
> + if (!issue_info)
> + return 1;
> + bool sve = false;
> + if (aarch64_sve_mode_p (m_vinfo->vector_mode))
Other code uses m_vec_flags & VEC_ANY_SVE for this.
> + {
> + if (!issue_info->sve)
> + return 1;
> + sve = true;
> + }
> + else
> + {
> + if (!issue_info->advsimd)
> + return 1;
The issue info should instead be taken from vec_ops.simd_issue_info ()
in the loop below. It can vary for each entry.
> + /* If we are trying to unroll a NEON main loop that contains patterns
s/a NEON/an Advanced SIMD/
> + that we do not support with SVE and we might use a predicated
> + epilogue, we need to be conservative and block unrolling, as this
> + might lead to a less optimal loop for the first and only epilogue,
> + which uses the original loop's vectorization factor.
> + TODO: Remove this constraint when we add support for multiple epilogue
> + vectorization. */
> + if (partial_vectors_supported_p ()
> + && param_vect_partial_vector_usage != 0
> + && !TARGET_SVE2)
> + {
> + unsigned int i;
> + stmt_vec_info stmt_vinfo;
> + FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
> + {
> + if (is_pattern_stmt_p (stmt_vinfo))
> + {
> + gimple *stmt = stmt_vinfo->stmt;
> + if (is_gimple_call (stmt)
> + && gimple_call_internal_p (stmt))
> + {
> + enum internal_fn ifn
> + = gimple_call_internal_fn (stmt);
> + switch (ifn)
> + {
> + case IFN_AVG_FLOOR:
> + case IFN_AVG_CEIL:
> + return 1;
> + default:
> + break;
> + }
> + }
> + }
> + }
> + }
I think we should instead record this during add_stmt_cost, like we do
for the other data that this function uses.
We need to drop the partial_vectors_supported_p () condition though:
enabling SVE must not make Advanced SIMD code worse. So for now,
Advanced SIMD-only code will have to suffer the missed unrolling
opportunity too.
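An untested sketch of that, inside aarch64_vector_costs::add_stmt_cost
(m_has_avg_pattern is a placeholder member name, not existing code):

  /* Sketch: record that the loop body uses an average pattern that
     SVE (before SVE2) cannot match directly.  */
  if (stmt_info && is_pattern_stmt_p (stmt_info))
    if (gcall *call = dyn_cast<gcall *> (stmt_info->stmt))
      if (gimple_call_internal_p (call))
        switch (gimple_call_internal_fn (call))
          {
          case IFN_AVG_FLOOR:
          case IFN_AVG_CEIL:
            m_has_avg_pattern = true;
            break;
          default:
            break;
          }

determine_suggested_unroll_factor would then only need to test the
flag.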
> + }
> +
> + unsigned int max_unroll_factor = 1;
> + const aarch64_simd_vec_issue_info *vec_issue
> + = sve ? issue_info->sve : issue_info->advsimd;
> + for (auto vec_ops : m_ops)
> + {
> + /* Limit unroll factor to 4 for now. */
> + unsigned int unroll_factor = 4;
Would be good to make this an aarch64-specific --param in case people
want to play with different values.
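A sketch of what that could look like in config/aarch64/aarch64.opt
(the option name is illustrative):

  -param=aarch64-vect-unroll-limit=
  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
  Limit how much the autovectorizer may unroll a loop.

unroll_factor above would then start from aarch64_vect_unroll_limit
instead of the hard-coded 4.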
> + unsigned int factor
> + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
> + unsigned int temp;
> +
> + /* Sanity check: this should never happen. */
> + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
> + return 1;
> +
> + /* Check stores. */
> + if (vec_ops.stores > 0)
> + {
> + temp = CEIL (factor * vec_issue->stores_per_cycle,
> + vec_ops.stores);
> + unroll_factor = MIN (unroll_factor, temp);
> + }
> +
> + /* Check loads. */
> + if (vec_ops.loads > 0)
> + {
> + temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
> + vec_ops.loads);
This should use loads + stores rather than just loads.
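For illustration (made-up numbers): with 2 loads and 2 stores per
iteration and loads_stores_per_cycle == 3, dividing by loads alone
gives CEIL (3, 2) = 2, while the shared load/store pipes can only
sustain CEIL (3, 4) = 1.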
> + unroll_factor = MIN (unroll_factor, temp);
> + }
> +
> + /* Check general ops. */
> + if (vec_ops.general_ops > 0)
> + {
> + temp = CEIL (factor * vec_issue->general_ops_per_cycle,
> + vec_ops.general_ops);
> + unroll_factor = MIN (unroll_factor, temp);
> + }
In principle we should also take pred_ops into account when
sve_issue_info () is nonnull, while ignoring the pred_ops associated
with loop predication (added by analyze_loop_vinfo). I guess we
should maintain two pred_op counts.
That's just a suggestion for a possible future improvement though.
It doesn't need to be done as part of this patch.
Thanks,
Richard
> + max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
> + }
> +
> + /* Make sure the unroll factor is a power of 2. */
> + return 1 << ceil_log2 (max_unroll_factor);
> +}
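As a worked example of the final rounding: a computed maximum of 3
becomes 1 << ceil_log2 (3) == 4, so with the cap of 4 the suggested
factors are only ever 1, 2 or 4.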
> +
> /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary
> and return the new cost. */
> unsigned int
> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> if (loop_vinfo
> && m_vec_flags
> && aarch64_use_new_vector_costs_p ())
> - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> - m_costs[vect_body]);
> + {
> + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> + m_costs[vect_body]);
> + m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> + }
>
> /* Apply the heuristic described above m_stp_sequence_cost. Prefer
> the scalar code in the event of a tie, since there is more chance