+ if (insn != arm_mve_get_loop_vctp (body))
+ {
probably a good idea to invert the condition here and return false,
helps reducing the indenting in this function.
+ /* Starting from the current insn, scan backwards through the insn
+ chain until BB_HEAD: "for each insn in the BB prior to the current".
+ */
There's a trailing whitespace after insn, but also I'd rewrite this bit.
The "for each insn in the BB prior to the current" is superfluous and
even confusing to me. How about:
"Scan backwards from the current INSN through the instruction chain
until the start of the basic block. "
I find 'that previous insn' to be confusing as you don't mention any
previous insn before. So how about something along the lines of:
'If a previous insn defines a register that INSN uses then return true
if...'
Do we need to check: 'insn != prev_insn' ? Any reason why you can't
start the loop with:
'for (rtx_insn *prev_insn = PREV_INSN (insn);'
Now I also found a case where things might go wrong in:
+ /* Look at all the DEFs of that previous insn: if one of them is on
+ the same REG as our current insn, then recurse in order to check
+ that insn's USEs. If any of these insns return true as
+ MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is affected
+ by the change in behaviour from being placed in dlstp/letp loop.
+ */
+ df_ref prev_insn_defs = NULL;
+ FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+ {
+ if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+ && insn != prev_insn
+ && body == BLOCK_FOR_INSN (prev_insn)
+ && !arm_mve_vec_insn_is_predicated_with_this_predicate
+ (insn, vctp_vpr_generated)
+ && arm_mve_check_df_chain_back_for_implic_predic
+ (prev_insn, vctp_vpr_generated))
+ return true;
+ }
The body == BLOCK_FOR_INSN (prev_insn) hinted me at it, if a def comes
from outside of the BB (so outside of the loop's body) then its by
definition unpredicated by vctp. I think you want to check that if
prev_insn defines a register used by insn then return true if prev_insn
isn't in the same BB or has a chain that is not predicated, i.e.:
'!arm_mve_vec_insn_is_predicated_with_this_predicate (insn,
vctp_vpr_generated) && arm_mve_check_df_chain_back_for_implic_predic
prev_insn, vctp_vpr_generated))' you check body != BLOCK_FOR_INSN
(prev_insn)'
I also found some other issues, this currently loloops:
uint16_t test (uint16_t *a, int n)
{
uint16_t res =0;
while (n > 0)
{
mve_pred16_t p = vctp16q (n);
uint16x8_t va = vldrhq_u16 (a);
res = vaddvaq_u16 (res, va);
res = vaddvaq_p_u16 (res, va, p);
a += 8;
n -= 8;
}
return res;
}
But it shouldn't, this is because there's a lack of handling of across
vector instructions. Luckily in MVE all across vector instructions have
the side-effect that they write to a scalar register, even the vshlcq
instruction (it writes to a scalar carry output).
Did this lead me to find an ICE with:
uint16x8_t test (uint16_t *a, int n)
{
uint16x8_t res = vdupq_n_u16 (0);
while (n > 0)
{
uint16_t carry = 0;
mve_pred16_t p = vctp16q (n);
uint16x8_t va = vldrhq_u16 (a);
res = vshlcq_u16 (va, &carry, 1);
res = vshlcq_m_u16 (res, &carry, 1 , p);
a += 8;
n -= 8;
}
return res;
}
This is because:
+ /* If the USE is outside the loop body bb, or it is inside, but
+ is an unpredicated store to memory. */
+ if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+ || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+ (next_use_insn, vctp_vpr_generated)
+ && mve_memory_operand
+ (SET_DEST (single_set (next_use_insn)),
+ GET_MODE (SET_DEST (single_set (next_use_insn))))))
+ return true;
Assumes single_set doesn't return 0.
Let's deal with these issues and I'll continue to review.
On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
Hi all,
This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops. See my previous email
for a general introduction of MVE LOLs.
This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.
Mid-end changes are:
1) Relax the loop-doloop mechanism in the mid-end to allow for
decrement numbers other that -1 and for `count` to be an
rtx containing a simple REG (which in this case will contain
the number of elements to be processed), rather
than an expression for calculating the number of iterations.
2) Added a new df utility function: `df_bb_regno_only_def_find` that
will return the DEF of a REG only if it is DEF-ed once within the
basic block.
And many things in the backend to implement the above optimisation:
3) Implement the `arm_predict_doloop_p` target hook to instruct the
mid-end about Low Overhead Loops (MVE or not), as well as
`arm_loop_unroll_adjust` which will prevent unrolling of any loops
that are valid for becoming MVE Tail_Predicated Low Overhead Loops
(unrolling can transform a loop in ways that invalidate the dlstp/
letp tranformation logic and the benefit of the dlstp/letp loop
would be considerably higher than that of unrolling)
4) Appropriate changes to the define_expand of doloop_end, new
patterns for dlstp and letp, new iterators, unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
* `arm_mve_dlstp_check_dec_counter`
* `arm_mve_dlstp_check_inc_counter`
* `arm_mve_check_reg_origin_is_num_elems`
* `arm_mve_check_df_chain_back_for_implic_predic`
* `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
This all, in smoe way or another, are running checks on the loop
structure in order to determine if the loop is valid for dlstp/letp
transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
doloop_end) this function re-checks for the loop's suitability for
dlstp/letp transformation and then implements it, if possible.
7) Various utility functions:
*`arm_mve_get_vctp_lanes` to map
from vctp unspecs to number of lanes, and
`arm_get_required_vpr_reg`
to check an insn to see if it requires the VPR or not.
* `arm_mve_get_loop_vctp`
* `arm_mve_get_vctp_lanes`
* `arm_emit_mve_unpredicated_insn_to_seq`
* `arm_get_required_vpr_reg`
* `arm_get_required_vpr_reg_param`
* `arm_get_required_vpr_reg_ret_val`
* `arm_mve_vec_insn_is_predicated_with_this_predicate`
* `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?
Thank you,
Stam Markianos-Wright
gcc/ChangeLog:
* config/arm/arm-protos.h (arm_target_insn_ok_for_lob):
Rename to...
(arm_target_bb_ok_for_lob): ...this
(arm_attempt_dlstp_transform): New.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
(TARGET_PREDICT_DOLOOP_P): New.
(arm_block_set_vect):
(arm_target_insn_ok_for_lob): Rename from
arm_target_insn_ok_for_lob.
(arm_target_bb_ok_for_lob): New.
(arm_mve_get_vctp_lanes): New.
(arm_get_required_vpr_reg): New.
(arm_get_required_vpr_reg_param): New.
(arm_get_required_vpr_reg_ret_val): New.
(arm_mve_get_loop_vctp): New.
(arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
(arm_mve_vec_insn_is_predicated_with_this_predicate): New.
(arm_mve_check_df_chain_back_for_implic_predic): New.
(arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
(arm_mve_check_reg_origin_is_num_elems): New.
(arm_mve_dlstp_check_inc_counter): New.
(arm_mve_dlstp_check_dec_counter): New.
(arm_mve_loop_valid_for_dlstp): New.
(arm_predict_doloop_p): New.
(arm_loop_unroll_adjust): New.
(arm_emit_mve_unpredicated_insn_to_seq): New.
(arm_attempt_dlstp_transform): New.
* config/arm/iterators.md (DLSTP): New.
(mode1): Add DLSTP mappings.
* config/arm/mve.md (*predicated_doloop_end_internal): New.
(dlstp<mode1>_insn): New.
* config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
* config/arm/unspecs.md: New unspecs.
* df-core.cc (df_bb_regno_only_def_find): New.
* df.h (df_bb_regno_only_def_find): New.
* loop-doloop.cc (doloop_condition_get): Relax conditions.
(doloop_optimize): Add support for elementwise LoLs.
gcc/testsuite/ChangeLog:
* gcc.target/arm/lob.h: Update framework.
* gcc.target/arm/lob1.c: Likewise.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
* gcc.target/arm/mve/dlstp-int16x8.c: New test.
* gcc.target/arm/mve/dlstp-int32x4.c: New test.
* gcc.target/arm/mve/dlstp-int64x2.c: New test.
* gcc.target/arm/mve/dlstp-int8x16.c: New test.
* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.