Hi Richard,
> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>> > +	  res = force_reg (mode, res);
>>
>> In general, this can be dangerous performance-wise on targets where
>> subregs are free.  If the move survives to the register allocators,
>> it increases the risk that the move will become a machine insn.
>> (The RA will prefer to tie the registers, but that isn't guaranteed.)

I believe I can use exactly the same test case, but compiled on ARM, to
demonstrate that placing reused intermediate values in pseudo registers
instead of SUBREGs is potentially preferable, and at least no worse
(certainly not "dangerous").

The minimal test case for t120_smul is:

void bar(void);

int t120_1smul (int x, int y)
{
  int r;
  if (__builtin_smul_overflow (x, y, &r))
    bar ();
  return r;
}

which when compiled on arm-linux-gnueabihf generates, both with and
without my patch, the very impressive, near-optimal sequence:

t120_1smul:
	mov	r3, r0
	smull	r0, r2, r3, r1
	cmp	r2, r0, asr #31
	bxeq	lr
	str	lr, [sp, #-4]!
	sub	sp, sp, #12
	str	r0, [sp, #4]
	bl	bar
	ldr	r0, [sp, #4]
	add	sp, sp, #12
	ldr	pc, [sp], #4

Notice that the first argument (output) of smull, r0, is used multiple
times, but thanks to fwprop both the SUBREG and the pseudo forms are
equivalent to the result register, r0, which reload respects.  That GCC
puts the lopart of the widening multiplication directly in the result
register is awesome.  The use of force_reg in my patch clearly doesn't
hurt.

In fact, the use of a pseudo helps insulate some dubious code expanded
by the ARM backend; the previous assignment to the register that is
forced into a pseudo register actually looks like:

(insn 10 9 11 2 (set (subreg:SI (reg:DI 120) 0)
        (ashiftrt:SI (subreg:SI (reg:DI 119) 4)
            (const_int 0 [0]))) "t120_1smul.c":5:7 -1
     (nil))

Fortunately, this cryptic move instruction eventually gets cleaned up
by the middle-end.
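For readers puzzling over the "cmp r2, r0, asr #31" above, here is a
sketch in C of the check the smull/cmp pair implements (the helper name
smul_overflows is mine, not GCC's): a 32x32->64 signed multiply
overflows the 32-bit signed range exactly when the high word of the
product differs from the sign extension of the low word.

```c
#include <stdint.h>

/* Sketch of the overflow test the generated code performs.
   Assumes the usual arithmetic right shift of negative values,
   as GCC defines it.  */
static int smul_overflows (int32_t x, int32_t y)
{
  int64_t p = (int64_t)x * y;       /* smull: lo in r0, hi in r2 */
  int32_t lo = (int32_t)p;          /* r0 */
  int32_t hi = (int32_t)(p >> 32);  /* r2 */
  return hi != (lo >> 31);          /* cmp r2, r0, asr #31 */
}
```

When the comparison succeeds (no overflow), the generated code returns
via bxeq without ever spilling, which is why the sequence above is so
compact.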
[Minor aside: this shift by zero can be avoided by the following patch]

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index e15d286..5225df6 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -31324,7 +31324,10 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code, rtx out, rtx in,
       if (REG_P (out) && REG_P (in) && REGNO (out) != REGNO (in))
 	emit_insn (SET (out, const0_rtx));
 
-      emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+      if (adj_amount != const0_rtx)
+	emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+      else
+	emit_insn (SET (out_down, in_up));
       if (code == ASHIFTRT)
 	emit_insn (gen_ashrsi3 (out_up, in_up, GEN_INT (31)));

But to (perhaps) convince you (folks) that pseudo registers are
preferable to SUBREGs, we need to take a look at the first two
instructions generated in the above test case.  Instead of the current:

	mov	r3, r0
	smull	r0, r2, r3, r1

the optimal sequence on ARM (if my understanding of the ARM
documentation on register constraints is correct) would be the single
instruction:

	smull	r0, r2, r1, r0

as the first three argument registers must be different, but the fourth
and final register doesn't conflict with the first three.  We know that
we require the first output to be r0, but we can use the commutativity
of multiplication to swap r0 and r1 in the third and fourth positions,
saving both an instruction and a register!  Alas, my attempts to
convince GCC to generate this sequence have been unsuccessful.
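To see why the shift-by-zero insn arises at all, here is my reading of
the lowering that arm_emit_coreregs_64bit_shift performs for an
arithmetic right shift by n >= 32, sketched in C (the helper name
ashr64_by_32 is mine): out_down gets in_up shifted by adj_amount =
n - 32, which is zero precisely when n == 32, giving the cryptic
ashiftrt-by-0 move shown above; out_up gets in_up >> 31 for the sign
extension.

```c
#include <stdint.h>

/* Sketch of the core-register lowering of a 64-bit arithmetic right
   shift by exactly 32, the case where adj_amount is const0_rtx.
   Assumes GCC's arithmetic right shift of negative values.  */
static int64_t ashr64_by_32 (int64_t in)
{
  int32_t in_up = (int32_t)(in >> 32);        /* high core register */
  uint32_t out_down = (uint32_t)(in_up >> 0); /* the shift-by-zero move */
  int32_t out_up = in_up >> 31;               /* gen_ashrsi3 (..., 31) */
  return (int64_t)(((uint64_t)(uint32_t)out_up << 32) | out_down);
}
```

Under this reading, the aside patch simply replaces the degenerate
shift with a plain register move, which expresses the same computation.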
The closest I can get is by using something like:

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index a6a31f8..ece9175 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -2409,11 +2409,11 @@
 )
 
 (define_insn "<US>mull"
-  [(set (match_operand:SI 0 "s_register_operand" "=r,&r")
+  [(set (match_operand:SI 0 "s_register_operand" "=r,&r,&r,&r")
 	(mult:SI
-	 (match_operand:SI 2 "s_register_operand" "%r,r")
-	 (match_operand:SI 3 "s_register_operand" "r,r")))
-   (set (match_operand:SI 1 "s_register_operand" "=r,&r")
+	 (match_operand:SI 2 "s_register_operand" "%r,r,r,r")
+	 (match_operand:SI 3 "s_register_operand" "r,r,0,1")))
+   (set (match_operand:SI 1 "s_register_operand" "=r,&r,&r,&r")
 	(truncate:SI
 	 (lshiftrt:DI
 	  (mult:DI (SE:DI (match_dup 2)) (SE:DI (match_dup 3)))
@@ -2422,7 +2422,7 @@
   "<US>mull%?\\t%0, %1, %2, %3"
   [(set_attr "type" "umull")
    (set_attr "predicable" "yes")
-   (set_attr "arch" "v6,nov6")]
+   (set_attr "arch" "v6,nov6,nov6,nov6")]
 )

which is one step closer, with reload/GCC now generating:

	mov	r3, r1
	smull	r0, r1, r3, r0

This has the correct form, with r0 as the first and fourth arguments,
and requires one fewer register, but notice that mysteriously we still
have the (unnecessary) mov instruction!  I suspect that this may be a
consequence of using SUBREGs instead of pseudo registers.  Perhaps, by
insisting that the DImode result of this instruction is constrained to
consecutive registers, the ARM backend produces worse code than it
needs to?

If anyone can convince reload to generate a single "smull r0, r2, r1,
r0" for this test case on ARM, then I'll admit that they are a better
compiler wrangler than me.

I hope this was convincing.  The take-home message is that the
single-line force_reg change really is safe.  I'm happy to follow up
with design philosophy, RTL sharing, and ARM-relevant supporting
arguments on why pseudo register usage is good.

Best regards,
Roger
--