Hi Richard,

> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>> > +        res = force_reg (mode, res);
>> 
>> In general, this can be dangerous performance-wise on targets where 
>> subregs are free.  If the move survives to the register allocators, 
>> it increases the risk that the move will become a machine insn.
>> (The RA will prefer to tie the registers, but that isn't guaranteed.)

I believe I can use exactly the same test case, but compiled on ARM, to
demonstrate that placing reused intermediate values in pseudo registers
instead of SUBREGs is potentially preferable and at least no worse
(certainly not "dangerous").
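
To make this concrete, here is a minimal sketch of the idiom under
discussion (hypothetical surrounding code, not the actual patch
context; it assumes the target provides a mulsidi3 pattern):

  /* Compute a DImode product whose low SImode half is reused.
     Without force_reg, LO would be (subreg:SI (reg:DI ...) 0);
     with it, LO is a fresh pseudo that the RA may (or may not) tie.  */
  rtx prod = gen_reg_rtx (DImode);
  emit_insn (gen_mulsidi3 (prod, op0, op1));
  rtx lo = gen_lowpart (SImode, prod);
  lo = force_reg (SImode, lo);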

The minimal test case for t120_1smul is:

void bar(void);
int t120_1smul (int x, int y)
{
  int r;
  if (__builtin_smul_overflow (x, y, &r))
    bar ();
  return r;
}

which when compiled on arm-linux-gnueabihf generates, both with and
without my patch, the very impressive, near-optimal sequence:

t120_1smul:
        mov     r3, r0
        smull   r0, r2, r3, r1
        cmp     r2, r0, asr #31
        bxeq    lr
        str     lr, [sp, #-4]!
        sub     sp, sp, #12
        str     r0, [sp, #4]
        bl      bar
        ldr     r0, [sp, #4]
        add     sp, sp, #12
        ldr     pc, [sp], #4
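
For reference, the "cmp r2, r0, asr #31" above is the signed overflow
test itself: the DImode product overflows SImode exactly when its high
half differs from the sign extension of its low half.  In C (a sketch
assuming 32-bit int and arithmetic right shift, as on this target):

  long long p = (long long) x * y;   /* smull r0, r2, r3, r1   */
  int lo = (int) p;                  /* low half, lives in r0  */
  int hi = (int) (p >> 32);          /* high half, lives in r2 */
  int overflow = hi != (lo >> 31);   /* cmp r2, r0, asr #31    */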

Notice that the first argument (output) of smull, r0, is used multiple
times, but thanks to fwprop both the SUBREG and the pseudo forms are
equivalent to the result register, r0, which reload respects.  That
GCC puts the lopart of the widening multiplication directly in the
result register is awesome.  The use of force_reg in my patch clearly
doesn't hurt.

In fact, the use of a pseudo helps insulate some dubious code
expanded by the ARM backend; the previous assignment to the register
that is forced into a pseudo actually looks like:

(insn 10 9 11 2 (set (subreg:SI (reg:DI 120) 0)
        (ashiftrt:SI (subreg:SI (reg:DI 119) 4)
            (const_int 0 [0]))) "t120_1smul.c":5:7 -1
     (nil))

Fortunately this cryptic move instruction eventually gets cleaned up
by the middle-end.
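
In other words, it is a register-to-register move spelled as an
arithmetic right shift by zero; after cleanup it is equivalent to
something like:

  (set (subreg:SI (reg:DI 120) 0)
       (subreg:SI (reg:DI 119) 4))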

[Minor aside: this shift by zero can be avoided by the following patch.]
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index e15d286..5225df6 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -31324,7 +31324,10 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code, rtx out, rtx in,

          if (REG_P (out) && REG_P (in) && REGNO (out) != REGNO (in))
            emit_insn (SET (out, const0_rtx));
-         emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+         if (adj_amount != const0_rtx)
+           emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+         else
+           emit_insn (SET (out_down, in_up));
          if (code == ASHIFTRT)
            emit_insn (gen_ashrsi3 (out_up, in_up,
                                    GEN_INT (31)));
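
(A small note on the test: comparing adj_amount against const0_rtx by
pointer is the usual GCC idiom, since small CONST_INTs are shared and
a zero shift amount in this branch is always the singleton const0_rtx.)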

But to (perhaps) convince you (folks) that pseudo registers are
preferable to SUBREGs, we need to take a look at the first two
instructions generated in the above test case.

Instead of the current:
        mov     r3, r0
        smull   r0, r2, r3, r1

The optimal sequence on ARM (if my understanding of the ARM
documentation on register constraints is correct) would be the single
instruction:

        smull   r0, r2, r1, r0

The first three argument registers must all be different, but the
fourth and final register doesn't conflict with the first three.  We
know that we require the first output to be r0, but we can use the
commutativity of multiplication to swap r0 and r1 in the third and
fourth positions, saving both an instruction and a register!
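
(Note that the define_insn below already marks operands 2 and 3 as
commutative, via the "%" constraint modifier, so the register
allocator is in principle free to make exactly this swap.)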

Alas, my attempts to convince GCC to generate this sequence have been
unsuccessful.  The closest I can get is by using something like:

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index a6a31f8..ece9175 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -2409,11 +2409,11 @@
 )

 (define_insn "<US>mull"
-  [(set (match_operand:SI 0 "s_register_operand" "=r,&r")
+  [(set (match_operand:SI 0 "s_register_operand" "=r,&r,&r,&r")
        (mult:SI
-        (match_operand:SI 2 "s_register_operand" "%r,r")
-        (match_operand:SI 3 "s_register_operand" "r,r")))
-   (set (match_operand:SI 1 "s_register_operand" "=r,&r")
+        (match_operand:SI 2 "s_register_operand" "%r,r,r,r")
+        (match_operand:SI 3 "s_register_operand" "r,r,0,1")))
+   (set (match_operand:SI 1 "s_register_operand" "=r,&r,&r,&r")
        (truncate:SI
         (lshiftrt:DI
          (mult:DI (SE:DI (match_dup 2)) (SE:DI (match_dup 3)))
@@ -2422,7 +2422,7 @@
   "<US>mull%?\\t%0, %1, %2, %3"
   [(set_attr "type" "umull")
    (set_attr "predicable" "yes")
-   (set_attr "arch" "v6,nov6")]
+   (set_attr "arch" "v6,nov6,nov6,nov6")]
 )
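
For the unfamiliar: in the two new alternatives, the "0" and "1"
matching constraints on operand 3 ask reload to place operand 3 in the
same register as operand 0 or operand 1 respectively, which is what
would allow the fourth smull argument to share a register with one of
the outputs.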

This is one step closer, with reload/GCC now generating:

        mov     r3, r1
        smull   r0, r1, r3, r0

This has the correct form, with r0 as the first and fourth arguments,
and requires one fewer register, but mysteriously we still have the
(unnecessary) mov instruction!

I suspect that this may be a consequence of using SUBREGs instead of
pseudo registers.  Perhaps, by insisting that the DImode result of
this instruction is constrained to live in consecutive registers, the
ARM backend produces worse code than it needs to?

If anyone can convince reload to generate a single
"smull r0, r2, r1, r0" for this testcase on ARM, then I'll admit that
they are a better compiler wrangler than me.


I hope this was convincing.  The take-home message is that the
single-line force_reg change really is safe.  I'm happy to follow up
with design philosophy, RTL sharing, and ARM-relevant supporting
arguments on why pseudo register usage is good.


Best regards,
Roger
--