Re: [PATCH v2] AArch64: Block combine_and_move from creating FP literal loads

Richard Sandiford Thu, 09 Jan 2025 06:45:31 -0800

Wilco Dijkstra <wilco.dijks...@arm.com> writes:
> Hi Richard,
>
>> The patch below is what I meant.  It passes bootstrap & regression-test
>> on aarch64-linux-gnu (and so produces the same results for the tests
>> that you changed).  Do you see any problems with this version?
>> If not, I think we should go with it.
>
> Thanks for the detailed example - unfortunately there are issues with it.
> Early expansion means more instructions to deal with in RTL and fewer
> optimizations - it even affects inlining (I see more calls/returns in the
> instruction frequencies).


Do you have an example?  It seems odd that rtl changes after cfgexpand
would affect inlining.

> Worse, this change completely disables rematerialization of FP immediates
> which implies extra spilling. A basic example goes like this:
>
> void g(void);
> double bad_remat (double x)
> {
>   x += 5.347897294;
>   g();
>   x *= 5.347897294;
>   return x;
> }
>
> which with -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10 
> -ffixed-d11 -ffixed-d12 -ffixed-d13 -ffixed-d14 now compiles to:
>
>         adrp    x0, .LC0
>         str     x30, [sp, -32]!
>         ldr     d31, [x0, #:lo12:.LC0]
>         str     d15, [sp, 8]
>         fadd    d15, d0, d31
>         str     d31, [sp, 24]
>         bl      g
>         ldr     d31, [sp, 24]
>         fmul    d0, d15, d31
>         ldr     d15, [sp, 8]
>         ldr     x30, [sp], 32
>         ret

Hmm, interesting.  Thanks for the example.

The reason this works with your patch and not mine seems to be a direct
consequence of the fix for PR37273:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37273

It means that, if we start with:

   (set (reg R) (mem const_pool))
     REG_EQUIV: legitimate_const

then the final costing pass starts out with a significantly negative
memory cost for R.  Assigning memory is taken to be a saving:

  if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set))
      && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX
      && ((MEM_P (XEXP (note, 0))
           && !side_effects_p (SET_SRC (set)))
          || (CONSTANT_P (XEXP (note, 0))
              && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)),
                                                XEXP (note, 0))
              && REG_N_SETS (REGNO (SET_DEST (set))) == 1))
      && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set)))
      /* LRA does not use equiv with a symbol for PIC code.  */
      && (! ira_use_lra_p || ! pic_offset_table_rtx
          || ! contains_symbol_ref_p (XEXP (note, 0))))

Thus for your patch IRA assigns a call-clobbered register while for
my patch it assumes that memory is cheaper than using caller saves.
If I disable the CONSTANT_P part of the check, to mimic what IRA
does with your patch, then I get the expected output.

The patch for PR37273 predates LRA and so was aimed at reload's
version of caller saves.  AIUI, the problem there was that the
caller saves didn't take advantage of equivalences and so would
always save to the stack.  LRA is smarter than that, so I suspect
the check could be dropped.  We should then get better code in
cases where the rematerialised pseudo is used multiple times
between two calls.

Even so, the code above is clearly assuming that the equivalence
will be replaced by a memory where necessary.  That doesn't happen
due to the following code in lra_constraints:

                /* If it is not a reverse equivalence, we check that a
                   pseudo in rhs of the init insn is not dying in the
                   insn.  Otherwise, the live info at the beginning of
                   the corresponding BB might be wrong after we
                   removed the insn.  When the equiv can be a
                   constant, the right hand side of the init insn can
                   be a pseudo.  */
                || (! reverse_equiv_p (i)
                    && (init_insn_rhs_dead_pseudo_p (i)
                        /* If we reloaded the pseudo in an equivalence
                           init insn, we cannot remove the equiv init
                           insns and the init insns might write into
                           const memory in this case.  */
                        || contains_reloaded_insn_p (i)))

This specifically mentions constants, but the behaviour seems odd
for them.  We don't use init_insns directly to rematerialise the
constant.  We simply replace the pseudo with the constant and reload
it afresh.

So I wouldn't have expected the check above to be necessary for
constants.  In particular, we check elsewhere whether the constant
is legitimate, and things like that.  Adding !CONSTANT_P (x)
to that condition also "fixes" the testcase.

Note that if you change the constant in your example to (say) 7902,
the output with current trunk is similarly bad:

        str     x30, [sp, -32]!
        mov     x0, 244091581366272
        movk    x0, 0x40be, lsl 48
        str     d15, [sp, 8]
        fmov    d15, x0
        fadd    d0, d0, d15
        str     d0, [sp, 24]
        bl      g
        ldr     d0, [sp, 24]
        fmul    d0, d0, d15
        ldr     d15, [sp, 8]
        ldr     x30, [sp], 32
        ret

Even without the secondary reload you mention below, the RA should be
able to reload from the constant pool rather than the stack.

So ISTM that there is an underlying problem that current trunk
happens to work around in some cases.  I don't think what the
move patterns are doing is right on first principles.

> Recent changes have been moving in the opposite direction - keeping
> high-level constructs (like GOT accesses) as a single operation works out
> better for register allocation and allows more optimization.

But I think we should look more into why that is, at least in this case.
(I've tried to start that process above.)  The effect is to hide address
arithmetic from the main allocator (IRA) and only expose it during LRA.
This means that, in high register GPR pressure, LRA has to scavenge and
spill more than it needs to.

Since having insn conditions dependent on can_create_pseudo_p is IMO
wrong in principle, and since the patch wasn't approved in its committed
form, I think we should revert it for GCC 15 and revisit this for GCC 16.

Thanks,
Richard

> So keeping FP immediates as standard move instructions until regalloc
> is best. Supporting MOV/FMOV in regalloc would require another secondary
> reload (and would then allow rematerialization of these constants).
>
> Cheers,
> Wilco

Re: [PATCH v2] AArch64: Block combine_and_move from creating FP literal loads

Reply via email to