Re: Add an "early rematerialisation" pass

Richard Sandiford Fri, 15 Dec 2017 16:31:57 -0800

Jeff Law <l...@redhat.com> writes:
> On 11/17/2017 08:58 AM, Richard Sandiford wrote:
>> This patch looks for pseudo registers that are live across a call
>> and for which no call-preserved hard registers exist.  It then
>> recomputes the pseudos as necessary to ensure that they are no
>> longer live across a call.  The comment at the head of the file
>> describes the approach.
>> 
>> A new target hook selects which modes should be treated in this way.
>> By default none are, in which case the pass is skipped very early.
>> 
>> It might also be worth looking for cases like:
>> 
>>    C1: R1 := f (...)
>>    ...
>>    C2: R2 := f (...)
>>    C3: R1 := C2
>> 
>> and giving the same value number to C1 and C3, effectively treating
>> it like:
>> 
>>    C1: R1 := f (...)
>>    ...
>>    C2: R2 := f (...)
>>    C3: R1 := f (...)
>> 
>> Another (much more expensive) enhancement would be to apply value
>> numbering to all pseudo registers (not just rematerialisation
>> candidates), so that we can handle things like:
>> 
>>   C1: R1 := f (...R2...)
>>   ...
>>   C2: R1 := f (...R3...)
>> 
>> where R2 and R3 hold the same value.  But the current pass seems
>> to catch the vast majority of cases.
>> 
>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>> and powerpc64le-linux-gnu.  OK to install?
>> 
>> Richard
>> 
>> 
>> 2017-11-17  Richard Sandiford  <richard.sandif...@linaro.org>
>> 
>> gcc/
>>      * Makefile.in (OBJS): Add early-remat.o.
>>      * target.def (select_early_remat_modes): New hook.
>>      * doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
>>      * doc/tm.texi: Regenerate.
>>      * targhooks.h (default_select_early_remat_modes): Declare.
>>      * targhooks.c (default_select_early_remat_modes): New function.
>>      * timevar.def (TV_EARLY_REMAT): New timevar.
>>      * passes.def (pass_early_remat): New pass.
>>      * tree-pass.h (make_pass_early_remat): Declare.
>>      * early-remat.c: New file.
>>      * config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
>>      function.
>>      (TARGET_SELECT_EARLY_REMAT_MODES): Define.
>> 
>> gcc/testsuite/
>>      * gcc.target/aarch64/sve_spill_1.c: Also test that no predicates
>>      are spilled.
>>      * gcc.target/aarch64/sve_spill_2.c: New test.
>>      * gcc.target/aarch64/sve_spill_3.c: Likewise.
>>      * gcc.target/aarch64/sve_spill_4.c: Likewise.
>>      * gcc.target/aarch64/sve_spill_5.c: Likewise.
>>      * gcc.target/aarch64/sve_spill_6.c: Likewise.
>>      * gcc.target/aarch64/sve_spill_7.c: Likewise.
> So it's not the immediate goal here, but it'd be nice if we could extend
> this to cover all the other targets that have no registers of certain
> classes that can be allocated across a call.
>
> Mostly this is exactly what I'd expect -- not to diminish the work,
> there's a ton of stuff here.  But the overall flow of work and
> organization is basically what I'd expect.


That's good.  I wasn't setting out to do anything particularly
innovative here, so the fact that it's not surprising is a positive
sign. :-)

> I know Richi was a bit scared off by the RTL nature, but the vast
> majority of the code here isn't related to RTL -- it's operating on
> properties that got extracted from the RTL.  I could almost take this
> change bits at the start of the pass and the end and run it on gimple
> :-)  Which brings up an interesting question, is that a worthwhile thing
> to ponder?  Is it gimple optimizers that are causing objects to be live
> across calls or is it usually teh RTL bits?

It tends to be a mixture of both.  E.g. if we have two loops with
the same bounds and a call inbetween, the gimple optimisers will reuse
the WHILE_ULT in the first loop header for the second loop.  But there
are also cases introduced in rtl.  E.g. we don't (and IMO shouldn't)
expose to gimple that some SVE operations always require a predicate.
The expanders instead generate all-true predicates where necessary.
These are then an obvious candidate for PRE.

> As I was reading one of the thoughts I had was that this reminded me a
> bit of LCM.  WHich makes sense -- it's code motion after all.  I was
> pleasantly surprised to see the comment explaining why the LCM
> infrastructure was not used.

Yeah, the availablity problem is basically the same.  The main
difference is choosing the final placement based on block frequencies.

> Do you have to do anything about rounding modes?  Or are those not an
> issue for aarch64?  It's something I'm pretty sure we'd need to handle
> for x86 as we have to ensure the rounding mode is stable if we move/copy
> FP computations from one context to another.

Rounding modes don't affect loads, stores or moves, which are just
bit copies.  I suppose there is the problem that we could rematerialise
an FP operation in a region with a different rounding mode, but we just
don't have an IL that tracks that properly (in general, not just here).

>> Index: gcc/early-remat.c
>> ===================================================================
>> --- /dev/null        2017-11-14 14:28:07.424493901 +0000
>> +++ gcc/early-remat.c        2017-11-17 15:56:53.021759920 +0000
>> @@ -0,0 +1,2610 @@
>> +   Candidate validation and value numbering
>> +   ========================================
>> +
>> +   Next we simultaneously decide which candidates are valid and look
>> +   for candidates that are equivalent to each other, assigning numbers
>> +   to each unique candidate value.  A candidate C is invalid if:
>> +
>> +     (a) C uses an invalid candidate;
>> +
>> +     (b) there is a cycle of candidate uses involving C; or
>> +
>> +     (c) C takes a candidate register R as input and the reaching
>> +         definitions of R do not have the same value number.
>> +
>> +   We assign a "representative" candidate C to each value number and from
>> +   here on replace references to other candidates with that value number
>> +   with references to C.  It is then only possible to rematerialize a
>> +   register R at point P if (after this replacement) there is a single
>> +   reaching definition of R at P.
> So how often are you seeing candidates that are equal to each other?  I
> know Vlad did some experiments in the register allocator and the number
> of equivalences seen by VN was relatively small.
>
> Maybe it's the case that there's something fundamentally different with
> what you're doing allowing you to see more equivalences.
>
> Of course if you're using VNs as part of the validation process, then I
> guess you get equivalences for free, so you might as well go ahead and
> use them.

It happened in at least one important case (can't remember which, sorry).
That was enough to get me to implement it without first measuring how
common it was.  Like you say, it almost came for free as part of
validation.

> Local/Global Phases seem reasonable at a high level.  About what you'd
> expect.

Thanks.

>> +/* Return true if REGNO is worth rematerializing.  */
>> +
>> +bool
>> +early_remat::interesting_regno_p (unsigned int regno)
>> +{
>> +  /* Ignore unused registers.  */
>> +  rtx reg = regno_reg_rtx[regno];
>> +  if (!reg || DF_REG_DEF_COUNT (regno) == 0)
>> +    return false;
>> +
>> +  /* Make sure the register has a mode that we want to rematerialize.  */
>> +  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))
>> +    return false;
>> +
>> +  /* Ignore values that might sometimes be used uninitialized.  We could
>> +     instead add dummy candidates for the entry block definition, and so
>> +     handle uses that are definitely not uninitialized, but the combination
>> +     of the two should be rare in practice.  */
>> +  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
>> +    return false;
>> +
>> +  return true;
>> +}
> So triggering on modes isn't really ideal, but I can understand why
> you're doing it that way.  I think it's OK for now, but there may be a
> day when we want to do something different.
>
> That could one day be dealing with an architecture with a painful to
> save register that is accessed in the same mode as other pseudos or
> dealing with a remat pass to reduce pressure.

I agree it might be useful to handle cases that aren't determined
solely by the mode.  It should be easy to plug that in when necessary.

I'm still not sure that this is the right place or algorithm for doing
remat to reduce pressure (but I'd be happy to be proven wrong :-)).

>> +/* Record the set of register REGNO in instruction INSN as a
>> +   rematerialization candidate.  CAN_COPY_P is true unless we already
>> +   know that rematerialization is impossible (in which case the candidate
>> +   only exists for the reaching definition calculation).
>> +
>> +   The candidate's index is not fixed at this stage.  */
>> +
>> +remat_candidate *
>> +early_remat::add_candidate (rtx_insn *insn, unsigned int regno,
>> +                        bool can_copy_p)
>> +{
>> +  remat_candidate cand;
>> +  memset (&cand, 0, sizeof (cand));
>> +  cand.regno = regno;
>> +  cand.insn = insn;
>> +  cand.remat_rtx = PATTERN (insn);
>> +  cand.can_copy_p = can_copy_p;
> FWIW, if the fields you're assigning are at the start or the end of the
> object, DSE will trim the memset to avoid unnecessary write traffic.
> No need to try and rearrange the object to enable that optimization.
> Just making you aware in case such things are of interest to you.

OK.  This particular code isn't performance-sensitive, but it's
definitely something that's worth bearing in mind.

>> +      else if (HARD_REGISTER_NUM_P (use_regno))
>> +    {
>> +      /* Allocate a dummy pseudo register and temporarily install it.
>> +         Make the register number depend on the mode, which should
>> +         provide enough sharing for match_dup while also weeding
>> +         out cases in which operands with different modes are
>> +         explicitly tied.  */
>> +      rtx *loc = DF_REF_REAL_LOC (ref);
>> +      unsigned int size = RTX_CODE_SIZE (REG);
>> +      rtx new_reg = (rtx) alloca (size);
>> +      memset (new_reg, 0, size);
>> +      PUT_CODE (new_reg, REG);
>> +      set_mode_and_regno (new_reg, GET_MODE (*loc),
>> +                          LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));
>> +      validate_change (insn, loc, new_reg, 1);
>> +    }
>> +    }
> Ewww.  But I guess it works OK and you undo it immediately.  There was
> another similar instance later.  They're not ideal, but they're well
> localized and I don't think worth making a big deal out of.

Yeah, it's not pretty, but I don't think we can do much better given
the current "try it and see" approach to insn recognition.

>> +
>> +/* Assuming that every path reaching a point P contains a copy of a
>> +   use U of REGNO, return true if another copy of U at P would have
>> +   access to the same value of REGNO.  */
>> +
>> +bool
>> +early_remat::stable_use_p (unsigned int regno)
>> +{
>> +  /* Conservatively assume not for hard registers.  */
>> +  if (HARD_REGISTER_NUM_P (regno))
>> +    return false;
>> +
>> +  /* See if REGNO has a single definition and is never used uninitialized.
>> +     In this case the definition of REGNO dominates the common dominator
>> +     of the uses U, which in turn dominates P.  */
>> +  if (DF_REG_DEF_COUNT (regno) == 1
>> +      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
>> +    return true;
>> +
>> +  return false;
>> +}
> What about loops here?  The def may dominate the use but the value might
> be changing each iteration of the loop.  Does that affect correctness of
> the algorithm you're using?

No, that's OK.  Since the pre-existing copies of the candidate
instruction CI all use REGNO, and since every path to the rematerialisation
point P includes a copy of CI, there can't be a path from the definition
of REGNO to P that doesn't also include a copy of CI.  So the value of
REGNO at P will always be the correct one.

>> +
>> +/* Emit a copy from DEST to SRC before candidate CAND_INDEX's
>> instruction.  */
>> +
>> +void
>> +early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)
>> +{
>> +  remat_candidate *cand = &m_candidates[cand_index];
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file, ";; Stabilizing insn ");
>> +      dump_insn_id (cand->insn);
>> + fprintf (dump_file, " by copying source reg %d:%s to temporary reg
>> %d\n",
>> +           REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));
>> +    }
>> +  emit_insn_before (gen_move_insn (dest, src), cand->insn);
> Might want to clarify that SRC and DEST are regs.  I was about to ask
> about it, but then saw the mention in the dumping code.

OK.

> Again, generally this is largely what I'd expect.  Just a few questions
> above to clarify a couple minor concerns, but I think this is fine.

Thanks,
Richard

Re: Add an "early rematerialisation" pass

Reply via email to