On Wed, 13 Nov 2024, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Tue, 12 Nov 2024, Richard Sandiford wrote:
> >
> >> Sorry for the slow review.  I think Jeff's much better placed to comment
> >> on this than I am, but here's a stab.  Mostly it looks really good to me
> >> FWIW.
> >> 
> >> Andrew Carlotti <andrew.carlo...@arm.com> writes:
> >> > This pass is used to optimise assignments to the FPMR register in
> >> > aarch64.  I chose to implement this as a middle-end pass because it
> >> > mostly reuses the existing RTL PRE code within gcse.cc.
> >> >
> >> > Compared to RTL PRE, the key difference in this new pass is that we
> >> > insert new writes directly to the destination hardreg, instead of
> >> > writing to a new pseudo-register and copying the result later.  This
> >> > requires changes to the analysis portion of the pass, because sets
> >> > cannot be moved before existing instructions that set, use or clobber
> >> > the hardreg, and the value becomes unavailable after any uses or
> >> > clobbers of the hardreg.
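As a language-neutral illustration of those movement and availability rules (not part of the patch): the sketch below models a single block's instruction stream as a list of hypothetical 'set'/'use'/'clobber' events on the hardreg and computes the local properties the paragraph describes.  The event encoding and function name are invented for the example.

```python
def local_props(events, val):
    """Local PRE properties for one candidate value VAL of a hard register.

    EVENTS lists, in order, the instructions in a block that touch the
    hardreg, each encoded as ('set', v), ('use',) or ('clobber',).

    antloc: a set of VAL is anticipatable at block entry only if it is the
            first thing that touches the hardreg (a new set cannot be moved
            before an existing set, use or clobber).
    transp: the block is transparent only if nothing touches the hardreg.
    avail:  VAL is available at block exit only if the last event is a set
            of VAL; any later use or clobber makes it unavailable.
    """
    antloc = bool(events) and events[0] == ('set', val)
    transp = not events
    avail = False
    for ev in events:
        # A set of VAL (re)establishes the value; anything else kills it.
        avail = ev == ('set', val)
    return antloc, transp, avail


# A use before the set blocks anticipation; a use after the set kills
# availability; an untouched block is transparent.
print(local_props([('use',), ('set', 'v')], 'v'))   # (False, False, True)
print(local_props([('set', 'v'), ('use',)], 'v'))   # (True, False, False)
print(local_props([], 'v'))                         # (False, True, False)
```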
> >> >
> >> > This patch would currently break any debug instructions that use the
> >> > value of fpmr in a region of code where that value is changed by this
> >> > pass.  I haven't worked out the best way to fix this, but I suspect the
> >> > issue is uncommon and tricky enough that it would be best to just drop
> >> > those debug instructions.
> >> 
> >> Yeah, good question, and pass on that :)  Will need to think more about it.
> >> 
> >> > I've bootstrapped and regression tested this on aarch64, and it should
> >> > be NFC on other targets.  Aside from this, my testing so far has
> >> > involved hacking in a single FP8 intrinsic and testing various
> >> > parameters and control flow structures, and checking both the codegen
> >> > and the LCM bitmaps.  I intend to write better and more comprehensive
> >> > tests once there are some real intrinsic implementations available to
> >> > use.
> >> >
> >> >
> >> > Is this approach good?  Apart from fixing the debug instructions and
> >> > adding tests, is there anything else I need to change?
> >> >
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> >  * config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> >> >  * gcse.cc (doing_hardreg_pre_p): New global variable.
> >> >  (current_hardreg_regno): Ditto.
> >> >  (compute_local_properties): Unset transp for hardreg clobbers.
> >> >  (prune_hardreg_uses): New.
> >> >  (want_to_gcse_p): Always return true for hardreg PRE.
> >> >  (hash_scan_set): Add checks for hardreg uses/clobbers.
> >> >  (oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> >> >  (record_last_mem_set_info): Ditto.
> >> >  (compute_hash_table_work): Record hardreg uses.
> >> >  (prune_expressions): Mark hardreg sets as call-clobbered.
> >> >  (compute_pre_data): Add call to prune_hardreg_uses.
> >> >  (pre_expr_reaches_here_p_work): Add comment.
> >> >  (insert_insn_start_basic_block): New functions.
> >> >  (pre_edge_insert): Don't add hardreg sets to predecessor block.
> >> >  (pre_delete): Use hardreg for the reaching reg.
> >> >  (pre_gcse): Don't insert copies for hardreg PRE.
> >> >  (one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> >> >  (execute_hardreg_pre): New.
> >> >  (class pass_hardreg_pre): New.
> >> >  (pass_hardreg_pre::gate): New.
> >> >  (make_pass_hardreg_pre): New.
> >> >  * passes.def (pass_hardreg_pre): New pass.
> >> >  * tree-pass.h (make_pass_hardreg_pre): New.
> >> >
> >> > [...]
> >> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >> >          }
> >> >      }
> >> >  }
> >> > +
> >> > +/* A hardreg set is not transparent in a block if there are any uses of that
> >> > +   hardreg.  This filters the results of compute_local_properties, after the
> >> > +   result of that function has been used to define the kills bitmap.
> >> 
> >> I think this is mostly my ignorance of the code, and would be obvious
> >> if I tried it out locally, but: why do we need to do this after
> >> computing the kills bitmap?  For mode-switching, the kills bitmap
> >> is the inverse of the transparency bitmap, but it sounds like here
> >> you want the kills bitmap to be more selective.
> >> 
> >> > +
> >> > +   TRANSP is the destination sbitmap to be updated.
> >> > +
> >> > +   TABLE controls which hash table to look at.  */
> >> > +
> >> > +static void
> >> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> >> > +{
> >> > +  unsigned int i;
> >> > +  gcc_assert (doing_hardreg_pre_p);
> >> > +
> >> > +  for (i = 0; i < table->size; i++)
> >> > +    {
> >> > +      struct gcse_expr *expr;
> >> > +
> >> > +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> >> > +        {
> >> > +          int indx = expr->bitmap_index;
> >> > +          df_ref def;
> >> > +
> >> > +          for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
> >> > +               def;
> >> > +               def = DF_REF_NEXT_REG (def))
> >> > +            bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> >> > +        }
> >> > +    }
> >> > +}
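For readers unfamiliar with the sbitmap side, here is a small sketch of what this pruning amounts to, with Python sets standing in for the per-block transparency bitmaps (the block numbers and expression indices are made up for the example):

```python
def prune_hardreg_uses(transp, expr_indices, use_blocks):
    """Clear every candidate expression's transparency bit in each block
    that contains a use of the hardreg.  TRANSP maps a basic-block index
    to the set of expression indices still transparent in that block."""
    for bb in use_blocks:
        transp[bb] -= set(expr_indices)


# Three blocks, two candidate expressions; only block 1 uses the hardreg,
# so only block 1 loses its transparency bits.
transp = {0: {0, 1}, 1: {0, 1}, 2: {1}}
prune_hardreg_uses(transp, [0, 1], use_blocks=[1])
print(transp)  # {0: {0, 1}, 1: set(), 2: {1}}
```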
> >> >  
> >> >  /* Hash table support.  */
> >>  
> >> 
> >> > @@ -747,6 +804,9 @@ static basic_block current_bb;
> >> >  static bool
> >> > want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
> >> >  {
> >> > +  if (doing_hardreg_pre_p)
> >> > +    return true;
> >> > +
> >> >  #ifdef STACK_REGS
> >> >    /* On register stack architectures, don't GCSE constants from the
> >> >       constant pool, as the benefits are often swamped by the overhead
> >> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
> >> >        }
> >> >  
> >> >      case MEM:
> >> > -      if (! flag_gcse_lm
> >> > +      if (! flag_gcse_lm || doing_hardreg_pre_p
> >> 
> >> This test occurs often enough that I think it's worth splitting out.
> >> Something like: !do_load_motion ()?
> >> 
> >> >            || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
> >> >                                       x, avail_p))
> >> >          return false;
> >> > [...]
> >> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
> >> >              }
> >> >  
> >> >            note_stores (insn, record_last_set_info, insn);
> >> > +
> >> > +          if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> >> > +            {
> >> > +              /* We need to record the first use of a hardreg to determine if a
> >> > +                 set of that hardreg is anticipatable.  */
> >> > +              df_ref ref;
> >> > +              FOR_EACH_INSN_USE (ref, insn)
> >> > +                if (DF_REF_REGNO (ref) == current_hardreg_regno)
> >> > +                  {
> >> > +                    hardreg_last_bb = current_bb;
> >> > +                    hardreg_first_use = DF_INSN_LUID (insn);
> >> > +                  }
> >> > +            }
> >> >          }
> >> 
> >> Couldn't we instead check whether the register is live on entry to the 
> >> block?
> >> That would avoid the extra bit of state.
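The equivalence being suggested can be sketched as follows (illustrative Python, not GCC code; in GCC one would instead test the DF live-in bitmap for the block).  A hardreg has an upward-exposed use in a block, and therefore appears in the block's live-in set, exactly when some instruction reads it before any instruction writes or clobbers it:

```python
def live_on_entry(insns, reg):
    """True iff REG is read before being written in this block, i.e. the
    block has an upward-exposed use of REG, so REG is in its live-in set.
    INSNS is an ordered list of (kind, reg) pairs with kind in
    {'use', 'set', 'clobber'}."""
    for kind, r in insns:
        if r != reg:
            continue
        if kind == 'use':
            return True      # read before any write: live on entry
        return False         # 'set' or 'clobber' happens before any read


    return False


# With this check there is no need to remember the LUID of the first use.
print(live_on_entry([('use', 'fpmr'), ('set', 'fpmr')], 'fpmr'))  # True
print(live_on_entry([('set', 'fpmr'), ('use', 'fpmr')], 'fpmr'))  # False
```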
> >> 
> >> >  
> >> >        /* The next pass builds the hash table.  */
> >> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
> >> >      {
> >> >        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
> >> >          {
> >> > +          /* For hardreg pre, we assume that all relevant hardregs are
> >> > +             call-clobbered, and set all bits in prune_exprs if the reg is
> >> > +             call clobbered.
> >> 
> >> Not sure I understand this.  But...
> >> 
> >> >                        If the hardreg were merely call-used, then we would
> >> > +             need to remove the expression from the anticipatable and
> >> > +             transparent bitmaps only (after using this to compute the
> >> > +             kills bitmap).  */
> >> > +
> >> > +          if (doing_hardreg_pre_p)
> >> > +            {
> >> > +              bitmap_set_bit (prune_exprs, expr->bitmap_index);
> >> > +              continue;
> >> > +            }
> >> > +
> >> 
> >> ...the effect seems to be to set every bit of prune_exprs, in which
> >> case it might be easier to skip this loop entirely and adjust the later
> >> one to use bitmap_set_range.
> >> 
> >> >            /* Note potentially trapping expressions.  */
> >> >            if (may_trap_p (expr->expr))
> >> >              {
> >> > [...]
> >> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
> >> >    return 0;
> >> >  }
> >> >  
> >> > +static unsigned int
> >> > +execute_hardreg_pre (void)
> >> > +{
> >> > +  doing_hardreg_pre_p = true;
> >> > +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> >> > +  /* It's possible to avoid this loop, but it isn't worth doing so until
> >> > +     hardreg PRE is used for multiple hardregs.  */
> >> 
> >> Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
> >> to structure the code so that this just works?  Where are the main
> >> difficulties?  Having to maintain a list of which expressions are
> >> associated with which register, and therefore which expressions
> >> mutually kill each other?
> >> 
> >> > +  for (int i = 0; regnos[i] != 0; i++)
> >> > +    {
> >> > +      int changed;
> >> > +      current_hardreg_regno = regnos[i];
> >> > +      if (dump_file)
> >> > +        fprintf (dump_file, "Entering hardreg PRE for regno %d\n",
> >> > +                current_hardreg_regno);
> >> > +      delete_unreachable_blocks ();
> >> > +      df_analyze ();
> >> > +      changed = one_pre_gcse_pass ();
> >> > +      flag_rerun_cse_after_global_opts |= changed;
> >> 
> >> Is this appropriate for the new pass?  We're not really exposing general
> >> CSE opportunities.
> >> 
> >> > +      if (changed)
> >> > +        cleanup_cfg (0);
> >> > +    }
> >> > +  doing_hardreg_pre_p = false;
> >> > +  return 0;
> >> > +}
> >> > +
> >> >  static unsigned int
> >> >  execute_rtl_hoist (void)
> >> >  {
> >> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
> >> >  
> >> >  namespace {
> >> >  
> >> > +const pass_data pass_data_hardreg_pre =
> >> > +{
> >> > +  RTL_PASS, /* type */
> >> > +  "hardreg_pre", /* name */
> >> > +  OPTGROUP_NONE, /* optinfo_flags */
> >> > +  TV_PRE, /* tv_id */
> >> > +  PROP_cfglayout, /* properties_required */
> >> > +  0, /* properties_provided */
> >> > +  0, /* properties_destroyed */
> >> > +  0, /* todo_flags_start */
> >> > +  TODO_df_finish, /* todo_flags_finish */
> >> > +};
> >> > +
> >> > +class pass_hardreg_pre : public rtl_opt_pass
> >> > +{
> >> > +public:
> >> > +  pass_hardreg_pre (gcc::context *ctxt)
> >> > +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> >> > +  {}
> >> > +
> >> > +  /* opt_pass methods: */
> >> > +  bool gate (function *) final override;
> >> > +  unsigned int execute (function *)  final override
> >> > +  {
> >> > +    return execute_hardreg_pre ();
> >> > +  }
> >> > +
> >> > +}; // class pass_hardreg_pre
> >> > +
> >> > +bool
> >> > +pass_hardreg_pre::gate (function *fun)
> >> > +{
> >> > +#ifdef HARDREG_PRE_REGNOS
> >> > +  return optimize > 0
> >> > +    && !fun->calls_setjmp;
> >> 
> >> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
> >> have expected them to be needed now, since the modern cfg framework
> >> should represent setjmp correctly.  Jeff, do you agree?  I'll try
> >> removing them and see what breaks...
> >
> > The problem with setjmp on RTL is that we fail to preserve abnormal
> > edges during RTL expansion and fail in the impossible job to recover
> > all that are required for correctness (to prevent code motion).
> 
> Ah, ok.  I'd wrongly thought that that was a solved problem now.
> 
> (Kind-of curious why it's impossible to recover the info.  I guess that's
> a bit of a distraction though.  Leaning further into reconstructing
> rather than preserving the cfg would be the wrong direction anyway.)

Let me refer you to PR57067; I'm not sure anybody has spent serious
time trying to fix reconstruction.  It should be possible to recover
a conservative set of abnormal edges, but on the GIMPLE CFG side we
already start with the required set before inlining and manage to keep
a more optimal set than one would conservatively compute _after_
inlining.  You'd lose that, so I believe the best thing would be to
_not_ throw away abnormal edges, but at least preserve whether the
currently expanding block had any outgoing one, and the set of target
blocks.  The other issue is that the target of abnormal edges has to
be adjusted to the sub-block which will actually receive it.

Richard.

> Thanks,
> Richard
> 
> > With PRE that only operates on hardregs and in particular does not
> > introduce alternate register usages or move memory ops there might
> > be no issue.
> >
> > Richard.
> >
> >> Thanks,
> >> Richard
> >> 
> >> > +#else
> >> > +  return false;
> >> > +#endif
> >> > +}
> >> > +
> >> > +} // anon namespace
> >> > +
> >> > +rtl_opt_pass *
> >> > +make_pass_hardreg_pre (gcc::context *ctxt)
> >> > +{
> >> > +  return new pass_hardreg_pre (ctxt);
> >> > +}
> >> > +
> >> > +namespace {
> >> > +
> >> >  const pass_data pass_data_rtl_hoist =
> >> >  {
> >> >    RTL_PASS, /* type */
> >> > diff --git a/gcc/passes.def b/gcc/passes.def
> >> > index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> >> > --- a/gcc/passes.def
> >> > +++ b/gcc/passes.def
> >> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
> >> >        NEXT_PASS (pass_rtl_cprop);
> >> >        NEXT_PASS (pass_rtl_pre);
> >> >        NEXT_PASS (pass_rtl_hoist);
> >> > +      NEXT_PASS (pass_hardreg_pre);
> >> >        NEXT_PASS (pass_rtl_cprop);
> >> >        NEXT_PASS (pass_rtl_store_motion);
> >> >        NEXT_PASS (pass_cse_after_global_opts);
> >> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> >> > index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> >> > --- a/gcc/tree-pass.h
> >> > +++ b/gcc/tree-pass.h
> >> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> >> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
> >> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
