Richard Biener <rguent...@suse.de> writes:
> On Tue, 12 Nov 2024, Richard Sandiford wrote:
>
>> Sorry for the slow review.  I think Jeff's much better placed to comment
>> on this than I am, but here's a stab.  Mostly it looks really good to me
>> FWIW.
>> 
>> Andrew Carlotti <andrew.carlo...@arm.com> writes:
>> > This pass is used to optimise assignments to the FPMR register in
>> > aarch64.  I chose to implement this as a middle-end pass because it
>> > mostly reuses the existing RTL PRE code within gcse.cc.
>> >
>> > Compared to RTL PRE, the key difference in this new pass is that we
>> > insert new writes directly to the destination hardreg, instead of
>> > writing to a new pseudo-register and copying the result later.  This
>> > requires changes to the analysis portion of the pass, because sets
>> > cannot be moved before existing instructions that set, use or clobber
>> > the hardreg, and the value becomes unavailable after any uses or
>> > clobbers of the hardreg.
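
In rough terms (block and register names illustrative, not taken from the patch), the pass turns the first shape into the second:

```
;; before: each block using an FP8 operation sets FPMR itself
bb2:  fpmr <- x0        bb3:  fpmr <- x0
bb2:  use fpmr          bb3:  use fpmr

;; after hardreg PRE: the partially redundant sets are unified, and the
;; inserted insn writes FPMR directly -- no new pseudo plus a later copy
bb1:  fpmr <- x0
bb2:  use fpmr          bb3:  use fpmr
```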
>> >
>> > This patch would currently break any debug instructions that use the
>> > value of fpmr in a region of code where that value is changed by this
>> > pass.  I haven't worked out the best way to fix this, but I suspect the
>> > issue is uncommon and tricky enough that it would be best to just drop
>> > those debug instructions.
>> 
>> Yeah, good question, and pass on that :)  Will need to think more about it.
>> 
>> > I've bootstrapped and regression tested this on aarch64, and it should be
>> > NFC on other targets.  Aside from this, my testing so far has involved
>> > hacking in a single FP8 intrinsic and testing various parameters and
>> > control flow structures, and checking both the codegen and the LCM
>> > bitmaps.  I intend to write better and more comprehensive tests once there
>> > are some real intrinsic implementations available to use.
>> >
>> >
>> > Is this approach good?  Apart from fixing the debug instructions and
>> > adding tests, is there anything else I need to change?
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> >    * config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
>> >    * gcse.cc (doing_hardreg_pre_p): New global variable.
>> >    (current_hardreg_regno): Ditto.
>> >    (compute_local_properties): Unset transp for hardreg clobbers.
>> >    (prune_hardreg_uses): New.
>> >    (want_to_gcse_p): Always return true for hardreg PRE.
>> >    (hash_scan_set): Add checks for hardreg uses/clobbers.
>> >    (oprs_unchanged_p): Disable load motion for hardreg PRE pass.
>> >    (record_last_mem_set_info): Ditto.
>> >    (compute_hash_table_work): Record hardreg uses.
>> >    (prune_expressions): Mark hardreg sets as call-clobbered.
>> >    (compute_pre_data): Add call to prune_hardreg_uses.
>> >    (pre_expr_reaches_here_p_work): Add comment.
>> >    (insert_insn_start_basic_block): New functions.
>> >    (pre_edge_insert): Don't add hardreg sets to predecessor block.
>> >    (pre_delete): Use hardreg for the reaching reg.
>> >    (pre_gcse): Don't insert copies for hardreg PRE.
>> >    (one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
>> >    (execute_hardreg_pre): New.
>> >    (class pass_hardreg_pre): New.
>> >    (pass_hardreg_pre::gate): New.
>> >    (make_pass_hardreg_pre): New.
>> >    * passes.def (pass_hardreg_pre): New pass.
>> >    * tree-pass.h (make_pass_hardreg_pre): New.
>> >
>> > [...]
>> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
>> >    }
>> >      }
>> >  }
>> > +
>> > +/* A hardreg set is not transparent in a block if there are any uses of that
>> > +   hardreg.  This filters the results of compute_local_properties, after the
>> > +   result of that function has been used to define the kills bitmap.
>> 
>> I think this is mostly my ignorance of the code, and would be obvious
>> if I tried it out locally, but: why do we need to do this after
>> computing the kills bitmap?  For mode-switching, the kills bitmap
>> is the inverse of the transparency bitmap, but it sounds like here
>> you want the kills bitmap to be more selective.
>> 
>> > +
>> > +   TRANSP is the destination sbitmap to be updated.
>> > +
>> > +   TABLE controls which hash table to look at.  */
>> > +
>> > +static void
>> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
>> > +{
>> > +  unsigned int i;
>> > +  gcc_assert (doing_hardreg_pre_p);
>> > +
>> > +  for (i = 0; i < table->size; i++)
>> > +    {
>> > +      struct gcse_expr *expr;
>> > +
>> > +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
>> > +  {
>> > +    int indx = expr->bitmap_index;
>> > +    df_ref def;
>> > +
>> > +    for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
>> > +         def;
>> > +         def = DF_REF_NEXT_REG (def))
>> > +      bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
>> > +  }
>> > +    }
>> > +}
>> >  
>> >  /* Hash table support.  */
>>  
>> 
>> > @@ -747,6 +804,9 @@ static basic_block current_bb;
>> >  static bool
>> >  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
>> >  {
>> > +  if (doing_hardreg_pre_p)
>> > +    return true;
>> > +
>> >  #ifdef STACK_REGS
>> >    /* On register stack architectures, don't GCSE constants from the
>> >       constant pool, as the benefits are often swamped by the overhead
>> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
>> >        }
>> >  
>> >      case MEM:
>> > -      if (! flag_gcse_lm
>> > +      if (! flag_gcse_lm || doing_hardreg_pre_p
>> 
>> This test occurs often enough that I think it's worth splitting out.
>> Something like: !do_load_motion ()?
>> 
>> >      || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
>> >                                 x, avail_p))
>> >    return false;
>> > [...]
>> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
>> >        }
>> >  
>> >      note_stores (insn, record_last_set_info, insn);
>> > +
>> > +    if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
>> > +      {
>> > +        /* We need to record the first use of a hardreg to determine if a
>> > +           set of that hardreg is anticipatable.  */
>> > +        df_ref ref;
>> > +        FOR_EACH_INSN_USE (ref, insn)
>> > +          if (DF_REF_REGNO (ref) == current_hardreg_regno)
>> > +            {
>> > +              hardreg_last_bb = current_bb;
>> > +              hardreg_first_use = DF_INSN_LUID (insn);
>> > +            }
>> > +      }
>> >    }
>> 
>> Couldn't we instead check whether the register is live on entry to the block?
>> That would avoid the extra bit of state.
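
The check being proposed could be phrased roughly as follows (a sketch against GCC's df API, not compiled; it would replace the hardreg_last_bb / hardreg_first_use bookkeeping):

```
/* A set of the hardreg in this block is not anticipatable at block
   entry if the hardreg is live on entry: some insn consumes the
   incoming value before any set in the block.  */
if (REGNO_REG_SET_P (df_get_live_in (current_bb), current_hardreg_regno))
  /* ...treat sets of current_hardreg_regno in this block as not
     anticipatable at block entry...  */
```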
>> 
>> >  
>> >        /* The next pass builds the hash table.  */
>> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
>> >      {
>> >        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
>> >    {
>> > +    /* For hardreg pre, we assume that all relevant hardregs are
>> > +       call-clobbered, and set all bits in prune_exprs if the reg is call
>> > +       clobbered.
>> 
>> Not sure I understand this.  But...
>> 
>> >                        If the hardreg were merely call-used, then we would
>> > +       need to remove the expression from the anticipatable and
>> > +       transparent bitmaps only (after using this to compute the kills
>> > +       bitmap).  */
>> > +
>> > +    if (doing_hardreg_pre_p)
>> > +      {
>> > +        bitmap_set_bit (prune_exprs, expr->bitmap_index);
>> > +        continue;
>> > +      }
>> > +
>> 
>> ...the effect seems to be to set every bit of prune_exprs, in which
>> case it might be easier to skip this loop entirely and adjust the later
>> one to use bitmap_set_range.
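
Concretely, the suggestion could amount to something like this (sketch only; bitmap_set_range and the hash-table fields are as in gcse.cc, but the exact placement is assumed):

```
if (doing_hardreg_pre_p)
  /* Every expression sets the hardreg and is treated as call-clobbered,
     so mark the whole range at once instead of looping over the table.  */
  bitmap_set_range (prune_exprs, 0, expr_hash_table.n_elems);
```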
>> 
>> >      /* Note potentially trapping expressions.  */
>> >      if (may_trap_p (expr->expr))
>> >        {
>> > [...]
>> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
>> >    return 0;
>> >  }
>> >  
>> > +static unsigned int
>> > +execute_hardreg_pre (void)
>> > +{
>> > +  doing_hardreg_pre_p = true;
>> > +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
>> > +  /* It's possible to avoid this loop, but it isn't worth doing so until
>> > +     hardreg PRE is used for multiple hardregs.  */
>> 
>> Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
>> to structure the code so that this just works?  Where are the main
>> difficulties?  Having to maintain a list of which expressions are
>> associated with which register, and therefore which expressions
>> mutually kill each other?
>> 
>> > +  for (int i = 0; regnos[i] != 0; i++)
>> > +    {
>> > +      int changed;
>> > +      current_hardreg_regno = regnos[i];
>> > +      if (dump_file)
>> > +  fprintf (dump_file, "Entering hardreg PRE for regno %d\n",
>> > +          current_hardreg_regno);
>> > +      delete_unreachable_blocks ();
>> > +      df_analyze ();
>> > +      changed = one_pre_gcse_pass ();
>> > +      flag_rerun_cse_after_global_opts |= changed;
>> 
>> Is this appropriate for the new pass?  We're not really exposing general
>> CSE opportunities.
>> 
>> > +      if (changed)
>> > +  cleanup_cfg (0);
>> > +    }
>> > +  doing_hardreg_pre_p = false;
>> > +  return 0;
>> > +}
>> > +
>> >  static unsigned int
>> >  execute_rtl_hoist (void)
>> >  {
>> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
>> >  
>> >  namespace {
>> >  
>> > +const pass_data pass_data_hardreg_pre =
>> > +{
>> > +  RTL_PASS, /* type */
>> > +  "hardreg_pre", /* name */
>> > +  OPTGROUP_NONE, /* optinfo_flags */
>> > +  TV_PRE, /* tv_id */
>> > +  PROP_cfglayout, /* properties_required */
>> > +  0, /* properties_provided */
>> > +  0, /* properties_destroyed */
>> > +  0, /* todo_flags_start */
>> > +  TODO_df_finish, /* todo_flags_finish */
>> > +};
>> > +
>> > +class pass_hardreg_pre : public rtl_opt_pass
>> > +{
>> > +public:
>> > +  pass_hardreg_pre (gcc::context *ctxt)
>> > +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
>> > +  {}
>> > +
>> > +  /* opt_pass methods: */
>> > +  bool gate (function *) final override;
>> > +  unsigned int execute (function *)  final override
>> > +  {
>> > +    return execute_hardreg_pre ();
>> > +  }
>> > +
>> > +}; // class pass_hardreg_pre
>> > +
>> > +bool
>> > +pass_hardreg_pre::gate (function *fun)
>> > +{
>> > +#ifdef HARDREG_PRE_REGNOS
>> > +  return optimize > 0
>> > +    && !fun->calls_setjmp;
>> 
>> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
>> have expected them to be needed now, since the modern cfg framework
>> should represent setjmp correctly.  Jeff, do you agree?  I'll try
>> removing them and see what breaks...
>
> The problem with setjmp on RTL is that we fail to preserve abnormal
> edges during RTL expansion and fail in the impossible job to recover
> all that are required for correctness (to prevent code motion).

Ah, ok.  I'd wrongly thought that that was a solved problem now.

(Kind-of curious why it's impossible to recover the info.  I guess that's
a bit of a distraction though.  Leaning further into reconstructing
rather than preserving the cfg would be the wrong direction anyway.)

Thanks,
Richard

> With PRE that only operates on hardregs and in particular does not
> introduce alternate register usages or move memory ops there might
> be no issue.
>
> Richard.
>
>> Thanks,
>> Richard
>> 
>> > +#else
>> > +  return false;
>> > +#endif
>> > +}
>> > +
>> > +} // anon namespace
>> > +
>> > +rtl_opt_pass *
>> > +make_pass_hardreg_pre (gcc::context *ctxt)
>> > +{
>> > +  return new pass_hardreg_pre (ctxt);
>> > +}
>> > +
>> > +namespace {
>> > +
>> >  const pass_data pass_data_rtl_hoist =
>> >  {
>> >    RTL_PASS, /* type */
>> > diff --git a/gcc/passes.def b/gcc/passes.def
>> > index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
>> > --- a/gcc/passes.def
>> > +++ b/gcc/passes.def
>> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
>> >        NEXT_PASS (pass_rtl_cprop);
>> >        NEXT_PASS (pass_rtl_pre);
>> >        NEXT_PASS (pass_rtl_hoist);
>> > +      NEXT_PASS (pass_hardreg_pre);
>> >        NEXT_PASS (pass_rtl_cprop);
>> >        NEXT_PASS (pass_rtl_store_motion);
>> >        NEXT_PASS (pass_cse_after_global_opts);
>> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
>> > index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
>> > --- a/gcc/tree-pass.h
>> > +++ b/gcc/tree-pass.h
>> > @@ -572,6 +573,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
>> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
>> 
