On Tue, 12 Nov 2024, Richard Sandiford wrote:

> Sorry for the slow review.  I think Jeff's much better placed to comment
> on this than I am, but here's a stab.  Mostly it looks really good to me
> FWIW.
> 
> Andrew Carlotti <andrew.carlo...@arm.com> writes:
> > This pass is used to optimise assignments to the FPMR register in
> > aarch64.  I chose to implement this as a middle-end pass because it
> > mostly reuses the existing RTL PRE code within gcse.cc.
> >
> > Compared to RTL PRE, the key difference in this new pass is that we
> > insert new writes directly to the destination hardreg, instead of
> > writing to a new pseudo-register and copying the result later.  This
> > requires changes to the analysis portion of the pass, because sets
> > cannot be moved before existing instructions that set, use or clobber
> > the hardreg, and the value becomes unavailable after any uses or
> > clobbers of the hardreg.
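> 
> Just to check my understanding: the extra restriction is roughly that a
> candidate set of the hard register cannot be treated as transparent,
> anticipatable or available across any insn that already references that
> register.  In sketch form (illustrative only, not code from the patch,
> and the helper name is made up):
> 
>   /* Return true if INSN sets, uses or clobbers hard register REGNO,
>      i.e. a new set of REGNO cannot be moved across INSN.  */
>   static bool
>   insn_references_hardreg_p (rtx_insn *insn, unsigned int regno)
>   {
>     df_ref ref;
>     FOR_EACH_INSN_DEF (ref, insn)
>       if (DF_REF_REGNO (ref) == regno)
>         return true;
>     FOR_EACH_INSN_USE (ref, insn)
>       if (DF_REF_REGNO (ref) == regno)
>         return true;
>     return false;
>   }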
> >
> > This patch would currently break any debug instructions that use the
> > value of fpmr in a region of code where that value is changed by this
> > pass.  I haven't worked out the best way to fix this, but I suspect the
> > issue is uncommon and tricky enough that it would be best to just drop
> > those debug instructions.
> 
> Yeah, good question, and pass on that :)  Will need to think more about it.
> 
> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
> > on other targets.  Aside from this, my testing so far has involved hacking in a
> > single FP8 intrinsic and testing various parameters and control flow
> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
> > write better and more comprehensive tests once there are some real intrinsic
> > implementations available to use.
> >
> >
> > Is this approach good?  Apart from fixing the debug instructions and
> > adding tests, is there anything else I need to change?
> >
> >
> > gcc/ChangeLog:
> >
> >     * config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> >     * gcse.cc (doing_hardreg_pre_p): New global variable.
> >     (current_hardreg_regno): Ditto.
> >     (compute_local_properties): Unset transp for hardreg clobbers.
> >     (prune_hardreg_uses): New.
> >     (want_to_gcse_p): Always return true for hardreg PRE.
> >     (hash_scan_set): Add checks for hardreg uses/clobbers.
> >     (oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> >     (record_last_mem_set_info): Ditto.
> >     (compute_hash_table_work): Record hardreg uses.
> >     (prune_expressions): Mark hardreg sets as call-clobbered.
> >     (compute_pre_data): Add call to prune_hardreg_uses.
> >     (pre_expr_reaches_here_p_work): Add comment.
> >     (insert_insn_start_basic_block): New functions.
> >     (pre_edge_insert): Don't add hardreg sets to predecessor block.
> >     (pre_delete): Use hardreg for the reaching reg.
> >     (pre_gcse): Don't insert copies for hardreg PRE.
> >     (one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> >     (execute_hardreg_pre): New.
> >     (class pass_hardreg_pre): New.
> >     (pass_hardreg_pre::gate): New.
> >     (make_pass_hardreg_pre): New.
> >     * passes.def (pass_hardreg_pre): New pass.
> >     * tree-pass.h (make_pass_hardreg_pre): New.
> >
> > [...]
> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >     }
> >      }
> >  }
> > +
> > +/* A hardreg set is not transparent in a block if there are any uses of that
> > +   hardreg.  This filters the results of compute_local_properties, after the
> > +   result of that function has been used to define the kills bitmap.
> 
> I think this is mostly my ignorance of the code, and would be obvious
> if I tried it out locally, but: why do we need to do this after
> computing the kills bitmap?  For mode-switching, the kills bitmap
> is the inverse of the transparency bitmap, but it sounds like here
> you want the kills bitmap to be more selective.
> 
> > +
> > +   TRANSP is the destination sbitmap to be updated.
> > +
> > +   TABLE controls which hash table to look at.  */
> > +
> > +static void
> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> > +{
> > +  unsigned int i;
> > +  gcc_assert (doing_hardreg_pre_p);
> > +
> > +  for (i = 0; i < table->size; i++)
> > +    {
> > +      struct gcse_expr *expr;
> > +
> > +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> > +   {
> > +     int indx = expr->bitmap_index;
> > +     df_ref def;
> > +
> > +     for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
> > +          def;
> > +          def = DF_REF_NEXT_REG (def))
> > +       bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> > +   }
> > +    }
> > +}
> >  
> >  /* Hash table support.  */
>  
> 
> > @@ -747,6 +804,9 @@ static basic_block current_bb;
> >  static bool
> >  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
> >  {
> > +  if (doing_hardreg_pre_p)
> > +    return true;
> > +
> >  #ifdef STACK_REGS
> >    /* On register stack architectures, don't GCSE constants from the
> >       constant pool, as the benefits are often swamped by the overhead
> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
> >        }
> >  
> >      case MEM:
> > -      if (! flag_gcse_lm
> > +      if (! flag_gcse_lm || doing_hardreg_pre_p
> 
> This test occurs often enough that I think it's worth splitting out.
> Something like: !do_load_motion ()?
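> 
> For instance (just a sketch of the suggested helper; the exact name and
> placement are only suggestions):
> 
>   /* Return true if this pass should perform load motion.  */
>   static bool
>   do_load_motion (void)
>   {
>     return flag_gcse_lm && !doing_hardreg_pre_p;
>   }
> 
> so that the MEM case above becomes "if (!do_load_motion () || ...)".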
> 
> >       || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
> >                                  x, avail_p))
> >     return false;
> > [...]
> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
> >         }
> >  
> >       note_stores (insn, record_last_set_info, insn);
> > +
> > +     if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> > +       {
> > +         /* We need to record the first use of a hardreg to determine if a
> > +            set of that hardreg is anticipatable.  */
> > +         df_ref ref;
> > +         FOR_EACH_INSN_USE (ref, insn)
> > +           if (DF_REF_REGNO (ref) == current_hardreg_regno)
> > +             {
> > +               hardreg_last_bb = current_bb;
> > +               hardreg_first_use = DF_INSN_LUID (insn);
> > +             }
> > +       }
> >     }
> 
> Couldn't we instead check whether the register is live on entry to the block?
> That would avoid the extra bit of state.
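> 
> i.e. something along these lines, perhaps (an untested sketch; it assumes
> the df live-in sets are up to date at this point, and that "live on entry"
> is the right condition for ruling out anticipatability here):
> 
>   /* A new set of the hard register cannot be anticipated at the start of
>      a block in which the register is live on entry, since inserting the
>      set there would clobber a value that is still needed.  */
>   if (doing_hardreg_pre_p
>       && bitmap_bit_p (df_get_live_in (current_bb), current_hardreg_regno))
>     ...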
> 
> >  
> >        /* The next pass builds the hash table.  */
> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
> >      {
> >       for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
> >     {
> > +     /* For hardreg pre, we assume that all relevant hardregs are
> > +        call-clobbered, and set all bits in prune_exprs if the reg is call
> > +        clobbered.
> 
> Not sure I understand this.  But...
> 
> >                        If the hardreg were merely call-used, then we would
> > +        need to remove the expression from the anticipatable and
> > +        transparent bitmaps only (after using this to compute the kills
> > +        bitmap).  */
> > +
> > +     if (doing_hardreg_pre_p)
> > +       {
> > +         bitmap_set_bit (prune_exprs, expr->bitmap_index);
> > +         continue;
> > +       }
> > +
> 
> ...the effect seems to be to set every bit of prune_exprs, in which
> case it might be easier to skip this loop entirely and adjust the later
> one to use bitmap_set_range.
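> 
> i.e. for the hardreg case you could skip the loop entirely and just do
> something like (sketch only):
> 
>   if (doing_hardreg_pre_p)
>     bitmap_set_range (prune_exprs, 0, expr_hash_table.n_elems);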
> 
> >       /* Note potentially trapping expressions.  */
> >       if (may_trap_p (expr->expr))
> >         {
> > [...]
> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
> >    return 0;
> >  }
> >  
> > +static unsigned int
> > +execute_hardreg_pre (void)
> > +{
> > +  doing_hardreg_pre_p = true;
> > +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> > +  /* It's possible to avoid this loop, but it isn't worth doing so until
> > +     hardreg PRE is used for multiple hardregs.  */
> 
> Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
> to structure the code so that this just works?  Where are the main
> difficulties?  Having to maintain a list of which expressions are
> associated with which register, and therefore which expressions
> mutually kill each other?
> 
> > +  for (int i = 0; regnos[i] != 0; i++)
> > +    {
> > +      int changed;
> > +      current_hardreg_regno = regnos[i];
> > +      if (dump_file)
> > +   fprintf (dump_file, "Entering hardreg PRE for regno %d\n",
> > +            current_hardreg_regno);
> > +      delete_unreachable_blocks ();
> > +      df_analyze ();
> > +      changed = one_pre_gcse_pass ();
> > +      flag_rerun_cse_after_global_opts |= changed;
> 
> Is this appropriate for the new pass?  We're not really exposing general
> CSE opportunities.
> 
> > +      if (changed)
> > +   cleanup_cfg (0);
> > +    }
> > +  doing_hardreg_pre_p = false;
> > +  return 0;
> > +}
> > +
> >  static unsigned int
> >  execute_rtl_hoist (void)
> >  {
> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
> >  
> >  namespace {
> >  
> > +const pass_data pass_data_hardreg_pre =
> > +{
> > +  RTL_PASS, /* type */
> > +  "hardreg_pre", /* name */
> > +  OPTGROUP_NONE, /* optinfo_flags */
> > +  TV_PRE, /* tv_id */
> > +  PROP_cfglayout, /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  TODO_df_finish, /* todo_flags_finish */
> > +};
> > +
> > +class pass_hardreg_pre : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_hardreg_pre (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  bool gate (function *) final override;
> > +  unsigned int execute (function *)  final override
> > +  {
> > +    return execute_hardreg_pre ();
> > +  }
> > +
> > +}; // class pass_hardreg_pre
> > +
> > +bool
> > +pass_hardreg_pre::gate (function *fun)
> > +{
> > +#ifdef HARDREG_PRE_REGNOS
> > +  return optimize > 0
> > +    && !fun->calls_setjmp;
> 
> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
> have expected them to be needed now, since the modern cfg framework
> should represent setjmp correctly.  Jeff, do you agree?  I'll try
> removing them and see what breaks...

The problem with setjmp on RTL is that we fail to preserve abnormal
edges during RTL expansion, and recovering all of the edges required
for correctness (to prevent invalid code motion) afterwards is an
impossible job.  With a PRE that only operates on hardregs, and in
particular does not introduce alternate register usages or move memory
ops, there might be no issue.

Richard.

> Thanks,
> Richard
> 
> > +#else
> > +  return false;
> > +#endif
> > +}
> > +
> > +} // anon namespace
> > +
> > +rtl_opt_pass *
> > +make_pass_hardreg_pre (gcc::context *ctxt)
> > +{
> > +  return new pass_hardreg_pre (ctxt);
> > +}
> > +
> > +namespace {
> > +
> >  const pass_data pass_data_rtl_hoist =
> >  {
> >    RTL_PASS, /* type */
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
> >        NEXT_PASS (pass_rtl_cprop);
> >        NEXT_PASS (pass_rtl_pre);
> >        NEXT_PASS (pass_rtl_hoist);
> > +      NEXT_PASS (pass_hardreg_pre);
> >        NEXT_PASS (pass_rtl_cprop);
> >        NEXT_PASS (pass_rtl_store_motion);
> >        NEXT_PASS (pass_cse_after_global_opts);
> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> > index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> > --- a/gcc/tree-pass.h
> > +++ b/gcc/tree-pass.h
> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
