Sorry for the slow review.  I think Jeff's much better placed to comment
on this than I am, but here's a stab.  Mostly it looks really good to me
FWIW.

Andrew Carlotti <andrew.carlo...@arm.com> writes:
> This pass is used to optimise assignments to the FPMR register in
> aarch64.  I chose to implement this as a middle-end pass because it
> mostly reuses the existing RTL PRE code within gcse.cc.
>
> Compared to RTL PRE, the key difference in this new pass is that we
> insert new writes directly to the destination hardreg, instead of
> writing to a new pseudo-register and copying the result later.  This
> requires changes to the analysis portion of the pass, because sets
> cannot be moved before existing instructions that set, use or clobber
> the hardreg, and the value becomes unavailable after any uses or
> clobbers of the hardreg.
>
> This patch would currently break any debug instructions that use the
> value of fpmr in a region of code where that value is changed by this
> pass.  I haven't worked out the best way to fix this, but I suspect the
> issue is uncommon and tricky enough that it would be best to just drop
> those debug instructions.

Yeah, good question, and pass on that :)  Will need to think more about it.

> I've bootstrapped and regression tested this on aarch64, and it should be NFC
> on other targets.  Aside from this, my testing so far has involved hacking
> in a single FP8 intrinsic and testing various parameters and control flow
> structures, and checking both the codegen and the LCM bitmaps.  I intend to
> write better and more comprehensive tests once there are some real intrinsic
> implementations available to use.
>
>
> Is this approach good?  Apart from fixing the debug instructions and
> adding tests, is there anything else I need to change?
>
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
>       * gcse.cc (doing_hardreg_pre_p): New global variable.
>       (current_hardreg_regno): Ditto.
>       (compute_local_properties): Unset transp for hardreg clobbers.
>       (prune_hardreg_uses): New.
>       (want_to_gcse_p): Always return true for hardreg PRE.
>       (hash_scan_set): Add checks for hardreg uses/clobbers.
>       (oprs_unchanged_p): Disable load motion for hardreg PRE pass.
>       (record_last_mem_set_info): Ditto.
>       (compute_hash_table_work): Record hardreg uses.
>       (prune_expressions): Mark hardreg sets as call-clobbered.
>       (compute_pre_data): Add call to prune_hardreg_uses.
>       (pre_expr_reaches_here_p_work): Add comment.
>       (insert_insn_start_basic_block): New functions.
>       (pre_edge_insert): Don't add hardreg sets to predecessor block.
>       (pre_delete): Use hardreg for the reaching reg.
>       (pre_gcse): Don't insert copies for hardreg PRE.
>       (one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
>       (execute_hardreg_pre): New.
>       (class pass_hardreg_pre): New.
>       (pass_hardreg_pre::gate): New.
>       (make_pass_hardreg_pre): New.
>       * passes.def (pass_hardreg_pre): New pass.
>       * tree-pass.h (make_pass_hardreg_pre): New.
>
> [...]
> @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
>       }
>      }
>  }
> +
> +/* A hardreg set is not transparent in a block if that block contains any
> +   uses of the hardreg.  This filters the results of
> +   compute_local_properties, after the result of that function has been
> +   used to define the kills bitmap.

I think this is mostly my ignorance of the code, and would be obvious
if I tried it out locally, but: why do we need to do this after
computing the kills bitmap?  For mode-switching, the kills bitmap
is the inverse of the transparency bitmap, but it sounds like here
you want the kills bitmap to be more selective.
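
For reference, my mental model here is compute_pre_data's (quoting
from memory, so treat this as a sketch):

    /* An expression is killed in a block if it is neither transparent
       nor computed there: kill = ~(transp | comp).  */
    bitmap_ior (ae_kill[bb->index], transp[bb->index], comp[bb->index]);
    bitmap_not (ae_kill[bb->index], ae_kill[bb->index]);

so pruning transp after this point narrows transparency without
widening the kill set.  I can see that that's the effect; I just
wasn't sure why it's the effect we want.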

> +
> +   TRANSP is the destination sbitmap to be updated.
> +
> +   TABLE controls which hash table to look at.  */
> +
> +static void
> +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> +{
> +  unsigned int i;
> +  gcc_assert (doing_hardreg_pre_p);
> +
> +  for (i = 0; i < table->size; i++)
> +    {
> +      struct gcse_expr *expr;
> +
> +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> +     {
> +       int indx = expr->bitmap_index;
> +       df_ref use;
> +
> +       for (use = DF_REG_USE_CHAIN (current_hardreg_regno);
> +            use;
> +            use = DF_REF_NEXT_REG (use))
> +         bitmap_clear_bit (transp[DF_REF_BB (use)->index], indx);
> +     }
> +    }
> +}
>  
>  /* Hash table support.  */
 

> @@ -747,6 +804,9 @@ static basic_block current_bb;
>  static bool
>  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
>  {
> +  if (doing_hardreg_pre_p)
> +    return true;
> +
>  #ifdef STACK_REGS
>    /* On register stack architectures, don't GCSE constants from the
>       constant pool, as the benefits are often swamped by the overhead
> @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
>        }
>  
>      case MEM:
> -      if (! flag_gcse_lm
> +      if (! flag_gcse_lm || doing_hardreg_pre_p

This test occurs often enough that I think it's worth splitting out.
Something like: !do_load_motion ()?
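
To sketch what I mean (name and placement entirely up to you):

    /* Return true if the current pass should perform load motion.
       Hardreg PRE only tracks sets of a single hard register, so
       load motion doesn't apply to it.  */
    static bool
    do_load_motion ()
    {
      return flag_gcse_lm && !doing_hardreg_pre_p;
    }

Then the MEM case above becomes "if (!do_load_motion () || ...)", and
record_last_mem_set_info and one_pre_gcse_pass can use the same test.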

>         || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
>                                    x, avail_p))
>       return false;
> [...]
> @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
>           }
>  
>         note_stores (insn, record_last_set_info, insn);
> +
> +       if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> +         {
> +           /* We need to record the first use of a hardreg to determine if a
> +              set of that hardreg is anticipatable.  */
> +           df_ref ref;
> +           FOR_EACH_INSN_USE (ref, insn)
> +             if (DF_REF_REGNO (ref) == current_hardreg_regno)
> +               {
> +                 hardreg_last_bb = current_bb;
> +                 hardreg_first_use = DF_INSN_LUID (insn);
> +               }
> +         }
>       }

Couldn't we instead check whether the register is live on entry to the block?
That would avoid the extra bit of state.
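
Something like the following is what I had in mind (untested sketch,
relying on the DF_LR information that df_analyze has already computed):

    /* Is the hardreg's current value still needed on entry to this
       block?  If so, a set inserted at the start of the block would
       clobber a live value.  */
    bool hardreg_live_on_entry
      = REGNO_REG_SET_P (df_get_live_in (current_bb),
                         current_hardreg_regno);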

>  
>        /* The next pass builds the hash table.  */
> @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
>      {
>        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
>       {
> +       /* For hardreg pre, we assume that all relevant hardregs are
> +          call-clobbered, and set all bits in prune_exprs if the reg is call
> +          clobbered.

Not sure I understand this.  But...

>                        If the hardreg were merely call-used, then we would
> +          need to remove the expression from the anticipatable and
> +          transparent bitmaps only (after using this to compute the kills
> +          bitmap).  */
> +
> +       if (doing_hardreg_pre_p)
> +         {
> +           bitmap_set_bit (prune_exprs, expr->bitmap_index);
> +           continue;
> +         }
> +

...the effect seems to be to set every bit of prune_exprs, in which
case it might be easier to skip this loop entirely and adjust the later
one to use bitmap_set_range.
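
I.e. something like this (sketch):

    if (doing_hardreg_pre_p)
      /* Every expression sets the hardreg, so every expression is
         treated as call-clobbered.  */
      bitmap_set_range (prune_exprs, 0, expr_hash_table.n_elems);

rather than setting the bits one expression at a time.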

>         /* Note potentially trapping expressions.  */
>         if (may_trap_p (expr->expr))
>           {
> [...]
> @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
>    return 0;
>  }
>  
> +static unsigned int
> +execute_hardreg_pre (void)
> +{
> +  doing_hardreg_pre_p = true;
> +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> +  /* It's possible to avoid this loop, but it isn't worth doing so until
> +     hardreg PRE is used for multiple hardregs.  */

Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
to structure the code so that this just works?  Where are the main
difficulties?  Having to maintain a list of which expressions are
associated with which register, and therefore which expressions
mutually kill each other?

> +  for (int i = 0; regnos[i] != 0; i++)
> +    {
> +      int changed;
> +      current_hardreg_regno = regnos[i];
> +      if (dump_file)
> +     fprintf (dump_file, "Entering hardreg PRE for regno %d\n",
> +             current_hardreg_regno);
> +      delete_unreachable_blocks ();
> +      df_analyze ();
> +      changed = one_pre_gcse_pass ();
> +      flag_rerun_cse_after_global_opts |= changed;

Is this appropriate for the new pass?  We're not really exposing general
CSE opportunities.

> +      if (changed)
> +     cleanup_cfg (0);
> +    }
> +  doing_hardreg_pre_p = false;
> +  return 0;
> +}
> +
>  static unsigned int
>  execute_rtl_hoist (void)
>  {
> @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
>  
>  namespace {
>  
> +const pass_data pass_data_hardreg_pre =
> +{
> +  RTL_PASS, /* type */
> +  "hardreg_pre", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_PRE, /* tv_id */
> +  PROP_cfglayout, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_hardreg_pre : public rtl_opt_pass
> +{
> +public:
> +  pass_hardreg_pre (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  bool gate (function *) final override;
> +  unsigned int execute (function *) final override
> +  {
> +    return execute_hardreg_pre ();
> +  }
> +
> +}; // class pass_hardreg_pre
> +
> +bool
> +pass_hardreg_pre::gate (function *fun)
> +{
> +#ifdef HARDREG_PRE_REGNOS
> +  return optimize > 0
> +    && !fun->calls_setjmp;

Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
have expected them to be needed now, since the modern cfg framework
should represent setjmp correctly.  Jeff, do you agree?  I'll try
removing them and see what breaks...

Thanks,
Richard

> +#else
> +  return false;
> +#endif
> +}
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_hardreg_pre (gcc::context *ctxt)
> +{
> +  return new pass_hardreg_pre (ctxt);
> +}
> +
> +namespace {
> +
>  const pass_data pass_data_rtl_hoist =
>  {
>    RTL_PASS, /* type */
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_rtl_cprop);
>        NEXT_PASS (pass_rtl_pre);
>        NEXT_PASS (pass_rtl_hoist);
> +      NEXT_PASS (pass_hardreg_pre);
>        NEXT_PASS (pass_rtl_cprop);
>        NEXT_PASS (pass_rtl_store_motion);
>        NEXT_PASS (pass_cse_after_global_opts);
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
