Sorry for the slow review. I think Jeff's much better placed to comment on this than I am, but here's a stab. Mostly it looks really good to me FWIW.
Andrew Carlotti <andrew.carlo...@arm.com> writes:
> This pass is used to optimise assignments to the FPMR register in
> aarch64.  I chose to implement this as a middle-end pass because it
> mostly reuses the existing RTL PRE code within gcse.cc.
>
> Compared to RTL PRE, the key difference in this new pass is that we
> insert new writes directly to the destination hardreg, instead of
> writing to a new pseudo-register and copying the result later.  This
> requires changes to the analysis portion of the pass, because sets
> cannot be moved before existing instructions that set, use or clobber
> the hardreg, and the value becomes unavailable after any uses or
> clobbers of the hardreg.
>
> This patch would currently break any debug instructions that use the
> value of fpmr in a region of code where that value is changed by this
> pass.  I haven't worked out the best way to fix this, but I suspect the
> issue is uncommon and tricky enough that it would be best to just drop
> those debug instructions.

Yeah, good question, and pass on that :)  Will need to think more
about it.

> I've bootstrapped and regression tested this on aarch64, and it should
> be NFC on other targets.  Aside from this, my testing so far has
> involved hacking in a single FP8 intrinsic and testing various
> parameters and control flow structures, and checking both the codegen
> and the LCM bitmaps.  I intend to write better and more comprehensive
> tests once there are some real intrinsic implementations available to
> use.
>
> Is this approach good?  Apart from fixing the debug instructions and
> adding tests, is there anything else I need to change?
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> 	(current_hardreg_regno): Ditto.
> 	(compute_local_properties): Unset transp for hardreg clobbers.
> 	(prune_hardreg_uses): New.
> 	(want_to_gcse_p): Always return true for hardreg PRE.
> 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> 	(record_last_mem_set_info): Ditto.
> 	(compute_hash_table_work): Record hardreg uses.
> 	(prune_expressions): Mark hardreg sets as call-clobbered.
> 	(compute_pre_data): Add call to prune_hardreg_uses.
> 	(pre_expr_reaches_here_p_work): Add comment.
> 	(insert_insn_start_basic_block): New functions.
> 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> 	(pre_delete): Use hardreg for the reaching reg.
> 	(pre_gcse): Don't insert copies for hardreg PRE.
> 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> 	(execute_hardreg_pre): New.
> 	(class pass_hardreg_pre): New.
> 	(pass_hardreg_pre::gate): New.
> 	(make_pass_hardreg_pre): New.
> 	* passes.def (pass_hardreg_pre): New pass.
> 	* tree-pass.h (make_pass_hardreg_pre): New.
>
> [...]
> @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> 	    }
> 	}
>     }
> +
> +/* A hardreg set is not transparent in a block if there are any uses of that
> +   hardreg.  This filters the results of compute_local_properties, after the
> +   result of that function has been used to define the kills bitmap.

I think this is mostly my ignorance of the code, and would be obvious
if I tried it out locally, but: why do we need to do this after
computing the kills bitmap?  For mode-switching, the kills bitmap is
the inverse of the transparency bitmap, but it sounds like here you
want the kills bitmap to be more selective.

> +
> +   TRANSP is the destination sbitmap to be updated.
> +
> +   TABLE controls which hash table to look at.
> +   */
> +
> +static void
> +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> +{
> +  unsigned int i;
> +  gcc_assert (doing_hardreg_pre_p);
> +
> +  for (i = 0; i < table->size; i++)
> +    {
> +      struct gcse_expr *expr;
> +
> +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> +	{
> +	  int indx = expr->bitmap_index;
> +	  df_ref def;
> +
> +	  for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
> +	       def;
> +	       def = DF_REF_NEXT_REG (def))
> +	    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> +	}
> +    }
> +}
>
> /* Hash table support.  */
> @@ -747,6 +804,9 @@ static basic_block current_bb;
> static bool
> want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
> {
> +  if (doing_hardreg_pre_p)
> +    return true;
> +
> #ifdef STACK_REGS
>   /* On register stack architectures, don't GCSE constants from the
>      constant pool, as the benefits are often swamped by the overhead
> @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
>     }
>
>     case MEM:
> -      if (! flag_gcse_lm
> +      if (! flag_gcse_lm || doing_hardreg_pre_p

This test occurs often enough that I think it's worth splitting out.
Something like: !do_load_motion ()?

> 	  || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
> 				     x, avail_p))
> 	return false;
> [...]
> @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
> 	}
>
>       note_stores (insn, record_last_set_info, insn);
> +
> +      if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> +	{
> +	  /* We need to record the first use of a hardreg to determine if a
> +	     set of that hardreg is anticipatable.  */
> +	  df_ref ref;
> +	  FOR_EACH_INSN_USE (ref, insn)
> +	    if (DF_REF_REGNO (ref) == current_hardreg_regno)
> +	      {
> +		hardreg_last_bb = current_bb;
> +		hardreg_first_use = DF_INSN_LUID (insn);
> +	      }
> +	}
>     }

Couldn't we instead check whether the register is live on entry to the
block?  That would avoid the extra bit of state.
>
> /* The next pass builds the hash table.  */
> @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
>     {
>       for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
> 	{
> +	  /* For hardreg pre, we assume that all relevant hardregs are
> +	     call-clobbered, and set all bits in prune_exprs if the reg is
> +	     call clobbered.

Not sure I understand this.  But...

> 	     If the hardreg were merely call-used, then we would
> +	     need to remove the expression from the anticipatable and
> +	     transparent bitmaps only (after using this to compute the kills
> +	     bitmap).  */
> +
> +	  if (doing_hardreg_pre_p)
> +	    {
> +	      bitmap_set_bit (prune_exprs, expr->bitmap_index);
> +	      continue;
> +	    }
> +

...the effect seems to be to set every bit of prune_exprs, in which
case it might be easier to skip this loop entirely and adjust the
later one to use bitmap_set_range.

> 	  /* Note potentially trapping expressions.  */
> 	  if (may_trap_p (expr->expr))
> 	    {
> [...]
> @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
>   return 0;
> }
>
> +static unsigned int
> +execute_hardreg_pre (void)
> +{
> +  doing_hardreg_pre_p = true;
> +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> +  /* It's possible to avoid this loop, but it isn't worth doing so until
> +     hardreg PRE is used for multiple hardregs.  */

Yeah, sounds ok to me.  But out of curiosity, how difficult would it
be to structure the code so that this just works?  Where are the main
difficulties?  Having to maintain a list of which expressions are
associated with which register, and therefore which expressions
mutually kill each other?

> +  for (int i = 0; regnos[i] != 0; i++)
> +    {
> +      int changed;
> +      current_hardreg_regno = regnos[i];
> +      if (dump_file)
> +	fprintf(dump_file, "Entering hardreg PRE for regno %d\n",
> +		current_hardreg_regno);
> +      delete_unreachable_blocks ();
> +      df_analyze ();
> +      changed = one_pre_gcse_pass ();
> +      flag_rerun_cse_after_global_opts |= changed;

Is this appropriate for the new pass?
We're not really exposing general CSE opportunities.

> +      if (changed)
> +	cleanup_cfg (0);
> +    }
> +  doing_hardreg_pre_p = false;
> +  return 0;
> +}
> +
> static unsigned int
> execute_rtl_hoist (void)
> {
> @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
>
> namespace {
>
> +const pass_data pass_data_hardreg_pre =
> +{
> +  RTL_PASS, /* type */
> +  "hardreg_pre", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_PRE, /* tv_id */
> +  PROP_cfglayout, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_hardreg_pre : public rtl_opt_pass
> +{
> +public:
> +  pass_hardreg_pre (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  bool gate (function *) final override;
> +  unsigned int execute (function *) final override
> +  {
> +    return execute_hardreg_pre ();
> +  }
> +
> +}; // class pass_rtl_pre
> +
> +bool
> +pass_hardreg_pre::gate (function *fun)
> +{
> +#ifdef HARDREG_PRE_REGNOS
> +  return optimize > 0
> +    && !fun->calls_setjmp;

Huh.  It looks like these setjmp exclusions go back to 1998.
I wouldn't have expected them to be needed now, since the modern cfg
framework should represent setjmp correctly.  Jeff, do you agree?
I'll try removing them and see what breaks...

Thanks,
Richard

> +#else
> +  return false;
> +#endif
> +}
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_hardreg_pre (gcc::context *ctxt)
> +{
> +  return new pass_hardreg_pre (ctxt);
> +}
> +
> +namespace {
> +
> const pass_data pass_data_rtl_hoist =
> {
>   RTL_PASS, /* type */
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.
> If not see
> 	  NEXT_PASS (pass_rtl_cprop);
> 	  NEXT_PASS (pass_rtl_pre);
> 	  NEXT_PASS (pass_rtl_hoist);
> +	  NEXT_PASS (pass_hardreg_pre);
> 	  NEXT_PASS (pass_rtl_cprop);
> 	  NEXT_PASS (pass_rtl_store_motion);
> 	  NEXT_PASS (pass_cse_after_global_opts);
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
> extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
> extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
> extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
> extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
> extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
> extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);