Richard Biener <rguent...@suse.de> writes: > On Tue, 12 Nov 2024, Richard Sandiford wrote: > >> Sorry for the slow review. I think Jeff's much better placed to comment >> on this than I am, but here's a stab. Mostly it looks really good to me >> FWIW. >> >> Andrew Carlotti <andrew.carlo...@arm.com> writes: >> > This pass is used to optimise assignments to the FPMR register in >> > aarch64. I chose to implement this as a middle-end pass because it >> > mostly reuses the existing RTL PRE code within gcse.cc. >> > >> > Compared to RTL PRE, the key difference in this new pass is that we >> > insert new writes directly to the destination hardreg, instead of >> > writing to a new pseudo-register and copying the result later. This >> > requires changes to the analysis portion of the pass, because sets >> > cannot be moved before existing instructions that set, use or clobber >> > the hardreg, and the value becomes unavailable after any uses of >> > clobbers of the hardreg. >> > >> > This patch would currently break any debug instructions that use the >> > value of fpmr in a region of code where that value is changed by this >> > pass. I haven't worked out the best way to fix this, but I suspect the >> > issue is uncommon and tricky enough that it would be best to just drop >> > those debug instructions. >> >> Yeah, good question, and pass on that :) Will need to think more about it. >> >> > I've bootstrapped and regression tested this on aarch64, and it should be >> > NFC >> > on other targets. Aside from this, my testing so far has involved hacking >> > in a >> > single FP8 intrinsic and testing various parameters and control flow >> > structures, and checking both the codegen and the LCM bitmaps. I intend to >> > write better and more comprehensive tests once there are some real >> > intrinsic >> > implementations available to use. >> > >> > >> > Is this approach good? Apart from fixing the debug instructions and >> > adding tests, is there anything else I need to change? >> > >> > >> > gcc/ChangeLog: >> > >> > * config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro. >> > * gcse.cc (doing_hardreg_pre_p): New global variable. >> > (current_hardreg_regno): Ditto. >> > (compute_local_properties): Unset transp for hardreg clobbers. >> > (prune_hardreg_uses): New. >> > (want_to_gcse_p): Always return true for hardreg PRE. >> > (hash_scan_set): Add checks for hardreg uses/clobbers. >> > (oprs_unchanged_p): Disable load motion for hardreg PRE pass. >> > (record_last_mem_set_info): Ditto. >> > (compute_hash_table_work): Record hardreg uses. >> > (prune_expressions): Mark hardreg sets as call-clobbered. >> > (compute_pre_data): Add call to prune_hardreg_uses. >> > (pre_expr_reaches_here_p_work): Add comment. >> > (insert_insn_start_basic_block): New functions. >> > (pre_edge_insert): Don't add hardreg sets to predecessor block. >> > (pre_delete): Use hardreg for the reaching reg. >> > (pre_gcse): Don't insert copies for hardreg PRE. >> > (one_pre_gcse_pass): Disable load motion for hardreg PRE pass. >> > (execute_hardreg_pre): New. >> > (class pass_hardreg_pre): New. >> > (pass_hardreg_pre::gate): New. >> > (make_pass_hardreg_pre): New. >> > * passes.def (pass_hardreg_pre): New pass. >> > * tree-pass.h (make_pass_hardreg_pre): New. >> > >> > [...] >> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap >> > *comp, sbitmap *antloc, >> > } >> > } >> > } >> > + >> > +/* A hardreg set is not transparent in a block if there are any uses of >> > that >> > + hardreg. This filters the results of compute_local_properties, after >> > the >> > + result of that function has been used to define the kills bitmap. >> >> I think this is mostly my ignorance of the code, and would be obvious >> if I tried it out locally, but: why do we need to do this after >> computing the kills bitmap? For mode-switching, the kills bitmap >> is the inverse of the transparency bitmap, but it sounds like here >> you want the kills bitmap to be more selective. >> >> > + >> > + TRANSP is the destination sbitmap to be updated. >> > + >> > + TABLE controls which hash table to look at. */ >> > + >> > +static void >> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table) >> > +{ >> > + unsigned int i; >> > + gcc_assert (doing_hardreg_pre_p); >> > + >> > + for (i = 0; i < table->size; i++) >> > + { >> > + struct gcse_expr *expr; >> > + >> > + for (expr = table->table[i]; expr != NULL; expr = >> > expr->next_same_hash) >> > + { >> > + int indx = expr->bitmap_index; >> > + df_ref def; >> > + >> > + for (def = DF_REG_USE_CHAIN (current_hardreg_regno); >> > + def; >> > + def = DF_REF_NEXT_REG (def)) >> > + bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx); >> > + } >> > + } >> > +} >> > >> > /* Hash table support. */ >> >> >> > @@ -747,6 +804,9 @@ static basic_block current_bb; >> > static bool >> > want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr) >> > { >> > + if (doing_hardreg_pre_p) >> > + return true; >> > + >> > #ifdef STACK_REGS >> > /* On register stack architectures, don't GCSE constants from the >> > constant pool, as the benefits are often swamped by the overhead >> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, >> > bool avail_p) >> > } >> > >> > case MEM: >> > - if (! flag_gcse_lm >> > + if (! flag_gcse_lm || doing_hardreg_pre_p >> >> This test occurs often enough that I think it's worth splitting out. >> Something like: !do_load_motion ()? >> >> > || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn), >> > x, avail_p)) >> > return false; >> > [...] >> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d >> > *table) >> > } >> > >> > note_stores (insn, record_last_set_info, insn); >> > + >> > + if (doing_hardreg_pre_p && hardreg_last_bb != current_bb) >> > + { >> > + /* We need to record the first use of a hardreg to determine if a >> > + set of that hardreg is anticipatable. */ >> > + df_ref ref; >> > + FOR_EACH_INSN_USE (ref, insn) >> > + if (DF_REF_REGNO (ref) == current_hardreg_regno) >> > + { >> > + hardreg_last_bb = current_bb; >> > + hardreg_first_use = DF_INSN_LUID (insn); >> > + } >> > + } >> > } >> >> Couldn't we instead check whether the register is live on entry to the block? >> That would avoid the extra bit of state. >> >> > >> > /* The next pass builds the hash table. */ >> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p) >> > { >> > for (expr = expr_hash_table.table[ui]; expr; expr = >> > expr->next_same_hash) >> > { >> > + /* For hardreg pre, we assume that all relevant hardregs are >> > + call-clobbered, and set all bits in prune_exprs if the reg is call >> > + clobbered. >> >> Not sure I understand this. But... >> >> > If the hardreg were merely call-used, then we would >> > + need to remove the expression from the anticipatable and >> > + transparent bitmaps only (after using this to compute the kills >> > + bitmap). */ >> > + >> > + if (doing_hardreg_pre_p) >> > + { >> > + bitmap_set_bit (prune_exprs, expr->bitmap_index); >> > + continue; >> > + } >> > + >> >> ...the effect seems to be to set every bit of prune_exprs, in which >> case it might be easier to skip this loop entirely and adjust the later >> one to use bitmap_set_range. >> >> > /* Note potentially trapping expressions. */ >> > if (may_trap_p (expr->expr)) >> > { >> > [...] >> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void) >> > return 0; >> > } >> > >> > +static unsigned int >> > +execute_hardreg_pre (void) >> > +{ >> > + doing_hardreg_pre_p = true; >> > + unsigned int regnos[] = HARDREG_PRE_REGNOS; >> > + /* It's possible to avoid this loop, but it isn't worth doing so until >> > + hardreg PRE is used for multiple hardregs. */ >> >> Yeah, sounds ok to me. But out of curiosity, how difficult would it be >> to structure the code so that this just works? Where are the main >> difficulties? Having to maintain a list of which expressions are >> associated with which register, and therefore which expressions >> mutually kill each other? >> >> > + for (int i = 0; regnos[i] != 0; i++) >> > + { >> > + int changed; >> > + current_hardreg_regno = regnos[i]; >> > + if (dump_file) >> > + fprintf(dump_file, "Entering hardreg PRE for regno %d\n", >> > + current_hardreg_regno); >> > + delete_unreachable_blocks (); >> > + df_analyze (); >> > + changed = one_pre_gcse_pass (); >> > + flag_rerun_cse_after_global_opts |= changed; >> >> Is this appropriate for the new pass? We're not really exposing general >> CSE opportunities. >> >> > + if (changed) >> > + cleanup_cfg (0); >> > + } >> > + doing_hardreg_pre_p = false; >> > + return 0; >> > +} >> > + >> > static unsigned int >> > execute_rtl_hoist (void) >> > { >> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt) >> > >> > namespace { >> > >> > +const pass_data pass_data_hardreg_pre = >> > +{ >> > + RTL_PASS, /* type */ >> > + "hardreg_pre", /* name */ >> > + OPTGROUP_NONE, /* optinfo_flags */ >> > + TV_PRE, /* tv_id */ >> > + PROP_cfglayout, /* properties_required */ >> > + 0, /* properties_provided */ >> > + 0, /* properties_destroyed */ >> > + 0, /* todo_flags_start */ >> > + TODO_df_finish, /* todo_flags_finish */ >> > +}; >> > + >> > +class pass_hardreg_pre : public rtl_opt_pass >> > +{ >> > +public: >> > + pass_hardreg_pre (gcc::context *ctxt) >> > + : rtl_opt_pass (pass_data_hardreg_pre, ctxt) >> > + {} >> > + >> > + /* opt_pass methods: */ >> > + bool gate (function *) final override; >> > + unsigned int execute (function *) final override >> > + { >> > + return execute_hardreg_pre (); >> > + } >> > + >> > +}; // class pass_rtl_pre >> > + >> > +bool >> > +pass_hardreg_pre::gate (function *fun) >> > +{ >> > +#ifdef HARDREG_PRE_REGNOS >> > + return optimize > 0 >> > + && !fun->calls_setjmp; >> >> Huh. It looks like these setjmp exclusions go back to 1998. I wouldn't >> have expected them to be needed now, since the modern cfg framework >> should represent setjmp correctly. Jeff, do you agree? I'll try >> removing them and see what breaks... > > The problem with setjmp on RTL is that we fail to preserve abnormal > edges during RTL expansion and fail in the impossible job to recover > all that are required for correctness (to prevent code motion).
Ah, ok. I'd wrongly thought that that was a solved problem now. (Kind-of curious why it's impossible to recover the info. I guess that's a bit of a distraction though. Leaning further into reconstructing rather than preserving the cfg would be the wrong direction anyway.) Thanks, Richard > With PRE that only operates on hardregs and in particular does not > introduce alternate register usages or move memory ops there might > be no issue. > > Richard. > >> Thanks, >> Richard >> >> > +#else >> > + return false; >> > +#endif >> > +} >> > + >> > +} // anon namespace >> > + >> > +rtl_opt_pass * >> > +make_pass_hardreg_pre (gcc::context *ctxt) >> > +{ >> > + return new pass_hardreg_pre (ctxt); >> > +} >> > + >> > +namespace { >> > + >> > const pass_data pass_data_rtl_hoist = >> > { >> > RTL_PASS, /* type */ >> > diff --git a/gcc/passes.def b/gcc/passes.def >> > index >> > 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 >> > 100644 >> > --- a/gcc/passes.def >> > +++ b/gcc/passes.def >> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3. If not see >> > NEXT_PASS (pass_rtl_cprop); >> > NEXT_PASS (pass_rtl_pre); >> > NEXT_PASS (pass_rtl_hoist); >> > + NEXT_PASS (pass_hardreg_pre); >> > NEXT_PASS (pass_rtl_cprop); >> > NEXT_PASS (pass_rtl_store_motion); >> > NEXT_PASS (pass_cse_after_global_opts); >> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h >> > index >> > a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f >> > 100644 >> > --- a/gcc/tree-pass.h >> > +++ b/gcc/tree-pass.h >> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context >> > *ctxt); >> > extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt); >> > extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt); >> > extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt); >> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt); >> > extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt); >> > extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt); >> > extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt); >>