Hi Richard! On 2024-06-28T17:48:30+0100, Richard Sandiford <richard.sandif...@arm.com> wrote: > Richard Sandiford <richard.sandif...@arm.com> writes: >> Thomas Schwinge <tschwi...@baylibre.com> writes: >>> On 2024-06-27T23:20:18+0200, I wrote: >>>> On 2024-06-27T22:27:21+0200, I wrote: >>>>> On 2024-06-27T18:49:17+0200, I wrote: >>>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford >>>>>> <richard.sandif...@arm.com> wrote: >>>>>>> This patch adds a combine pass that runs late in the pipeline. >>>>> >>>>> [After sending, I realized I replied to a previous thread of this work.] >>>>> >>>>>> I've beek looking a bit through recent nvptx target code generation >>>>>> changes for GCC target libraries, and thought I'd also share here my >>>>>> findings for the "late-combine" changes in isolation, for nvptx target. >>>>>> >>>>>> First the unexpected thing: >>>>> >>>>> So much for "unexpected thing" -- next level of unexpected here... >>>>> Appreciated if anyone feels like helping me find my way through this, but >>>>> I totally understand if you've got other things to do. >>>> >>>> OK, I found something already. (Unexpectedly quickly...) ;-) >>>> >>>>>> there are a few cases where we now see unused >>>>>> registers get declared >>> >>>> But in fact, for both cases >>> >>> Now tested: 's%both%all'. :-) >>> >>>> the unexpected difference goes away if after >>>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'. That's normally run >>>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's >>>> all not active for nvptx target given '!reload_completed', given nvptx is >>>> 'targetm.no_register_allocation'. Maybe we need to enable a few more >>>> passes, or is there anything in 'pass_late_combine' to change, so that we >>>> don't run into this? Does it inadvertently mark registers live or >>>> something like that? >>> >>> Basically, is 'pass_late_combine' potentionally doing things that depend >>> on later clean-up? (..., or shouldn't it be doing these things in the >>> first place?) >> >> It's possible that late-combine could expose dead code, but I imagine >> it's a niche case. >> >> I had a look at the nvptx logs from my comparison, and the cases in >> which I saw this seemed to be those where late-combine doesn't find >> anything to do. Does that match your examples? Specifically, >> the effect should be the same with -fdbg-cnt=late_combine:0-0 >> >> I think what's happening is that: >> >> - combine exposes dead code >> >> - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared >> up the dead code >> >> - late-combine instead runs df_analyze without that flag (since late-combine >> itself doesn't really care whether dead code is present) >> >> - if late-combine doesn't do anything, ce2's df_analyze call has nothing >> to do, and skips even the DCE >> >> The easiest fix would be to add: >> >> df_set_flags (DF_LR_RUN_DCE); >> >> before df_analyze in late-combine.cc, so that it behaves like ce2. >> But the arrangement feels wrong. I would have expected DF_LR_RUN_DCE >> to depend on whether df_analyze had been called since the last DCE pass >> (whether DF_LR_RUN_DCE or a full DCE). > > I'm testing the attached patch to do that. I'll submit it properly if > testing passes, but it seems to fix the extra-register problem for me.
> Give fast DCE a separate dirty flag Thanks, and yes, your analysis makes sense to me (to the extent that I only superficially understand these parts of GCC) -- and I confirm that your proposed change to "Give fast DCE a separate dirty flag" does address the issue for nvptx target. Grüße Thomas > Thomas pointed out that we sometimes failed to eliminate some dead code > (specifically clobbers of otherwise unused registers) on nvptx when > late-combine is enabled. This happens because: > > - combine is able to optimise the function in a way that exposes dead code. > This leaves the df information in a "dirty" state. > > - late_combine calls df_analyze without DF_LR_RUN_DCE run set. > This updates the df information and clears the "dirty" state. > > - late_combine doesn't find any extra optimisations, and so leaves > the df information up-to-date. > > - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set. > Because the df information is already up-to-date, fast DCE is > not run. > > The upshot is that running late-combine has the effect of suppressing > a DCE opportunity that would have been noticed without late_combine. > > I think this shows that we should track the state of the DCE separately > from the LR problem. Every pass updates the latter, but not all passes > update the former. > > gcc/ > * df.h (DF_LR_DCE): New df_problem_id. > (df_lr_dce): New macro. > * df-core.cc (rest_of_handle_df_finish): Check for a null free_fun. > * df-problems.cc (df_lr_finalize): Split out fast DCE handling to... > (df_lr_dce_finalize): ...this new function. > (problem_LR_DCE): New df_problem. > (df_lr_add_problem): Register LR_DCE rather than LR itself. > * dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty. > --- > gcc/dce.cc | 3 ++ > gcc/df-core.cc | 3 +- > gcc/df-problems.cc | 96 ++++++++++++++++++++++++++++++++-------------- > gcc/df.h | 2 + > 4 files changed, 74 insertions(+), 30 deletions(-) > > diff --git a/gcc/dce.cc b/gcc/dce.cc > index be1a2a87732..04e8d98818d 100644 > --- a/gcc/dce.cc > +++ b/gcc/dce.cc > @@ -1182,6 +1182,9 @@ fast_dce (bool word_level) > BITMAP_FREE (processed); > BITMAP_FREE (redo_out); > BITMAP_FREE (all_blocks); > + > + /* Both forms of DCE should make further DCE unnecessary. */ > + df_lr_dce->solutions_dirty = false; > } > > > diff --git a/gcc/df-core.cc b/gcc/df-core.cc > index b0e8a88d433..8fd778a8618 100644 > --- a/gcc/df-core.cc > +++ b/gcc/df-core.cc > @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void) > for (i = 0; i < df->num_problems_defined; i++) > { > struct dataflow *dflow = df->problems_in_order[i]; > - dflow->problem->free_fun (); > + if (dflow->problem->free_fun) > + dflow->problem->free_fun (); > } > > free (df->postorder); > diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc > index 88ee0dd67fc..bfd24bd1e86 100644 > --- a/gcc/df-problems.cc > +++ b/gcc/df-problems.cc > @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index) > } > > > -/* Run the fast dce as a side effect of building LR. */ > - > static void > -df_lr_finalize (bitmap all_blocks) > +df_lr_finalize (bitmap) > { > df_lr->solutions_dirty = false; > - if (df->changeable_flags & DF_LR_RUN_DCE) > - { > - run_fast_df_dce (); > - > - /* If dce deletes some instructions, we need to recompute the lr > - solution before proceeding further. The problem is that fast > - dce is a pessimestic dataflow algorithm. In the case where > - it deletes a statement S inside of a loop, the uses inside of > - S may not be deleted from the dataflow solution because they > - were carried around the loop. While it is conservatively > - correct to leave these extra bits, the standards of df > - require that we maintain the best possible (least fixed > - point) solution. The only way to do that is to redo the > - iteration from the beginning. See PR35805 for an > - example. */ > - if (df_lr->solutions_dirty) > - { > - df_clear_flags (DF_LR_RUN_DCE); > - df_lr_alloc (all_blocks); > - df_lr_local_compute (all_blocks); > - df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); > - df_lr_finalize (all_blocks); > - df_set_flags (DF_LR_RUN_DCE); > - } > - } > } > > > @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR = > false /* Reset blocks on dropping out of > blocks_to_analyze. */ > }; > > +/* Run the fast DCE after building LR. This is a separate problem so that > + the "dirty" flag is only cleared after a DCE pass is actually run. */ > + > +static void > +df_lr_dce_finalize (bitmap all_blocks) > +{ > + if (!(df->changeable_flags & DF_LR_RUN_DCE)) > + return; > + > + /* Also clears df_lr_dce->solutions_dirty. */ > + run_fast_df_dce (); > + > + /* If dce deletes some instructions, we need to recompute the lr > + solution before proceeding further. The problem is that fast > + dce is a pessimestic dataflow algorithm. In the case where > + it deletes a statement S inside of a loop, the uses inside of > + S may not be deleted from the dataflow solution because they > + were carried around the loop. While it is conservatively > + correct to leave these extra bits, the standards of df > + require that we maintain the best possible (least fixed > + point) solution. The only way to do that is to redo the > + iteration from the beginning. See PR35805 for an > + example. */ > + if (df_lr->solutions_dirty) > + { > + df_clear_flags (DF_LR_RUN_DCE); > + df_lr_alloc (all_blocks); > + df_lr_local_compute (all_blocks); > + df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); > + df_lr_finalize (all_blocks); > + df_set_flags (DF_LR_RUN_DCE); > + } > +} > + > +static const struct df_problem problem_LR_DCE > +{ > + DF_LR_DCE, /* Problem id. */ > + DF_BACKWARD, /* Direction (arbitrary). */ > + NULL, /* Allocate the problem specific data. */ > + NULL, /* Reset global information. */ > + NULL, /* Free basic block info. */ > + NULL, /* Local compute function. */ > + NULL, /* Init the solution specific data. */ > + NULL, /* Worklist solver. */ > + NULL, /* Confluence operator 0. */ > + NULL, /* Confluence operator n. */ > + NULL, /* Transfer function. */ > + df_lr_dce_finalize, /* Finalize function. */ > + NULL, /* Free all of the problem information. */ > + NULL, /* Remove this problem from the stack of > dataflow problems. */ > + NULL, /* Debugging. */ > + NULL, /* Debugging start block. */ > + NULL, /* Debugging end block. */ > + NULL, /* Debugging start insn. */ > + NULL, /* Debugging end insn. */ > + NULL, /* Incremental solution verify start. */ > + NULL, /* Incremental solution verify end. */ > + &problem_LR, /* Dependent problem. */ > + 0, /* Size of entry of block_info array. */ > + TV_DF_LR, /* Timing variable. */ > + false /* Reset blocks on dropping out of > blocks_to_analyze. */ > +}; > + > > /* Create a new DATAFLOW instance and add it to an existing instance > of DF. The returned structure is what is used to get at the > @@ -1274,7 +1310,9 @@ static const struct df_problem problem_LR = > void > df_lr_add_problem (void) > { > - df_add_problem (&problem_LR); > + /* Also add the fast DCE problem. It is then conditionally enabled by > + the DF_LR_RUN_DCE flag. */ > + df_add_problem (&problem_LR_DCE); > /* These will be initialized when df_scan_blocks processes each > block. */ > df_lr->out_of_date_transfer_functions = BITMAP_ALLOC (&df_bitmap_obstack); > diff --git a/gcc/df.h b/gcc/df.h > index c4e690b40cf..bbb4f297b47 100644 > --- a/gcc/df.h > +++ b/gcc/df.h > @@ -47,6 +47,7 @@ enum df_problem_id > { > DF_SCAN, > DF_LR, /* Live Registers backward. */ > + DF_LR_DCE, /* Dead code elimination post-pass for LR. */ > DF_LIVE, /* Live Registers & Uninitialized Registers */ > DF_RD, /* Reaching Defs. */ > DF_CHAIN, /* Def-Use and/or Use-Def Chains. */ > @@ -940,6 +941,7 @@ extern class df_d *df; > #define df_scan (df->problems_by_index[DF_SCAN]) > #define df_rd (df->problems_by_index[DF_RD]) > #define df_lr (df->problems_by_index[DF_LR]) > +#define df_lr_dce (df->problems_by_index[DF_LR_DCE]) > #define df_live (df->problems_by_index[DF_LIVE]) > #define df_chain (df->problems_by_index[DF_CHAIN]) > #define df_word_lr (df->problems_by_index[DF_WORD_LR]) > -- > 2.25.1