Richard Sandiford <richard.sandif...@arm.com> writes: > Thomas Schwinge <tschwi...@baylibre.com> writes: >> Hi! >> >> On 2024-06-27T23:20:18+0200, I wrote: >>> On 2024-06-27T22:27:21+0200, I wrote: >>>> On 2024-06-27T18:49:17+0200, I wrote: >>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford >>>>> <richard.sandif...@arm.com> wrote: >>>>>> This patch adds a combine pass that runs late in the pipeline. >>>> >>>> [After sending, I realized I replied to a previous thread of this work.] >>>> >>>>> I've beek looking a bit through recent nvptx target code generation >>>>> changes for GCC target libraries, and thought I'd also share here my >>>>> findings for the "late-combine" changes in isolation, for nvptx target. >>>>> >>>>> First the unexpected thing: >>>> >>>> So much for "unexpected thing" -- next level of unexpected here... >>>> Appreciated if anyone feels like helping me find my way through this, but >>>> I totally understand if you've got other things to do. >>> >>> OK, I found something already. (Unexpectedly quickly...) ;-) >>> >>>>> there are a few cases where we now see unused >>>>> registers get declared >> >>> But in fact, for both cases >> >> Now tested: 's%both%all'. :-) >> >>> the unexpected difference goes away if after >>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'. That's normally run >>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's >>> all not active for nvptx target given '!reload_completed', given nvptx is >>> 'targetm.no_register_allocation'. Maybe we need to enable a few more >>> passes, or is there anything in 'pass_late_combine' to change, so that we >>> don't run into this? Does it inadvertently mark registers live or >>> something like that? >> >> Basically, is 'pass_late_combine' potentionally doing things that depend >> on later clean-up? (..., or shouldn't it be doing these things in the >> first place?) > > It's possible that late-combine could expose dead code, but I imagine > it's a niche case. > > I had a look at the nvptx logs from my comparison, and the cases in > which I saw this seemed to be those where late-combine doesn't find > anything to do. Does that match your examples? Specifically, > the effect should be the same with -fdbg-cnt=late_combine:0-0 > > I think what's happening is that: > > - combine exposes dead code > > - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared > up the dead code > > - late-combine instead runs df_analyze without that flag (since late-combine > itself doesn't really care whether dead code is present) > > - if late-combine doesn't do anything, ce2's df_analyze call has nothing > to do, and skips even the DCE > > The easiest fix would be to add: > > df_set_flags (DF_LR_RUN_DCE); > > before df_analyze in late-combine.cc, so that it behaves like ce2. > But the arrangement feels wrong. I would have expected DF_LR_RUN_DCE > to depend on whether df_analyze had been called since the last DCE pass > (whether DF_LR_RUN_DCE or a full DCE).
I'm testing the attached patch to do that. I'll submit it properly if testing passes, but it seems to fix the extra-register problem for me. Thanks, Richard --- Give fast DCE a separate dirty flag Thomas pointed out that we sometimes failed to eliminate some dead code (specifically clobbers of otherwise unused registers) on nvptx when late-combine is enabled. This happens because: - combine is able to optimise the function in a way that exposes dead code. This leaves the df information in a "dirty" state. - late_combine calls df_analyze without DF_LR_RUN_DCE run set. This updates the df information and clears the "dirty" state. - late_combine doesn't find any extra optimisations, and so leaves the df information up-to-date. - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set. Because the df information is already up-to-date, fast DCE is not run. The upshot is that running late-combine has the effect of suppressing a DCE opportunity that would have been noticed without late_combine. I think this shows that we should track the state of the DCE separately from the LR problem. Every pass updates the latter, but not all passes update the former. gcc/ * df.h (DF_LR_DCE): New df_problem_id. (df_lr_dce): New macro. * df-core.cc (rest_of_handle_df_finish): Check for a null free_fun. * df-problems.cc (df_lr_finalize): Split out fast DCE handling to... (df_lr_dce_finalize): ...this new function. (problem_LR_DCE): New df_problem. (df_lr_add_problem): Register LR_DCE rather than LR itself. * dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty. --- gcc/dce.cc | 3 ++ gcc/df-core.cc | 3 +- gcc/df-problems.cc | 96 ++++++++++++++++++++++++++++++++-------------- gcc/df.h | 2 + 4 files changed, 74 insertions(+), 30 deletions(-) diff --git a/gcc/dce.cc b/gcc/dce.cc index be1a2a87732..04e8d98818d 100644 --- a/gcc/dce.cc +++ b/gcc/dce.cc @@ -1182,6 +1182,9 @@ fast_dce (bool word_level) BITMAP_FREE (processed); BITMAP_FREE (redo_out); BITMAP_FREE (all_blocks); + + /* Both forms of DCE should make further DCE unnecessary. */ + df_lr_dce->solutions_dirty = false; } diff --git a/gcc/df-core.cc b/gcc/df-core.cc index b0e8a88d433..8fd778a8618 100644 --- a/gcc/df-core.cc +++ b/gcc/df-core.cc @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void) for (i = 0; i < df->num_problems_defined; i++) { struct dataflow *dflow = df->problems_in_order[i]; - dflow->problem->free_fun (); + if (dflow->problem->free_fun) + dflow->problem->free_fun (); } free (df->postorder); diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc index 88ee0dd67fc..bfd24bd1e86 100644 --- a/gcc/df-problems.cc +++ b/gcc/df-problems.cc @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index) } -/* Run the fast dce as a side effect of building LR. */ - static void -df_lr_finalize (bitmap all_blocks) +df_lr_finalize (bitmap) { df_lr->solutions_dirty = false; - if (df->changeable_flags & DF_LR_RUN_DCE) - { - run_fast_df_dce (); - - /* If dce deletes some instructions, we need to recompute the lr - solution before proceeding further. The problem is that fast - dce is a pessimestic dataflow algorithm. In the case where - it deletes a statement S inside of a loop, the uses inside of - S may not be deleted from the dataflow solution because they - were carried around the loop. While it is conservatively - correct to leave these extra bits, the standards of df - require that we maintain the best possible (least fixed - point) solution. The only way to do that is to redo the - iteration from the beginning. See PR35805 for an - example. */ - if (df_lr->solutions_dirty) - { - df_clear_flags (DF_LR_RUN_DCE); - df_lr_alloc (all_blocks); - df_lr_local_compute (all_blocks); - df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); - df_lr_finalize (all_blocks); - df_set_flags (DF_LR_RUN_DCE); - } - } } @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR = false /* Reset blocks on dropping out of blocks_to_analyze. */ }; +/* Run the fast DCE after building LR. This is a separate problem so that + the "dirty" flag is only cleared after a DCE pass is actually run. */ + +static void +df_lr_dce_finalize (bitmap all_blocks) +{ + if (!(df->changeable_flags & DF_LR_RUN_DCE)) + return; + + /* Also clears df_lr_dce->solutions_dirty. */ + run_fast_df_dce (); + + /* If dce deletes some instructions, we need to recompute the lr + solution before proceeding further. The problem is that fast + dce is a pessimestic dataflow algorithm. In the case where + it deletes a statement S inside of a loop, the uses inside of + S may not be deleted from the dataflow solution because they + were carried around the loop. While it is conservatively + correct to leave these extra bits, the standards of df + require that we maintain the best possible (least fixed + point) solution. The only way to do that is to redo the + iteration from the beginning. See PR35805 for an + example. */ + if (df_lr->solutions_dirty) + { + df_clear_flags (DF_LR_RUN_DCE); + df_lr_alloc (all_blocks); + df_lr_local_compute (all_blocks); + df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); + df_lr_finalize (all_blocks); + df_set_flags (DF_LR_RUN_DCE); + } +} + +static const struct df_problem problem_LR_DCE +{ + DF_LR_DCE, /* Problem id. */ + DF_BACKWARD, /* Direction (arbitrary). */ + NULL, /* Allocate the problem specific data. */ + NULL, /* Reset global information. */ + NULL, /* Free basic block info. */ + NULL, /* Local compute function. */ + NULL, /* Init the solution specific data. */ + NULL, /* Worklist solver. */ + NULL, /* Confluence operator 0. */ + NULL, /* Confluence operator n. */ + NULL, /* Transfer function. */ + df_lr_dce_finalize, /* Finalize function. */ + NULL, /* Free all of the problem information. */ + NULL, /* Remove this problem from the stack of dataflow problems. */ + NULL, /* Debugging. */ + NULL, /* Debugging start block. */ + NULL, /* Debugging end block. */ + NULL, /* Debugging start insn. */ + NULL, /* Debugging end insn. */ + NULL, /* Incremental solution verify start. */ + NULL, /* Incremental solution verify end. */ + &problem_LR, /* Dependent problem. */ + 0, /* Size of entry of block_info array. */ + TV_DF_LR, /* Timing variable. */ + false /* Reset blocks on dropping out of blocks_to_analyze. */ +}; + /* Create a new DATAFLOW instance and add it to an existing instance of DF. The returned structure is what is used to get at the @@ -1274,7 +1310,9 @@ static const struct df_problem problem_LR = void df_lr_add_problem (void) { - df_add_problem (&problem_LR); + /* Also add the fast DCE problem. It is then conditionally enabled by + the DF_LR_RUN_DCE flag. */ + df_add_problem (&problem_LR_DCE); /* These will be initialized when df_scan_blocks processes each block. */ df_lr->out_of_date_transfer_functions = BITMAP_ALLOC (&df_bitmap_obstack); diff --git a/gcc/df.h b/gcc/df.h index c4e690b40cf..bbb4f297b47 100644 --- a/gcc/df.h +++ b/gcc/df.h @@ -47,6 +47,7 @@ enum df_problem_id { DF_SCAN, DF_LR, /* Live Registers backward. */ + DF_LR_DCE, /* Dead code elimination post-pass for LR. */ DF_LIVE, /* Live Registers & Uninitialized Registers */ DF_RD, /* Reaching Defs. */ DF_CHAIN, /* Def-Use and/or Use-Def Chains. */ @@ -940,6 +941,7 @@ extern class df_d *df; #define df_scan (df->problems_by_index[DF_SCAN]) #define df_rd (df->problems_by_index[DF_RD]) #define df_lr (df->problems_by_index[DF_LR]) +#define df_lr_dce (df->problems_by_index[DF_LR_DCE]) #define df_live (df->problems_by_index[DF_LIVE]) #define df_chain (df->problems_by_index[DF_CHAIN]) #define df_word_lr (df->problems_by_index[DF_WORD_LR]) -- 2.25.1