Hi Richard, Here's a revised version of my patch incorporating both your suggestions. The algorithm now uses two worklist vectors, and pointers to them, alternating between them on each iteration, which allows the code to handle an arbitrary number of passes without the previous code duplication. This then allows a "for (pass=2; pass<=passes; pass++)"-like idiom, that allows it to try two passes (the current behaviour) at -O1, but three passes with -O2 and higher. This also provides a convenient point, should someone wish to control the number of cprop_hardreg pass iterations from the command line (or a backend) in future.
This patch has been retested on x86_64-pc-linux-gnu, both with an without the proposed i386 backend patch for reduced register shuffling, with make bootstrap and make -k check, both with/without target_board= unix{-m32}, with no new failures. Is this what you had in mind? 2022-06-03 Roger Sayle <ro...@nextmovesoftware.com> Richard Biener <rguent...@suse.de> gcc/ChangeLog * regcprop.cc (pass_cprop_hardreg::execute): Perform a third iteration over each basic block that was updated by the second iteration. Cheers, Roger -- > -----Original Message----- > From: Richard Biener <richard.guent...@gmail.com> > Sent: 02 June 2022 11:56 > To: Roger Sayle <ro...@nextmovesoftware.com> > Cc: GCC Patches <gcc-patches@gcc.gnu.org> > Subject: Re: [PATCH/RFC] cprop_hardreg... Third time's a charm. > > On Thu, Jun 2, 2022 at 12:20 PM Roger Sayle <ro...@nextmovesoftware.com> > wrote: > > > > > > > > This middle-end patch proposes the "hard register constant propagation" > > > > pass be performed up to three times on each basic block (up from the > > > > current two times) if the second pass successfully made changes. > > > > > > > > The motivation for three passes is to handle the "swap idiom" (i.e. > > > > t = x; x = y; y = t;" sequences) that get generated by register > > allocation > > > > (reload). > > > > > > > > Consider the x86_64 test case for __int128 addition recently discussed > > > > on gcc-patches. With that proposed patch, the input to the > > cprop_hardreg > > > > pass looks like: > > > > > > > > movq %rdi, %r8 > > > > movq %rsi, %rdi > > > > movq %r8, %rsi > > > > movq %rdx, %rax > > > > movq %rcx, %rdx > > > > addq %rsi %rax > > > > adcq %rdi, %rdx > > > > ret > > > > > > > > where the first three instructions effectively swap %rsi and %rdi. > > > > > > > > On the first pass of cprop_hardreg, we notice that the third insn, > > > > %rsi := %r8, is redundant and can eliminated/propagated to produce: > > > > > > > > movq %rdi, %r8 > > > > movq %rsi, %rdi > > > > movq %rdx, %rax > > > > movq %rcx, %rdx > > > > addq %r8 %rax > > > > adcq %rdi, %rdx > > > > ret > > > > > > > > Because a successful propagation was found, cprop_hardreg then runs > > > > a second pass/sweep on affected basic blocks (using worklist), and > > > > on this second pass notices that the second instruction, %rdi := %rsi, > > > > may now be propagated (%rsi was killed in the before the first > > transform), > > > > and after a second pass, we now end up with: > > > > > > > > movq %rdi, %r8 > > > > movq %rdx, %rax > > > > movq %rcx, %rdx > > > > addq %r8, %rax > > > > adcq %rsi, %rdx > > > > ret > > > > > > > > which is the current behaviour on mainline. However, a third and > > final > > > > pass would now notice that the first insn, "%r8 := %rdi" is also now > > > > eliminable, and a third iteration would produce optimal code: > > > > > > > > movq %rdx, %rax > > > > movq %rcx, %rdx > > > > addq %rdi, %rax > > > > adcq %rsi, %rdx > > > > ret > > > > > > > > The patch below creates an additional worklist, third_pass, that is > > > > populated with the basic block id's of blocks that were updated during > > > > the second pass. Does the motivation for three passes (reload doesn't > > > > generate more or less than three instructions to swap a pair of > > registers) > > > > seem reasonable for all targets? If cprop_hardreg is considered an > > > > expensive pass, this change could be gated based on basic block count > > > > or similar. Finally, I should point out that this a regression fix; > > > > GCC 4.8 generated optimal code with two moves (whereas GCC 12 required > > > > 5 moves, up from GCC 11's 4 moves). > > > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > > > and make -k check, both with and without --target_board=unix{-m32} > > with > > > > no new failures. Thoughts? Ok for mainline? > > Can you instead refactor the code to have > > auto_vec<int> worklist1, worklist2; > vec<int> *worklist = &worklist1; > > and alternate between worklist1 and worklist2 N times (const int N = 2 is OK > for > now I guess, OTOH for -O1 we might want to stick to 1). > > it might be tempting to maintain the number of times we visited a block and > just > directly iterate on changed BBs. There's also the comment in > cprop_hardreg_bb: > > /* If a block has a single predecessor, that we've already > processed, begin with the value data that was live at > the end of the predecessor block. */ > /* ??? Ought to use more intelligent queuing of blocks. */ > > which suggests we want to improve on its dataflow (like iterating in RPO order > so we can intersect the live value data of predecessors?) That said, calling > df_analyze () between iterations is what makes iterating expensive I guess, as > well as cprop_hardreg_debug. > > Richard. > > > > > > > > > > > 2022-06-02 Roger Sayle <ro...@nextmovesoftware.com> > > > > > > > > gcc/ChangeLog > > > > * regcprop.cc (pass_cprop_hardreg::execute): Perform a third > > > > iteration over each basic block that was updated by the second > > > > iteration. > > > > > > > > > > > > Thanks in advance, > > > > Roger > > > > -- > > > > > >
diff --git a/gcc/regcprop.cc b/gcc/regcprop.cc index 1fdc367..eacc59f 100644 --- a/gcc/regcprop.cc +++ b/gcc/regcprop.cc @@ -1383,7 +1383,9 @@ pass_cprop_hardreg::execute (function *fun) auto_sbitmap visited (last_basic_block_for_fn (fun)); bitmap_clear (visited); - auto_vec<int> worklist; + auto_vec<int> worklist1, worklist2; + auto_vec<int> *curr = &worklist1; + auto_vec<int> *next = &worklist2; bool any_debug_changes = false; /* We need accurate notes. Earlier passes such as if-conversion may @@ -1404,7 +1406,7 @@ pass_cprop_hardreg::execute (function *fun) FOR_EACH_BB_FN (bb, fun) { if (cprop_hardreg_bb (bb, all_vd, visited)) - worklist.safe_push (bb->index); + curr->safe_push (bb->index); if (all_vd[bb->index].n_debug_insn_changes) any_debug_changes = true; } @@ -1416,16 +1418,22 @@ pass_cprop_hardreg::execute (function *fun) if (MAY_HAVE_DEBUG_BIND_INSNS && any_debug_changes) cprop_hardreg_debug (fun, all_vd); - /* Second pass if we've changed anything, only for the bbs where we have - changed anything though. */ - if (!worklist.is_empty ()) + /* Repeat pass up to PASSES times, but only processing basic blocks + that have changed on the previous iteration. CURR points to the + current worklist, and each iteration populates the NEXT worklist, + swapping pointers after each cycle. */ + + unsigned int passes = optimize > 1 ? 3 : 2; + for (unsigned int pass = 2; pass <= passes && !curr->is_empty (); pass++) { any_debug_changes = false; bitmap_clear (visited); - for (int index : worklist) + next->truncate (0); + for (int index : *curr) { bb = BASIC_BLOCK_FOR_FN (fun, index); - cprop_hardreg_bb (bb, all_vd, visited); + if (cprop_hardreg_bb (bb, all_vd, visited)) + next->safe_push (bb->index); if (all_vd[bb->index].n_debug_insn_changes) any_debug_changes = true; } @@ -1433,6 +1441,7 @@ pass_cprop_hardreg::execute (function *fun) df_analyze (); if (MAY_HAVE_DEBUG_BIND_INSNS && any_debug_changes) cprop_hardreg_debug (fun, all_vd); + std::swap (curr, next); } free (all_vd);