On Thu, Jun 2, 2022 at 12:20 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
> >
> > This middle-end patch proposes the "hard register constant propagation"
> > pass be performed up to three times on each basic block (up from the
> > current two times) if the second pass successfully made changes.
> >
> > The motivation for three passes is to handle the "swap idiom" (i.e.
> > "t = x; x = y; y = t;" sequences) that get generated by register allocation
> > (reload).
> >
> > Consider the x86_64 test case for __int128 addition recently discussed
> > on gcc-patches.  With that proposed patch, the input to the cprop_hardreg
> > pass looks like:
> >
> >         movq    %rdi, %r8
> >         movq    %rsi, %rdi
> >         movq    %r8, %rsi
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %rsi, %rax
> >         adcq    %rdi, %rdx
> >         ret
> >
> > where the first three instructions effectively swap %rsi and %rdi.
> >
> > On the first pass of cprop_hardreg, we notice that the third insn,
> > %rsi := %r8, is redundant and can be eliminated/propagated to produce:
> >
> >         movq    %rdi, %r8
> >         movq    %rsi, %rdi
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %r8, %rax
> >         adcq    %rdi, %rdx
> >         ret
> >
> > Because a successful propagation was found, cprop_hardreg then runs
> > a second pass/sweep on the affected basic blocks (using a worklist), and
> > on this second pass notices that the second instruction, %rdi := %rsi,
> > may now be propagated (%rsi was killed before the first transform),
> > and after a second pass we now end up with:
> >
> >         movq    %rdi, %r8
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %r8, %rax
> >         adcq    %rsi, %rdx
> >         ret
> >
> > which is the current behaviour on mainline.  However, a third and final
> > pass would now notice that the first insn, "%r8 := %rdi", is also now
> > eliminable, and a third iteration would produce optimal code:
> >
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %rdi, %rax
> >         adcq    %rsi, %rdx
> >         ret
> >
> > The patch below creates an additional worklist, third_pass, that is
> > populated with the basic block ids of blocks that were updated during
> > the second pass.  Does the motivation for three passes (reload doesn't
> > generate more or fewer than three instructions to swap a pair of registers)
> > seem reasonable for all targets?  If cprop_hardreg is considered an
> > expensive pass, this change could be gated on basic block count
> > or similar.  Finally, I should point out that this is a regression fix;
> > GCC 4.8 generated optimal code with two moves (whereas GCC 12 required
> > 5 moves, up from GCC 11's 4 moves).
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}, with
> > no new failures.  Thoughts?  Ok for mainline?
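For context, the __int128 addition test case referenced above is not
quoted in this message; presumably it boils down to something like the
following (a guessed reconstruction, not the actual test case from the
earlier thread):

  /* Hypothetical reconstruction of the __int128 addition test case;
     on x86_64 the first argument arrives in %rdi (low) and %rsi (high),
     the second in %rdx and %rcx, and the result is returned in %rax
     and %rdx, which is where the register shuffling shown above
     comes from.  */
  __int128
  add128 (__int128 x, __int128 y)
  {
    return x + y;
  }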
Can you instead refactor the code to have

  auto_vec<int> worklist1, worklist2;
  vec<int> *worklist = &worklist1;

and alternate between worklist1 and worklist2 N times (const int N = 2
is OK for now I guess, OTOH for -O1 we might want to stick to 1)?  It
might be tempting to maintain the number of times we visited a block
and just directly iterate on the changed BBs.

There's also the comment in cprop_hardreg_bb:

  /* If a block has a single predecessor, that we've already processed,
     begin with the value data that was live at the end of the
     predecessor block.  */
  /* ??? Ought to use more intelligent queuing of blocks.  */

which suggests we want to improve on its dataflow (like iterating in
RPO order so we can intersect the live value data of predecessors?).
That said, calling df_analyze () between iterations is what makes
iterating expensive, I guess, as well as cprop_hardreg_debug.

Richard.

>
> >
> > 2022-06-02  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * regcprop.cc (pass_cprop_hardreg::execute): Perform a third
> >         iteration over each basic block that was updated by the second
> >         iteration.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >
>
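For what it's worth, here is a minimal self-contained sketch of the
alternating-worklist shape suggested above; std::vector stands in for
auto_vec/vec, process_block () is a made-up stand-in for
cprop_hardreg_bb (), and none of this is the actual regcprop.cc code.

  #include <utility>
  #include <vector>

  /* Stand-in for the per-block transform: returns true if it changed
     the block.  In the real pass, df_analyze () would also be needed
     between iterations, which is the expensive part noted above.  */
  static bool
  process_block (int bb_index, int pass)
  {
    /* Dummy behaviour: pretend even-numbered blocks change on pass 0.  */
    return pass == 0 && (bb_index & 1) == 0;
  }

  static void
  iterate_cprop (const std::vector<int> &all_blocks, int max_passes)
  {
    std::vector<int> worklist1, worklist2;
    std::vector<int> *curr = &worklist1;  /* blocks to scan this pass  */
    std::vector<int> *next = &worklist2;  /* changed blocks, for next pass  */

    *curr = all_blocks;                   /* first pass visits every block  */

    for (int pass = 0; pass < max_passes && !curr->empty (); pass++)
      {
        next->clear ();
        for (int bb : *curr)
          if (process_block (bb, pass))
            next->push_back (bb);         /* only changed blocks get revisited  */

        /* Alternate the two vectors rather than adding a hard-coded
           third_pass worklist for each extra iteration.  */
        std::swap (curr, next);
      }
  }

  int
  main ()
  {
    std::vector<int> blocks = {0, 1, 2, 3, 4};
    iterate_cprop (blocks, /*max_passes=*/3);  /* N = 3 covers the swap idiom  */
    return 0;
  }

Dropping N to 2 (or to 1 for -O1, as suggested) then only changes a
constant rather than the structure of the loop.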