On Wed, Sep 29, 2010 at 2:16 PM, Bingfeng Mei <b...@broadcom.com> wrote:
> Hello,
> I have been examining a significant performance regression
> between 4.5 and 4.4 in our port. I found that Partial Redundancy
> Elimination introduced in 4.5 causes the issue. The following
> pseudo code explains the problem:
>
> BB 3:
> r118 <- r114 + 2
>
> BB 4:
> R114 <- r114 + 2
> ...
> Conditional jump to BB 4
>
> After PRE
>
> BB 3:
> r123 <- r114 + 2
> r118 <- r123
>
> BB 4:
> r114 <- r123
> conditional jump to BB 5
>
> BB5:
> r123 <- r114 + 2
> jump to BB 4
>
>
> A simple loop (BB 4) is divided into two basic blocks (BB 4 & 5).
> An extra jump instruction is introduced. On some targets, this
> jump can be removed by bb-reorder pass. On our target, it cannot
> be reordered due to complex doloop_end pattern we generate later.
> Additionally, since bb-reorder is done in very late phase, the code
> miss some optimization opportunity such as auto_inc_dec. I don't
> see any benefit here to do PRE like this. Maybe we should exclude
> such case in the first place? I read the relevant text in
> "Advanced Compiler Design Implementation", the example used is linear
> CFG and it doesn't mention how to handle loop case.

PRE basically sinks the computation into the latch block (possibly
creating that). Note that without a testcase it's hard to tell
whether this is ok in general. PRE tries to avoid generation of
new induction variables and cross-iteration data-dependences,
see insert_into_preds_of_block.

Note that 4.4 in principle performs the same optimization (you
might figure that PRE in 4.4 is generally disabled for -Os but
enabled in 4.5, but only for hot execution traces following existing
practice to tune code-size/performance on a fine-grained basis).

Richard.

> Any suggestion is greatly appreciated.
>
> Thanks,
> Bingfeng Mei
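
For reference, the quoted before/after pseudo code can be mimicked at the C level roughly as follows. This is only a sketch, not a testcase from the port: the trip counter `n` stands in for the elided "..." and the loop exit condition, and the function names are hypothetical. The gotos in the second function mirror BB 4 and the new latch block BB 5.

```c
#include <assert.h>

/* Before PRE: r114 + 2 is computed in BB 3 and again inside the loop.
   The counter n is an assumption; the original elides the loop condition. */
static int before_pre(int r114, int n, int *r118_out)
{
    int r118 = r114 + 2;        /* BB 3 */
    do {
        r114 = r114 + 2;        /* BB 4 */
    } while (--n > 0);          /* conditional jump to BB 4 */
    *r118_out = r118;
    return r114;
}

/* After PRE: the expression is computed once in BB 3 and recomputed
   in a separate latch block (BB 5), which costs an extra jump. */
static int after_pre(int r114, int n, int *r118_out)
{
    int r123 = r114 + 2;        /* BB 3 */
    int r118 = r123;
bb4:
    r114 = r123;                /* BB 4 */
    if (--n > 0)
        goto bb5;               /* conditional jump to BB 5 */
    *r118_out = r118;
    return r114;
bb5:
    r123 = r114 + 2;            /* BB 5: latch block created by PRE */
    goto bb4;                   /* the extra jump back to BB 4 */
}

/* Hypothetical helper: checks that both shapes compute the same values. */
static void check_equivalent(int start, int n)
{
    int a118, b118;
    int a = before_pre(start, n, &a118);
    int b = after_pre(start, n, &b118);
    assert(a == b);
    assert(a118 == b118);
}
```

Both functions compute the same values; the only difference is the control-flow shape, which is what the extra jump complaint is about.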