On 10/18/22 15:51, Vineet Gupta wrote:
Where BB4 corresponds to .L2 and BB6 corresponds to .L3. Evaluation
of the constants occurs in BB3 and BB5.
And "evaluation" here means use of the constant (vs. definition)?
In this case, use of the constant.
PRE/GCSE is better suited for this scenario, but it has a critical
constraint. In particular our PRE formulation is never allowed to
put an evaluation of an expression on a path that didn't have one
before. So
while there is clearly a redundancy on the path 2->3->4->5 (BB3 and
BB5), there is nowhere we could put an evaluation that would reduce
the number of evaluations on that path without introducing an
evaluation on paths that didn't have one. So consider 2->4->6. On
that path there are zero evaluations, so we can't place an eval in
BB2: that would introduce an evaluation on 2->4->6, which previously
had none.
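The path argument above can be sketched concretely. This is a toy model, not GCC code; the block numbers and the set of evaluating blocks mirror the example in this thread:

```python
# Toy model of the CFG discussed above (not GCC code): BB2 branches to
# BB3/BB4, BB4 branches to BB5/BB6, and only BB3 and BB5 evaluate the
# constant.
succs = {2: [3, 4], 3: [4], 4: [5, 6], 5: [], 6: []}
evals = {3, 5}  # blocks that evaluate the expression

def paths(block, path=()):
    """Enumerate all root-to-exit paths in the acyclic CFG."""
    path += (block,)
    if not succs[block]:
        yield path
    for s in succs[block]:
        yield from paths(s, path)

for p in paths(2):
    print("->".join(map(str, p)), "evals:", sum(b in evals for b in p))
# 2->3->4->5 evaluates twice (the redundancy), while 2->4->6 evaluates
# zero times, so hoisting an evaluation into BB2 would add one to a
# path that previously had none, which is exactly what PRE forbids.
```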
OK. How does PRE calculate all the possible paths to consider: say,
in your example, 2->3->4->5 and 2->4->6? Are those just indicative, or
are they actually the ones PRE calculates in this case? Would there be
more?
PRE has a series of dataflow equations it solves which gives it the
answer to that question. The one that computes this property is usually
called anticipated. Given some block B in a graph G, an expression is
anticipated at B when the expression is guaranteed to be computed if we
reach B. That doesn't mean the evaluation must happen in B, just that
evaluation at some point is guaranteed if we reach B.
If an expression is not anticipated in a block, then PRE will not
insert in that block, since doing so would add evaluations on paths
that previously had none.
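A minimal sketch of that anticipated property, assuming a single expression and no blocks that kill it (the block numbers again mirror the CFG in this thread; this is an illustration, not GCC's actual implementation):

```python
# Backward dataflow: an expression is anticipated on entry to a block
# iff that block uses it, or every successor anticipates it on entry.
succs = {2: [3, 4], 3: [4], 4: [5, 6], 5: [], 6: []}
used = {3, 5}  # blocks that evaluate the expression

antic_in = {b: False for b in succs}
changed = True
while changed:
    changed = False
    for b in succs:
        # ANTIC_OUT[b]: anticipated on entry to every successor
        # (False at the exit blocks, which have no successors).
        antic_out = bool(succs[b]) and all(antic_in[s] for s in succs[b])
        new = b in used or antic_out
        if new != antic_in[b]:
            antic_in[b] = new
            changed = True

print(antic_in)
# Only BB3 and BB5 anticipate the expression; BB2 does not (the path
# through BB6 never evaluates it), so PRE will not insert in BB2.
```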
There isn't a great place in GCC to handle this right now. If the
constraints were relaxed in PRE, then we'd have a chance, but getting
the cost model right is going to be tough.
It would have been better (for this specific case) if loop unrolling
were not done so early. The tree pass cunroll flattens the loop out
and leaves it to the rest of the tree/RTL passes to pick up the
pieces and remove any redundancies, if they can. It obviously needs
to be early if we are injecting 7x more instructions, but it seems
like a lot to unravel.
Yup. If that loop gets unrolled, it's going to be a mess. It will
almost certainly make this problem worse as each iteration is going to
have a pair of constants loaded and no good way to remove them.
FWIW, -fno-unroll-loops only seems to work at -O2; at -O3 it always
unrolls. Is that expected?
The only case I'm immediately aware of where this wouldn't work would be
if -O3 came after -fno-unroll-loops.
If this seems worthwhile and you have ideas to do this any better, I'd
be happy to work on this with some guidance.
I don't see a great solution here. Something like Cliff Click's work
might help, but it's far from a guarantee. Click's work essentially
throws away the PRE constraint about never inserting an expression
evaluation on a path where it didn't exist, along with all kinds of
other things. Essentially it's a total reformulation of redundancy
elimination.
I did an implementation eons ago in gimple, but was never able to
convince myself the implementation was correct or that integrating it
was a good thing. It's almost certainly going to cause performance
regressions elsewhere, so it may end up doing more harm than good. I
don't really know.
https://courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.pdf
Jeff