https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #38 from prathamesh3492 at gcc dot gnu.org ---
Hi,
The issue can be reproduced exactly with pr77445-2.c; I am testing with
is_digit() made noinline.
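
For reference, this is the kind of change I mean (the attribute is the
point; the body of is_digit() here is illustrative, not copied from the
testcase):

  __attribute__ ((noinline))
  static int
  is_digit (char c)
  {
    return c >= '0' && c <= '9';
  }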

* Reordering SINK before PRE

SPEC2006 data with sink reordered before pre (the passes.def change is
sketched below):
Number of statements sunk: +2677 (~ +14%)
Number of total PRE insertions: -3971 (~ -1%)
On the private embedded benchmark suite, there's overall no significant
difference.
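
The reordering itself was done by swapping the two adjacent entries in
gcc/passes.def, roughly like this (a sketch of the experiment, not a
proposed patch; upstream has pass_pre first):

  NEXT_PASS (pass_sink_code);
  NEXT_PASS (pass_pre);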

Not sure how helpful this is. Is there a way to get information about the
number of registers spilled from the lra dump or the assembly?
I would like to see the effect of reordering the passes on spills.

Reordering sink before pre seems to regress no-scevccp-outer-22.c,
ssa-dom-thread-7.c, and several SVE tests on aarch64:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/262002-sink-pre/aarch64-none-linux-gnu/diff-gcc-rh60-aarch64-none-linux-gnu-default-default-default.txt

Also, there seems to be some interplay between hoisting and forwprop.
Disabling forwprop3 and forwprop4 also seems to eliminate the spill.
However, as Bin pointed out on the list, forwprop also helps to reduce
register pressure for this case by MEM_REF folding
(forward_propagate_addr_expr).
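
To illustrate the kind of folding involved (my own example, not from the
testcase):

  int a[16];

  int
  f (int i)
  {
    int *p = &a[i];  /* the address is propagated into the load below,  */
    return *p;       /* folding it to a[i], so P need not stay live     */
  }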

* Jump threading cost models

It seems the jump threading pass increases the size of this case from 38
to 79 blocks. I wonder if that adds up to a "resource hog", eventually
leading to the extra spill? Disabling the jump threading pass eliminates
the spill.
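
For reference, this is the sort of transform I mean (an illustrative
example, not from the testcase): when the first condition holds, the
second is statically known, so the pass duplicates the intervening
statements to thread a direct path, which grows the CFG.

  int
  g (int x)
  {
    int t = 0;
    if (x > 0)
      t = 1;
    /* ... intervening work, duplicated when threading ... */
    if (t)  /* on the x > 0 path this is known to be true  */
      return 1;
    return 0;
  }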

I looked a bit into fine-tuning the jump threading cost models for
cortex-m7. Strangely, setting max-jump-thread-duplication-stmts to 20 and
fsm-scale-path-stmts to 3 not only removes the spill but also results in 9
more hoistings! I am investigating why this results in improved
performance. However, it regresses ssa-dom-thread-7.c:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/262539-jump-thread-cost-models/aarch64-none-elf/diff-gcc-rh60-aarch64-none-elf-default-default-default.txt
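
For reference, the settings were passed on the command line like this
(assuming an arm-none-eabi cross compiler; the remaining test flags are
elided):

  arm-none-eabi-gcc -O2 --param max-jump-thread-duplication-stmts=20 \
    --param fsm-scale-path-stmts=3 ...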

* Stop-gap measure for hoisting?

As a stop-gap measure, would it make sense to "localize" hoisting within a
"large" loop (based on loop->num_nodes?) by refusing to hoist expressions
computed outside the loop?
My assumption is that hoisting increases the live range of an expression
that was previously computed in a block outside the loop but is brought
inside the loop by hoisting, since we would then need to consider paths
along the loop as well when estimating its live range. I suppose a cheap
way to test that would be to check whether the block's post-dominator also
lies within the same loop, since that would ensure all paths from the
block to EXIT stay inside the loop (see the sketch below).
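
A minimal sketch of that check, assuming the usual cfgloop.h and
dominance.h helpers and that post-dominator info has been computed (the
helper name is mine; this approximates the idea, it is not the actual
patch):

  /* Return true if BLOCK lies inside a loop but its immediate
     post-dominator does not, i.e. some path from BLOCK to EXIT leaves
     BLOCK's loop.  */
  static bool
  block_escapes_loop_p (basic_block block)
  {
    struct loop *l = block->loop_father;
    if (loop_outer (l) == NULL)  /* BLOCK is not in a real loop.  */
      return false;
    basic_block pdom = get_immediate_dominator (CDI_POST_DOMINATORS, block);
    return !flow_bb_inside_loop_p (l, pdom);
  }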
I created a patch for this
(http://people.linaro.org/~prathamesh.kulkarni/pdom.diff), which removes
the spill but regresses pr77445-2.c (which is how I stumbled on that
test). The underlying issue there doesn't seem particularly relevant to
hoisting, though, so I am not sure this "heuristic" makes much sense.

* Live range shrinking pass

There was some discussion on the list about an inter-block live-range
shrinking GIMPLE pass (https://gcc.gnu.org/ml/gcc/2018-05/msg00260.html),
which would run just before expand. I would be grateful for suggestions on
how to get started with it. I realize this would be pretty hard, but I
would like to give it a try.
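
To make the goal concrete, this is the kind of rescheduling I would expect
such a pass to do (my own illustrative example):

  extern void work (void);
  extern int use (int);

  int
  h (int a, int b)
  {
    int t = a * b;   /* T is live across the call to work () ...      */
    work ();
    return use (t);  /* ... unless its definition is moved down here. */
  }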

Thanks,
Prathamesh
