On 02/05/2016 04:25 AM, Richard Biener wrote:
The following patch fixes the performance regression for 435.gromacs
on x86_64 with AVX2 (Haswell or bdver2) caused by
2015-12-18 Andreas Krebbel <kreb...@linux.vnet.ibm.com>
* ira.c (ira_setup_alts): Move the scan for commutative modifier
to the first loop to make it work even with disabled alternatives.
which in itself is a desirable change giving the RA more freedom.
It turns out the fix makes an existing issue more severe: it detects
more swappable alternatives and can thus exit ira_setup_alts with
operands swapped in recog_data. This seems to give a slight preference
to choosing alternatives with the operands swapped (I didn't try to
investigate how IRA handles the "merged" alternative mask and
operand swapping in its further processing).
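To make the failure mode concrete, here is a minimal C++ sketch (a toy
stand-in for recog_data.operand and the alternative scan, not actual GCC
code) of the pre-patch loop shape; the second pass breaks out before the
commutative pair is swapped back, so the caller sees a changed order:

  #include <algorithm>
  #include <cstdio>

  /* Toy version of the pre-patch scan: examine both operand orders,
     but exit the loop while the pair is still swapped.  */
  static void
  scan_alternatives_pre_patch (int *ops, int commutative)
  {
    for (bool curr_swapped = false;; curr_swapped = true)
      {
        /* ... examine the alternatives for the current order ...  */
        if (curr_swapped)
          break;  /* Exits here with ops left in swapped order.  */
        std::swap (ops[commutative], ops[commutative + 1]);
      }
  }

  int
  main ()
  {
    int ops[2] = { 10, 20 };
    scan_alternatives_pre_patch (ops, 0);
    printf ("%d %d\n", ops[0], ops[1]);  /* Prints "20 10".  */
    return 0;
  }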
The alternative mask excludes alternatives which will definitely be
rejected in LRA. This approach speeds up LRA (a lot was done to speed
up RA, but it still consumes a big chunk of compiler time, as is usual
for all compilers).
LRA and reload prefer insns without a commutative operand swap when
all other costs are the same.
Of course earlier RTL optimizers and canonicalization rules, as well
as backend patterns, are tuned towards the non-swapped variant, and
thus it happens that doing more swaps ends up in slower code (I didn't
investigate closely).
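A tiny sketch of that tie-breaking rule (hypothetical cost function,
not GCC code): try both orders of the commutative pair and keep the
original order unless swapping is strictly cheaper:

  #include <algorithm>
  #include <cstdio>

  /* Hypothetical cost of an operand order; here both orders tie.  */
  static int
  order_cost (int, int)
  {
    return 3;
  }

  static void
  choose_order (int &op0, int &op1)
  {
    int cost_orig = order_cost (op0, op1);
    std::swap (op0, op1);
    int cost_swapped = order_cost (op0, op1);
    /* On a tie, undo the swap: the non-swapped variant wins.  */
    if (cost_swapped >= cost_orig)
      std::swap (op0, op1);
  }

  int
  main ()
  {
    int a = 1, b = 2;
    choose_order (a, b);
    printf ("%d %d\n", a, b);  /* Prints "1 2": the tie keeps the
                                  original order.  */
    return 0;
  }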
So I tested the following patch which simply makes sure that
ira_setup_alts does not alter recog_data.
On an Intel Haswell machine I get (base is with the patch, peak is
with the above change reverted):
                                  Estimated                      Estimated
                Base     Base       Base       Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio      Ref.   Run Time     Ratio
-------------- ------  ---------  ---------   ------  ---------  ---------
435.gromacs      7140        264       27.1 S   7140        270       26.5 S
435.gromacs      7140        264       27.1 *   7140        269       26.6 S
435.gromacs      7140        263       27.1 S   7140        269       26.5 *
==============================================================================
435.gromacs      7140        264       27.1 *   7140        269       26.5 *
which means the patched result is even better than before Andreas'
change. Current trunk comes in at a run time of 321s (which is
the regression to fix).
Thanks for working on this, Richard. It is not easy to find reasons
for worse code on modern processors after such a small change. As RA
is based on heuristics, it is hard to predict the change for a specific
benchmark. I remember I checked Andreas' patch on SPEC2000 in the hope
that it would also improve x86-64 code, but I did not see a difference.
It is sometimes even hard to say how a specific (non-heuristic)
optimization will affect a specific benchmark's performance when a lot
of unknowns (from alignments to CPU internals) are involved. A year
ago I tried to use ML to choose the best options. I used a set of
about 100 C benchmarks (and even more functions). For practically
every benchmark I had an option modification to -Ofast resulting in
faster code, but ML prediction did not work at all.
Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
OK. Thanks again.
2016-02-05 Richard Biener <rguent...@suse.de>
PR rtl-optimization/69274
* ira.c (ira_setup_alts): Do not change recog_data.operand
order.
Index: gcc/ira.c
===================================================================
--- gcc/ira.c (revision 231814)
+++ gcc/ira.c (working copy)
@@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
}
if (commutative < 0)
break;
- if (curr_swapped)
- break;
+ /* Swap forth and back to avoid changing recog_data. */
std::swap (recog_data.operand[commutative],
recog_data.operand[commutative + 1]);
+ if (curr_swapped)
+ break;
}
}
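Applied to the toy example from further above, the patched loop shape
swaps forth and back so the array is unchanged on every exit path; a
minimal sketch under the same assumptions:

  #include <algorithm>
  #include <cassert>

  /* Toy version of the patched scan: both orders are examined, but
     the pair is always restored before the loop can exit.  */
  static void
  scan_alternatives_patched (int *ops, int commutative)
  {
    for (bool curr_swapped = false;; curr_swapped = true)
      {
        /* ... examine the alternatives for the current order ...  */
        /* Swap forth and back so ops is unchanged on exit.  */
        std::swap (ops[commutative], ops[commutative + 1]);
        if (curr_swapped)
          break;
      }
  }

  int
  main ()
  {
    int ops[2] = { 10, 20 };
    scan_alternatives_patched (ops, 0);
    assert (ops[0] == 10 && ops[1] == 20);  /* Original order kept.  */
    return 0;
  }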