On 02/05/2016 04:25 AM, Richard Biener wrote:
The following patch fixes the performance regression for 435.gromacs
on x86_64 with AVX2 (Haswell or bdver2) caused by

2015-12-18  Andreas Krebbel  <kreb...@linux.vnet.ibm.com>

        * ira.c (ira_setup_alts): Move the scan for commutative modifier
        to the first loop to make it work even with disabled alternatives.

which in itself is a desirable change giving the RA more freedom.

It turns out the fix makes an existing issue more severe in detecting
more swappable alternatives and thus exiting ira_setup_alts with
operands swapped in recog_data.  This seems to give a slight preference
to choose alternatives with the operands swapped (I didn't try to
investigate how IRA handles the "merged" alternative mask and
operand swapping in its further processing).
Alternative mask excludes alternatives which will be definitely rejected in LRA. This approach is to speed up LRA (a lot was done to speed up RA but still it consumes a big chunk of compiler time which is unusual for all compilers).

LRA and reload prefer insns without commutative operands swap when all other costs are the same.
Of course previous RTL optimizers and canonicalization rules as well
as backend patterns are tuned towards the not swapped variant and thus
it happens doing more swaps ends up in slower code (I didn't closely
investigate).

So I tested the following patch which simply makes sure that
ira_setup_alts does not alter recog_data.

On a Intel Haswell machine I get (base is with the patch, peak is with
the above change reverted):

                                   Estimated
Estimated
                 Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------
---------
435.gromacs      7140        264       27.1 S    7140        270
26.5 S
435.gromacs      7140        264       27.1 *    7140        269
26.6 S
435.gromacs      7140        263       27.1 S    7140        269
26.5 *
==============================================================================
435.gromacs      7140        264       27.1 *    7140        269
26.5 *

which means the patched result is even better than before Andreas
change.  Current trunk homes in at a Run Time of 321s (which is
the regression to fix).
Thanks for working on this, Richard. It is not easy to find reasons for worse code on modern processors after such small change. As RA is based on heuristics it hard to predict the change for a specific benchmark. I remember I checked Andreas patch on SPEC2000 in a hope that it also improves x86-64 code but I did not see a difference.

It is even hard to say sometimes how a specific (non-heuristic) optimization will affect a specific benchmark performance when a lot of unknown (from alignments to CPU internals are involved). An year ago I tried to use ML to choose best options. I used a set of about 100 C benchmarks (and even more functions). For practically every benchmark, I had an option modification to -Ofast resulting in faster code but ML prediction did not work at all.
Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?

OK.  Thanks again.
2016-02-05   Richard Biener  <rguent...@suse.de>

        PR rtl-optimization/69274
        * ira.c (ira_setup_alts): Do not change recog_data.operand
        order.

Index: gcc/ira.c
===================================================================
--- gcc/ira.c   (revision 231814)
+++ gcc/ira.c   (working copy)
@@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
        }
        if (commutative < 0)
        break;
-      if (curr_swapped)
-       break;
+      /* Swap forth and back to avoid changing recog_data.  */
        std::swap (recog_data.operand[commutative],
                 recog_data.operand[commutative + 1]);
+      if (curr_swapped)
+       break;
      }
  }

Reply via email to