On Fri, 5 Feb 2016, Richard Biener wrote: > > The following patch fixes the performance regression for 435.gromacs > on x86_64 with AVX2 (Haswell or bdver2) caused by > > 2015-12-18 Andreas Krebbel <kreb...@linux.vnet.ibm.com> > > * ira.c (ira_setup_alts): Move the scan for commutative modifier > to the first loop to make it work even with disabled alternatives. > > which in itself is a desirable change giving the RA more freedom. > > It turns out the fix makes an existing issue more severe in detecting > more swappable alternatives and thus exiting ira_setup_alts with > operands swapped in recog_data. This seems to give a slight preference > to choose alternatives with the operands swapped (I didn't try to > investigate how IRA handles the "merged" alternative mask and > operand swapping in its further processing). > > Of course previous RTL optimizers and canonicalization rules as well > as backend patterns are tuned towards the not swapped variant and thus > it happens doing more swaps ends up in slower code (I didn't closely > investigate). > > So I tested the following patch which simply makes sure that > ira_setup_alts does not alter recog_data. > > On a Intel Haswell machine I get (base is with the patch, peak is with > the above change reverted): > > Estimated > Estimated > Base Base Base Peak Peak Peak > Benchmarks Ref. Run Time Ratio Ref. Run Time Ratio > -------------- ------ --------- --------- ------ --------- > --------- > 435.gromacs 7140 264 27.1 S 7140 270 > 26.5 S > 435.gromacs 7140 264 27.1 * 7140 269 > 26.6 S > 435.gromacs 7140 263 27.1 S 7140 269 > 26.5 * > ============================================================================== > 435.gromacs 7140 264 27.1 * 7140 269 > 26.5 * > > which means the patched result is even better than before Andreas > change. Current trunk homes in at a Run Time of 321s (which is > the regression to fix). > > Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
It fails FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1 on i?86 (or x86_64 -m32) though, generating f: .LFB0: .cfi_startproc movl 4(%esp), %eax leal 1(%eax), %edx movsbl a+1(%eax), %eax movsbl b(%edx), %edx addl %edx, %eax ret before IRA we have (insn 6 3 8 2 (parallel [ (set (reg:SI 87 [ _2 ]) (plus:SI (reg/v:SI 93 [ i ]) (const_int 1 [0x1]))) (clobber (reg:CC 17 flags)) ]) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 218 {*addsi_1} (expr_list:REG_DEAD (reg/v:SI 93 [ i ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil)))) (insn 8 6 10 2 (set (reg:SI 96) (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ]) (symbol_ref:SI ("a") [flags 0x2] <var_decl 0x7fe25bdf0cf0 a>)) [0 a S1 A8]))) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 151 {extendqisi2} (nil)) (insn 10 8 11 2 (set (reg:SI 98) (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ]) (symbol_ref:SI ("b") [flags 0x2] <var_decl 0x7fe25bdf0d80 b>)) [0 b S1 A8]))) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 151 {extendqisi2} (expr_list:REG_DEAD (reg:SI 87 [ _2 ]) where I wonder if we can use a single insn why we didn't combine/forwprop to this before RA. Probably non-single-use issue (though I thought forwprop doesn't have this limitation). Without the patch it is postreload that manages to combine them. After the patch we allocate dx so that (insn 10 8 11 2 (set (reg:SI 1 dx [98]) (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 1 dx [orig:87 _2 ] [87]) (symbol_ref:SI ("b") [flags 0x2] <var_decl 0x7f02a1bd9d80 b>)) [0 b S1 A8]))) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 151 {extendqisi2} (nil)) which seems to confuse that transform enough to not carry it out (reload_combine). Hopefully somebody else can help here in a more efficient way than I could do. IMHO it shouldn't block the patch if that looks good itself (I think it's sound). I'll have a look at the reload_combine issue early next week if nobody beats me to it. PR69689. Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't show any regressions. Richard. > Thanks, > Richard. > > 2016-02-05 Richard Biener <rguent...@suse.de> > > PR rtl-optimization/69274 > * ira.c (ira_setup_alts): Do not change recog_data.operand > order. > > Index: gcc/ira.c > =================================================================== > --- gcc/ira.c (revision 231814) > +++ gcc/ira.c (working copy) > @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG > } > if (commutative < 0) > break; > - if (curr_swapped) > - break; > + /* Swap forth and back to avoid changing recog_data. */ > std::swap (recog_data.operand[commutative], > recog_data.operand[commutative + 1]); > + if (curr_swapped) > + break; > } > } > > -- Richard Biener <rguent...@suse.de> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)