On Fri, 5 Feb 2016, Richard Biener wrote:

> 
> The following patch fixes the performance regression for 435.gromacs
> on x86_64 with AVX2 (Haswell or bdver2) caused by
> 
> 2015-12-18  Andreas Krebbel  <kreb...@linux.vnet.ibm.com>
> 
>       * ira.c (ira_setup_alts): Move the scan for commutative modifier
>       to the first loop to make it work even with disabled alternatives.
> 
> which in itself is a desirable change giving the RA more freedom.
> 
> It turns out the fix makes an existing issue more severe in detecting
> more swappable alternatives and thus exiting ira_setup_alts with
> operands swapped in recog_data.  This seems to give a slight preference
> to choose alternatives with the operands swapped (I didn't try to
> investigate how IRA handles the "merged" alternative mask and
> operand swapping in its further processing).
> 
> Of course previous RTL optimizers and canonicalization rules as well
> as backend patterns are tuned towards the not swapped variant and thus
> it happens doing more swaps ends up in slower code (I didn't closely
> investigate).
> 
> So I tested the following patch which simply makes sure that
> ira_setup_alts does not alter recog_data.
> 
> On a Intel Haswell machine I get (base is with the patch, peak is with
> the above change reverted):
> 
>                                   Estimated                       
> Estimated
>                 Base     Base       Base        Peak     Peak       Peak
> Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
> -------------- ------  ---------  ---------    ------  ---------  
> ---------
> 435.gromacs      7140        264       27.1 S    7140        270       
> 26.5 S
> 435.gromacs      7140        264       27.1 *    7140        269       
> 26.6 S
> 435.gromacs      7140        263       27.1 S    7140        269       
> 26.5 *
> ==============================================================================
> 435.gromacs      7140        264       27.1 *    7140        269       
> 26.5 *
> 
> which means the patched result is even better than before Andreas
> change.  Current trunk homes in at a Run Time of 321s (which is
> the regression to fix).
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?

It fails

FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1

on i?86 (or x86_64 -m32) though, generating

f:
.LFB0:
        .cfi_startproc
        movl    4(%esp), %eax
        leal    1(%eax), %edx
        movsbl  a+1(%eax), %eax
        movsbl  b(%edx), %edx
        addl    %edx, %eax
        ret

before IRA we have

(insn 6 3 8 2 (parallel [
            (set (reg:SI 87 [ _2 ])
                (plus:SI (reg/v:SI 93 [ i ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
218 {*addsi_1}
     (expr_list:REG_DEAD (reg/v:SI 93 [ i ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 8 6 10 2 (set (reg:SI 96)
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ])
                    (symbol_ref:SI ("a") [flags 0x2]  <var_decl 
0x7fe25bdf0cf0 a>)) [0 a S1 A8]))) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
151 {extendqisi2}
     (nil))
(insn 10 8 11 2 (set (reg:SI 98)
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ])
                    (symbol_ref:SI ("b") [flags 0x2]  <var_decl 
0x7fe25bdf0d80 b>)) [0 b S1 A8]))) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
151 {extendqisi2}
     (expr_list:REG_DEAD (reg:SI 87 [ _2 ])

where I wonder if we can use a single insn why we didn't combine/forwprop
to this before RA.  Probably non-single-use issue (though I thought
forwprop doesn't have this limitation).

Without the patch it is postreload that manages to combine them.
After the patch we allocate dx so that

(insn 10 8 11 2 (set (reg:SI 1 dx [98])
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 1 dx [orig:87 _2 ] 
[87])
                    (symbol_ref:SI ("b") [flags 0x2]  <var_decl 
0x7f02a1bd9d80 b>)) [0 b S1 A8]))) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
151 {extendqisi2}
     (nil))

which seems to confuse that transform enough to not carry it out
(reload_combine).  Hopefully somebody else can help here in a more
efficient way than I could do.

IMHO it shouldn't block the patch if that looks good itself (I think
it's sound).  I'll have a look at the reload_combine issue early
next week if nobody beats me to it.  PR69689.

Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
show any regressions.

Richard.

> Thanks,
> Richard.
> 
> 2016-02-05   Richard Biener  <rguent...@suse.de>
> 
>       PR rtl-optimization/69274
>       * ira.c (ira_setup_alts): Do not change recog_data.operand
>       order.
> 
> Index: gcc/ira.c
> ===================================================================
> --- gcc/ira.c (revision 231814)
> +++ gcc/ira.c (working copy)
> @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
>       }
>        if (commutative < 0)
>       break;
> -      if (curr_swapped)
> -     break;
> +      /* Swap forth and back to avoid changing recog_data.  */
>        std::swap (recog_data.operand[commutative],
>                recog_data.operand[commutative + 1]);
> +      if (curr_swapped)
> +     break;
>      }
>  }
>  
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)

Reply via email to