Re: GCC missing -flto optimizations? SPEC lbm benchmark

2019-02-15 Thread Jun Ma
Bin.Cheng wrote on Fri, Feb 15, 2019 at 5:12 PM:

> On Fri, Feb 15, 2019 at 3:30 AM Steve Ellcey  wrote:
> >
> > I have a question about SPEC CPU 2017 and what GCC can and cannot do
> > with -flto.  As part of some SPEC analysis I am doing I found that with
> > -Ofast, ICC and GCC were not that far apart (especially on SPECint
> > rate; the difference on SPECfp rate was slightly larger).
> >
> > But when I added -ipo to the ICC command and -flto to the GCC command,
> > the difference got larger.  In particular the 519.lbm_r was more than
> > twice as fast with ICC and -ipo, but -flto did not help GCC at all.
> >
> > There are other tests that also show this type of improvement with -ipo
> > like 538.imagick_r, 544.nab_r, 525.x264_r, 531.deepsjeng_r, and
> > 548.exchange2_r, but none are as dramatic as 519.lbm_r.  Anyone have
> > any idea on what ICC is doing that GCC is missing?  Is GCC just not
> > aggressive enough with its inlining?
>
> IIRC Jun did some investigation before? CCing.
>
> Thanks,
> bin
> >
> > Steve Ellcey
> > sell...@marvell.com

ICC does much more than GCC under IPO, especially memory layout
optimizations. See https://software.intel.com/en-us/node/522667.
ICC is more aggressive about array transposition, structure splitting,
and field reordering. However, these optimizations were removed
from GCC a long time ago.
As for lbm_r, IIRC the most time-consuming part is a loop whose memory
accesses have a stride of 20.  Under IPO, ICC will optimize the layout
of the array (or maybe the structure?) and vectorize the loop.
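
To make the stride-20 point concrete, here is a minimal sketch of the two
layouts.  It is not taken from the lbm_r sources: the names, the cell
count, and the assumption that each cell holds 20 doubles are only for
illustration.  The first loop walks one field across all cells with a
stride of 20; the second reads the same values from a transposed array
with unit stride, which is the shape a vectorizer prefers.

```c
#include <stdlib.h>

#define ENTRIES_PER_CELL 20   /* assumed: matches the stride of 20 mentioned above */
#define NUM_CELLS 1000        /* illustrative size only */

/* Interleaved layout: field k of cell i is grid[i * 20 + k],
   so a loop over cells for a fixed k has stride 20. */
static double sum_interleaved(const double *grid, int k)
{
    double s = 0.0;
    for (int i = 0; i < NUM_CELLS; i++)
        s += grid[i * ENTRIES_PER_CELL + k];
    return s;
}

/* Transposed layout: field k of cell i is grid[k * NUM_CELLS + i],
   so the same loop has unit stride. */
static double sum_transposed(const double *grid, int k)
{
    double s = 0.0;
    for (int i = 0; i < NUM_CELLS; i++)
        s += grid[k * NUM_CELLS + i];
    return s;
}

/* Fills both layouts with the same values and checks that the two
   loops agree.  Returns 1 on agreement, 0 on mismatch, -1 on
   allocation failure. */
static int layouts_agree(int k)
{
    size_t n = (size_t)NUM_CELLS * ENTRIES_PER_CELL;
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    int ok = -1;

    if (a && b) {
        for (int i = 0; i < NUM_CELLS; i++) {
            for (int f = 0; f < ENTRIES_PER_CELL; f++) {
                double v = (double)(i % 7) + f;
                a[i * ENTRIES_PER_CELL + f] = v;
                b[f * NUM_CELLS + i] = v;
            }
        }
        ok = (sum_interleaved(a, k) == sum_transposed(b, k));
    }
    free(a);
    free(b);
    return ok;
}
```

Doing this transposition by hand changes every access to the array
throughout the program, which is why a compiler can only attempt it
when it sees the whole program.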

Thanks
Jun


Re: [EXT] Re: GCC missing -flto optimizations? SPEC lbm benchmark

2019-02-16 Thread Jun Ma
Steve Ellcey wrote on Sat, Feb 16, 2019 at 1:53 AM:

> On Fri, 2019-02-15 at 17:48 +0800, Jun Ma wrote:
> >
> > ICC does much more than GCC under IPO, especially memory layout
> > optimizations. See https://software.intel.com/en-us/node/522667.
> > ICC is more aggressive about array transposition, structure splitting,
> > and field reordering. However, these optimizations were removed
> > from GCC a long time ago.
> > As for lbm_r, IIRC the most time-consuming part is a loop whose memory
> > accesses have a stride of 20.  Under IPO, ICC will optimize the layout
> > of the array (or maybe the structure?) and vectorize the loop.
> >
> > Thanks
> > Jun
>
> Interesting.  I tried using '-qno-opt-mem-layout-trans' on ICC
> along with '-Ofast -ipo' and that had no effect on the speed.  I also
> tried '-no-vec' and that had no effect either.  The only thing that
> slowed down ICC was '-ip-no-inlining' or '-fno-inline'.  I see that
> '-Ofast -ipo' resulted in everything (except libc functions) getting
> inlined into the main program when using ICC.  GCC did not do that, but
> if I forced it to by using the always_inline attribute, GCC could
> inline everything into main the way ICC does.  But that did not speed
> up the GCC executable.
>
> Steve Ellcey
> sell...@marvell.com

You can use '-qopt-report' to see which optimizations have been applied
by ICC.
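
For reference, the forced-inlining experiment described in the quoted
message can be sketched as follows.  This is only an illustration of
GCC's always_inline attribute; the function names and the relaxation
formula are hypothetical and are not taken from the benchmark.

```c
/* Hypothetical helper: always_inline asks GCC to inline it at every
   call site regardless of its usual inlining heuristics. */
static inline __attribute__((always_inline))
double relax(double f, double feq, double omega)
{
    return (1.0 - omega) * f + omega * feq;
}

/* Hypothetical caller: the body of relax() ends up inside this loop.
   Updates the cells in place and returns the sum of the new values. */
double relax_step(double *cells, int n, double feq, double omega)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        cells[i] = relax(cells[i], feq, omega);
        sum += cells[i];
    }
    return sum;
}
```

Inlining removes the call overhead but leaves the memory access pattern
of the loop unchanged, which would be consistent with the observation
that forcing it did not speed up the GCC executable.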

Thanks
Jun