Here is my benchmarking of the current LTO branch on a 2.66GHz Core2
under RHEL 5 in 64- and 32-bit modes.  Vortex violates the type
aliasing rules, so it should be compiled with
-fno-strict-aliasing.  Perlbmk crashed in tree.c::build2_stat in
32-bit mode when LTO was used.  LTO currently generates wrong code for
176.gcc.  I've also checked the SPECfp2000 benchmarks written in C.
In brief,

 o the code size (text segment) with LTO is smaller (by 2.7% and
   2.4% for SPECint2000 and by 0.16% and 0.6% for SPECfp2000, in 64-
   and 32-bit mode respectively).  That is very promising.
 o compilation is about 2 times slower with LTO.
 o the generated code is 3.6% and 2.2% slower for SPECint2000 and
   SPECfp2000 respectively in 64-bit mode.  It is also 6.7% slower for
   SPECint2000 in 32-bit mode.  But the SPECfp2000 code generated with
   LTO in 32-bit mode is 20% faster!  That is because art is about 2.2
   times faster with LTO.

More details can be found below.

--------------------------64-bit mode----------------------------
base: -O2 -mtune=generic
peak: -O2 -mtune=generic -flto

                     base           peak
164.gzip              1363*          1340*
175.vpr               1600*          1571*
176.gcc                   X              X
181.mcf               1658*          1531*
186.crafty            2576*          2569*
197.parser            1269*          1158*
252.eon                   X              X
253.perlbmk           2546*          2373*
254.gap               1987*          1965*
255.vortex            2259*          2208*
256.bzip2             1874*          1721*
300.twolf             2548*          2627*
SPECint2000 mean      1910           1841    -3.6%
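
The mean row can be cross-checked from the per-benchmark numbers.  The
sketch below assumes the mean is a geometric mean of the listed ratios
(the way SPEC CPU2000 summary scores are formed) and that only the ten
benchmarks that ran are included.  The helper names geometric_mean and
percent_change are illustrative and not part of any SPEC tool.

```python
import math

# SPECint2000 ratios copied from the 64-bit table above
# (176.gcc and 252.eon are left out because they did not run).
base = [1363, 1600, 1658, 2576, 1269, 2546, 1987, 2259, 1874, 2548]
peak = [1340, 1571, 1531, 2569, 1158, 2373, 1965, 2208, 1721, 2627]


def geometric_mean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def percent_change(old, new):
    """Relative change of new versus old, in percent."""
    return (new / old - 1.0) * 100.0


base_mean = geometric_mean(base)
peak_mean = geometric_mean(peak)
print(f"base {base_mean:.0f}  peak {peak_mean:.0f}  "
      f"{percent_change(base_mean, peak_mean):+.1f}%")
```

These are SPEC ratios, so a higher number is better and a negative
change means the LTO build is slower.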

Compilation time of SPECInt2000 (except for eon and gcc):
base: 65.02user 6.25system 1:15.41elapsed 94%CPU
peak: 130.62user 9.68system 2:45.20elapsed 84%CPU
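
The "2 times slower" figure follows from these two lines.  A small
sketch of the arithmetic, assuming the lines are in the default output
format of GNU time as shown here; parse_time is a hypothetical helper
written for this example.

```python
import re

# Timing lines for the two SPECint2000 builds, as listed above.
base_line = "65.02user 6.25system 1:15.41elapsed 94%CPU"
peak_line = "130.62user 9.68system 2:45.20elapsed 84%CPU"

# Elapsed time is m:ss.ss, with an optional leading hours field.
TIME_RE = re.compile(
    r"([\d.]+)user ([\d.]+)system (?:(\d+):)?(\d+):([\d.]+)elapsed")


def parse_time(line):
    """Return (user, system, elapsed) in seconds."""
    match = TIME_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized time output: {line!r}")
    user, system, hours, minutes, seconds = match.groups()
    elapsed = int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)
    return float(user), float(system), elapsed


base_user, _, base_elapsed = parse_time(base_line)
peak_user, _, peak_elapsed = parse_time(peak_line)
user_ratio = peak_user / base_user
elapsed_ratio = peak_elapsed / base_elapsed
print(f"user time x{user_ratio:.2f}, elapsed time x{elapsed_ratio:.2f}")
```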

                   base        peak
168.wupwise            X           X
171.swim               X           X
172.mgrid              X           X
173.applu              X           X
177.mesa           2426*       2314*
178.galgel             X           X
179.art            6276*       5519*
183.equake         1826*       1808*
187.facerec            X           X
188.ammp           1770*       1666*
189.lucas              X           X
191.fma3d              X           X
200.sixtrack           X           X
301.apsi               X           X
SPECfp_base2000     2649        2491    -2.2%

Compilation time of SPECFp2000 (only mesa, art, equake, and ammp):
base: 17.32user 1.74system 0:20.42elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
peak: 35.52user 2.88system 0:42.86elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k

text segment:
----------------CINT2000-----------------
-6.144%          38962          36568 164.gzip
-3.500%         147426         142266 175.vpr
-4.313%          12613          12069 181.mcf
-2.544%         172319         167935 186.crafty
-5.566%         108797         102741 197.parser
-5.436%         575443         544160 253.perlbmk
-5.214%         494375         468599 254.gap
-5.617%         556589         525325 255.vortex
-3.209%          32532          31488 256.bzip2
1.132%         198639         200887 300.twolf
Average = -2.69418%
----------------CFP2000-----------------
-5.093%         522117         495526 177.mesa
2.542%          16362          16778 179.art
2.745%          19778          20321 183.equake
-2.919%         142532         138372 188.ammp
Average = -0.160212%
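
For reference, the per-benchmark percentages are (peak - base) / base.
The sketch below recomputes them for the CINT2000 rows and shows two
common ways to summarize them: the unweighted mean of the percentages
and the change in total text size.  How the Average lines were derived
is not shown here, so the figures from this sketch need not match
them.  The variable and function names are illustrative.

```python
# (benchmark, base text bytes, LTO text bytes) from the 64-bit
# CINT2000 rows above.
sizes = [
    ("164.gzip", 38962, 36568),
    ("175.vpr", 147426, 142266),
    ("181.mcf", 12613, 12069),
    ("186.crafty", 172319, 167935),
    ("197.parser", 108797, 102741),
    ("253.perlbmk", 575443, 544160),
    ("254.gap", 494375, 468599),
    ("255.vortex", 556589, 525325),
    ("256.bzip2", 32532, 31488),
    ("300.twolf", 198639, 200887),
]


def percent_change(old, new):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


deltas = {name: percent_change(b, p) for name, b, p in sizes}

# Unweighted mean: every benchmark counts equally.
mean_delta = sum(deltas.values()) / len(deltas)

# Aggregate: large binaries dominate the result.
total_base = sum(b for _, b, _ in sizes)
total_peak = sum(p for _, _, p in sizes)
total_delta = percent_change(total_base, total_peak)

for name, delta in deltas.items():
    print(f"{delta:+7.3f}%  {name}")
print(f"mean of deltas {mean_delta:+.2f}%, total size {total_delta:+.2f}%")
```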

--------------------------32-bit mode----------------------------
base: -m32 -O2 -mtune=generic
peak: -m32 -O2 -mtune=generic -flto

                base        peak
164.gzip         1261*        1125*
175.vpr          1603*        1483*
176.gcc              X            X
181.mcf          3057*        2801*
186.crafty       1764*        1691*
197.parser       1397*        1224*
252.eon              X            X
253.perlbmk          X            X
254.gap          1981*        1778*
255.vortex       2013*        1914*
256.bzip2        1666*        1580*
300.twolf        2376*        2484*
SPECint2000 mean 1839         1716  -6.7%

Compilation time of SPECInt2000 (except for eon, gcc, and perlbmk):
base: 49.36user 5.13system 0:58.57elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
peak: 99.32user 7.90system 1:56.63elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k

                     base        peak
168.wupwise              X           X
171.swim                 X           X
172.mgrid                X           X
173.applu                X           X
177.mesa             1362*       1325*
178.galgel               X           X
179.art              2786*       6197*
183.equake           1784*       1772*
187.facerec              X           X
188.ammp             1144*       1102*
189.lucas                X           X
191.fma3d                X           X
200.sixtrack             X           X
301.apsi                 X           X
SPECfp2000 mean       1668        2001  +20%
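
How much of the +20% comes from art alone can be checked by
recomputing the mean with and without it.  This is a sketch that
assumes the mean row is a geometric mean over the four C benchmarks
that ran; mean_change is a hypothetical helper for this example.

```python
import math

# SPECfp2000 ratios from the 32-bit table above (C benchmarks only).
base = {"177.mesa": 1362, "179.art": 2786,
        "183.equake": 1784, "188.ammp": 1144}
peak = {"177.mesa": 1325, "179.art": 6197,
        "183.equake": 1772, "188.ammp": 1102}


def geometric_mean(values):
    """Geometric mean, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mean_change(names):
    """Percent change of the peak geometric mean over the base one."""
    b = geometric_mean(base[n] for n in names)
    p = geometric_mean(peak[n] for n in names)
    return (p / b - 1.0) * 100.0


all_names = list(base)
without_art = [n for n in all_names if n != "179.art"]
art_speedup = peak["179.art"] / base["179.art"]

print(f"art alone:       x{art_speedup:.2f}")
print(f"all four:        {mean_change(all_names):+.1f}%")
print(f"art left out:    {mean_change(without_art):+.1f}%")
```

With art left out, the remaining three benchmarks are each slightly
slower with LTO, so the sign of the mean flips.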

Compilation time of SPECFp2000 (only mesa, art, equake, and ammp):
base: 17.88user 1.85system 0:21.17elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
peak: 36.76user 2.83system 0:43.81elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k

text segment:
----------------CINT2000-----------------
-5.936%          35005          32927 164.gzip
-5.125%         137683         130627 175.vpr
-3.739%          10270           9886 181.mcf
-1.379%         195472         192776 186.crafty
-5.192%          94770          89850 197.parser
-5.436%         575443         544160 253.perlbmk
-4.400%         449316         429544 254.gap
-2.219%         564982         552446 255.vortex
-2.884%          30515          29635 256.bzip2
0.167%         193748         194072 300.twolf
Average = -2.40954%
----------------CFP2000-----------------
-5.796%         499738         470775 177.mesa
0.458%          13971          14035 179.art
0.303%          17467          17520 183.equake
-5.176%         111429         105661 188.ammp
Average = -0.600618%


Nathan Froyd wrote:

In one of my recent messages about a patch to the LTO branch, I
mentioned that we could compile and successfully run all of the C
SPECint benchmarks except 176.gcc.  Chris Lattner asked if I had done
any benchmarking now that real programs could be run; I said that I
hadn't but would try to do some soon.  This is the result of that.

I don't have numbers on what compile times look like, but I don't think
they're good.  176.gcc takes several minutes to compile (basically -flto
*.o, not counting the time to compile individual .o files); the other
benchmarks are all a minute or more apiece.

Executive summary: LTO is currently *not* a win.

In the table below, runtimes are in seconds.  I ran the tests on an
8-core 1.6GHz machine with 8 GB RAM.  I believe the machine was
relatively idle; I ran the tests over a weekend evening.  The last merge
from mainline to the LTO branch was mainline r130155, so that's about
what the -O2 numbers correspond to--I don't think we've changed too much
core code on the branch.  The % change figures are just in-my-head estimates,
using -O2 as a baseline.

                -O2     -flto   % change
164.gzip        174     176     + 1
175.vpr         139     143     + 3
181.mcf         162     166     + 3
186.crafty      65.2    66.6    + < 1
197.parser      240     261     + 9
253.perlbmk     119     133     + 13
254.gap         84.4    87      + 4
256.bzip2       131     145     + 11
300.twolf       202     193     - 4 (!)
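
Since the % change column is an estimate, the exact values can be
recomputed from the two runtime columns.  A minimal sketch, using -O2
as the baseline as described above; percent_change is an illustrative
helper name.

```python
# (benchmark, seconds at -O2, seconds with -flto) from the table above.
runs = [
    ("164.gzip", 174, 176),
    ("175.vpr", 139, 143),
    ("181.mcf", 162, 166),
    ("186.crafty", 65.2, 66.6),
    ("197.parser", 240, 261),
    ("253.perlbmk", 119, 133),
    ("254.gap", 84.4, 87),
    ("256.bzip2", 131, 145),
    ("300.twolf", 202, 193),
]


def percent_change(old, new):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


for name, o2_seconds, lto_seconds in runs:
    change = percent_change(o2_seconds, lto_seconds)
    print(f"{name:<12} {o2_seconds:>6} {lto_seconds:>6} {change:+6.1f}%")
```

These are runtimes in seconds, so a positive change means the -flto
build is slower.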

176.gcc doesn't run correctly with LTO yet; 255.vortex didn't run
correctly with "mainline", but it did with -flto, which is curious.  We
don't do C++ yet, so 252.eon is not included.

In general, things get worse with LTO, sometimes much worse.  I can
think of at least three possible reasons off the top of my head:

- Alias information.  We don't have any type-based alias information in
 -flto, which hurts.

- We don't merge types between compilation units, which could account
 for poor optimization behavior.

- I believe we lose some information in the LTO write/read process; edge
 probabilities, estimated # instructions in functions, etc. get lost.
 This hurts inlining decisions, block layout, alignment of jump
 targets, etc.  So there's information we need to write out or
 recompute.

-Nathan

