Here is my benchmarking of the current LTO branch on a 2.66GHz Core2 under RHEL 5, in 64-bit and 32-bit mode. Vortex violates the type aliasing rules and therefore has to be compiled with -fno-strict-aliasing. Perlbmk crashed in tree.c::build2_stat in 32-bit mode when LTO was used, and LTO currently generates wrong code for 176.gcc. I've also checked the SPECfp2000 benchmarks written in C.
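For illustration, here is a minimal sketch of the kind of strict-aliasing violation that makes -fno-strict-aliasing necessary (the names are made up; this is not code from vortex). Under the type-based aliasing rules GCC at -O2 may assume that an int * and a float * never point to the same object:

  #include <stdio.h>

  /* Writing to an int object through a float * is undefined under the
     C aliasing rules, so the compiler may keep the old value of *i.  */
  static int
  alias_violation (int *i, float *f)
  {
    *i = 1;
    *f = 0.0f;   /* at run time this may actually store into *i */
    return *i;   /* with strict aliasing GCC can still return 1 here */
  }

  int
  main (void)
  {
    int x = 0;
    printf ("%d\n", alias_violation (&x, (float *) &x));
    return 0;
  }

With plain -O2 this can print 1; with -O2 -fno-strict-aliasing GCC reloads *i and it prints 0 (the bit pattern of 0.0f).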
In brief:

o The code size (text segment) with LTO is much smaller: 2.7% and 2.4%
  for SPECint and 0.16% and 0.6% for SPECfp in 64-bit and 32-bit mode
  respectively.  That is very promising.

o Compilation is about 2 times slower with LTO.

o The generated code is 3.6% and 2.2% slower for SPECint2000 and
  SPECfp2000 in 64-bit mode, and 6.7% slower for SPECint2000 in 32-bit
  mode.  But the SPECfp2000 code generated with LTO in 32-bit mode is
  20% faster, because art is more than 2 times faster with LTO.

More details can be found below.

--------------------------64-bit mode----------------------------
base: -O2 -mtune=generic
peak: -O2 -mtune=generic -flto

                      base     peak
  164.gzip            1363*    1340*
  175.vpr             1600*    1571*
  176.gcc                X        X
  181.mcf             1658*    1531*
  186.crafty          2576*    2569*
  197.parser          1269*    1158*
  252.eon                X        X
  253.perlbmk         2546*    2373*
  254.gap             1987*    1965*
  255.vortex          2259*    2208*
  256.bzip2           1874*    1721*
  300.twolf           2548*    2627*
  SPECint2000 mean    1910     1841     -3.6%

Compilation time of SPECint2000 (except for eon and gcc):
base: 65.02user 6.25system 1:15.41elapsed 94%CPU
peak: 130.62user 9.68system 2:45.20elapsed 84%CPU

                      base     peak
  168.wupwise            X        X
  171.swim               X        X
  172.mgrid              X        X
  173.applu              X        X
  177.mesa            2426*    2314*
  178.galgel             X        X
  179.art             6276*    5519*
  183.equake          1826*    1808*
  187.facerec            X        X
  188.ammp            1770*    1666*
  189.lucas              X        X
  191.fma3d              X        X
  200.sixtrack           X        X
  301.apsi               X        X
  SPECfp2000 mean     2649     2491     -2.2%

Compilation time of SPECfp2000 (only mesa, art, equake, and ammp):
base: 17.32user 1.74system 0:20.42elapsed 93%CPU
peak: 35.52user 2.88system 0:42.86elapsed 89%CPU

text segment:

----------------CINT2000-----------------
   change      base      peak
  -6.144%     38962     36568   164.gzip
  -3.500%    147426    142266   175.vpr
  -4.313%     12613     12069   181.mcf
  -2.544%    172319    167935   186.crafty
  -5.566%    108797    102741   197.parser
  -5.436%    575443    544160   253.perlbmk
  -5.214%    494375    468599   254.gap
  -5.617%    556589    525325   255.vortex
  -3.209%     32532     31488   256.bzip2
   1.132%    198639    200887   300.twolf
Average = -2.69418%

----------------CFP2000-----------------
   change      base      peak
  -5.093%    522117    495526   177.mesa
   2.542%     16362     16778   179.art
   2.745%     19778     20321   183.equake
  -2.919%    142532    138372   188.ammp
Average = -0.160212%

--------------------------32-bit mode----------------------------
base: -m32 -O2 -mtune=generic
peak: -m32 -O2 -mtune=generic -flto

                      base     peak
  164.gzip            1261*    1125*
  175.vpr             1603*    1483*
  176.gcc                X        X
  181.mcf             3057*    2801*
  186.crafty          1764*    1691*
  197.parser          1397*    1224*
  252.eon                X        X
  253.perlbmk            X        X
  254.gap             1981*    1778*
  255.vortex          2013*    1914*
  256.bzip2           1666*    1580*
  300.twolf           2376*    2484*
  SPECint2000 mean    1839     1716     -6.7%

Compilation time of SPECint2000 (except for eon, gcc, and perlbmk):
base: 49.36user 5.13system 0:58.57elapsed 93%CPU
peak: 99.32user 7.90system 1:56.63elapsed 91%CPU

                      base     peak
  168.wupwise            X        X
  171.swim               X        X
  172.mgrid              X        X
  173.applu              X        X
  177.mesa            1362*    1325*
  178.galgel             X        X
  179.art             2786*    6197*
  183.equake          1784*    1772*
  187.facerec            X        X
  188.ammp            1144*    1102*
  189.lucas              X        X
  191.fma3d              X        X
  200.sixtrack           X        X
  301.apsi               X        X
  SPECfp2000 mean     1668     2001     +20%

Compilation time of SPECfp2000 (only mesa, art, equake, and ammp):
base: 17.88user 1.85system 0:21.17elapsed 93%CPU
peak: 36.76user 2.83system 0:43.81elapsed 90%CPU

text segment:

----------------CINT2000-----------------
   change      base      peak
  -5.936%     35005     32927   164.gzip
  -5.125%    137683    130627   175.vpr
  -3.739%     10270      9886   181.mcf
  -1.379%    195472    192776   186.crafty
  -5.192%     94770     89850   197.parser
  -5.436%    575443    544160   253.perlbmk
  -4.400%    449316    429544   254.gap
  -2.219%    564982    552446   255.vortex
  -2.884%     30515     29635   256.bzip2
   0.167%    193748    194072   300.twolf
Average = -2.40954%

----------------CFP2000-----------------
   change      base      peak
  -5.796%    499738    470775   177.mesa
   0.458%     13971     14035   179.art
   0.303%     17467     17520   183.equake
  -5.176%    111429    105661   188.ammp
Average = -0.600618%
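For reference, the main thing -flto enables over per-file -O2 is optimization across translation units, e.g. cross-module inlining. A minimal sketch (the file and function names are made up, not taken from the benchmarks), built roughly as "gcc -O2 -flto -c add.c main.c" followed by "gcc -O2 -flto add.o main.o":

  /* add.c */
  int
  add_one (int x)
  {
    return x + 1;
  }

  /* main.c: compiled on its own, GCC never sees the body of add_one,
     so the call below stays an out-of-line call.  With -flto the
     intermediate representation is kept in the object files and the
     link-time optimizer can inline add_one here.  */
  extern int add_one (int);

  int
  main (void)
  {
    int i, s = 0;

    for (i = 0; i < 1000; i++)
      s = add_one (s);
    return s & 0xff;
  }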
Nathan Froyd wrote:

In one of my recent messages about a patch to the LTO branch, I mentioned that we could compile and successfully run all of the C SPECint benchmarks except 176.gcc. Chris Lattner asked if I had done any benchmarking now that real programs could be run; I said that I hadn't, but would try to do some soon. This is the result of that.

I don't have numbers on what compile times look like, but I don't think they're good. 176.gcc takes several minutes to compile (basically -flto *.o, not counting the time to compile the individual .o files); the other benchmarks all take a minute or more apiece.

Executive summary: LTO is currently *not* a win.

In the table below, runtimes are in seconds. I ran the tests on an 8-core 1.6GHz machine with 8 GB RAM. I believe the machine was relatively idle; I ran the tests over a weekend evening. The last merge from mainline to the LTO branch was mainline r130155, so that's about what the -O2 numbers correspond to; I don't think we've changed too much core code on the branch. The % changes are just in-my-head estimates, using -O2 as the baseline.

                  -O2     -flto   % change
  164.gzip        174     176     + 1
  175.vpr         139     143     + 3
  181.mcf         162     166     + 3
  186.crafty      65.2    66.6    + < 1
  197.parser      240     261     + 9
  253.perlbmk     119     133     + 13
  254.gap         84.4    87      + 4
  256.bzip2       131     145     + 11
  300.twolf       202     193     - 4 (!)

176.gcc doesn't run correctly with LTO yet; 255.vortex didn't run correctly with "mainline", but it did with -flto, which is curious. We don't do C++ yet, so 252.eon is not included.

In general, things get worse with LTO, sometimes much worse. I can think of at least three possible reasons off the top of my head:

- Alias information. We don't have any type-based alias information in -flto, which hurts.

- We don't merge types between compilation units, which could account for poor optimization behavior.

- I believe we lose some information in the LTO write/read process; edge probabilities, estimated # of instructions in functions, etc. get lost. This hurts inlining decisions, block layout, alignment of jump targets, etc. So there's information we need to write out or recompute.

-Nathan