Hi,
this patch increases lto-partitions to 128. This makes the ltrans.o file sizes grow from 458MB to 651MB, which is still not perfect but a lot better than previously. On Firefox the growth is smaller (only about 10%), which is probably caused by the "unified build" they use, where they merge multiple sources via #include to reduce the number of objects to "only" about 8000. I will also test without the unified build this week.
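As a quick sanity check on the two ltrans.o sizes quoted above, the relative growth can be worked out as in the sketch below. It is only illustrative: the helper name relative_growth is hypothetical, and the only inputs are the 458MB and 651MB figures from the measurement.

```python
def relative_growth(before_mb: float, after_mb: float) -> float:
    """Return the size increase as a fraction of the original size."""
    return (after_mb - before_mb) / before_mb


# Total ltrans.o sizes quoted above, in MB (before and after the growth).
growth = relative_growth(458, 651)
print(f"ltrans.o growth: {growth:.1%}")  # prints: ltrans.o growth: 42.1%
```

So the quoted sizes correspond to roughly 42% growth, compared with about 10% for the Firefox unified build.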

What is interesting, however, is that even on my 8-core, 16-thread Bulldozer machine this reduces both overall time and user time:

partitions  real       user        sys
16:         4m25.586s  30m0.760s   0m21.772s
32:         4m16.163s  28m58.992s  0m28.996s
32:         3m17.889s  28m57.012s  0m29.084s
64:         2m55.663s  27m46.344s  0m39.568s
64:         2m57.010s  27m48.812s  0m39.192s
128:        2m52.978s  27m43.616s  0m47.964s
256:        2m54.915s  27m56.324s  1m2.272s
512:        3m2.762s   28m20.696s  1m25.616s
512:        3m1.851s   28m20.124s  1m23.812s
1to1:       4m34.263s  31m49.760s  1m56.804s

Firefox actually prefers even more partitions: it seems that the ideal partition memory use is about 80MB, which is probably hard to achieve in general. I plan to fine-tune this at the beginning of stage3, but I want to increase partitioning now so we hit possible negative performance effects earlier.

The WPA stage has some obvious bottlenecks:

Time variable               usr           sys          wall             GGC
phase opt and generate  : 39.34 ( 75%)   0.62 (  6%)  39.98 ( 65%)   360751 kB ( 26%)
phase stream in         : 11.88 ( 23%)   0.46 (  5%)  12.36 ( 20%)  1050929 kB ( 74%)
ipa function summary    :  0.17 (  0%)   0.03 (  0%)   0.23 (  0%)    68036 kB (  5%)
ipa cp                  :  0.83 (  2%)   0.07 (  1%)   0.98 (  2%)   127680 kB (  9%)
ipa inlining heuristics : 30.90 ( 59%)   0.05 (  1%)  30.96 ( 50%)   118731 kB (  8%)
lto stream inflate      :  2.94 (  6%)   0.15 (  2%)   2.95 (  5%)        0 kB (  0%)
ipa lto gimple in       :  1.10 (  2%)   0.32 (  3%)   1.32 (  2%)   162967 kB ( 12%)
ipa lto decl in         :  7.51 ( 14%)   0.18 (  2%)   7.77 ( 13%)   748707 kB ( 53%)
whopr partitioning      :  1.45 (  3%)   0.02 (  0%)   1.48 (  2%)     5451 kB (  0%)
ipa icf                 :  2.71 (  5%)   0.07 (  1%)   2.76 (  4%)    12571 kB (  1%)
TOTAL                   : 52.15          9.62         61.86         1413731 kB

- we may be in a position to look for a faster compression library (to save 6% of WPA)
- icf and profile merging still bring in too many function bodies (to save 12% of GGC memory)
- the inliner got slower. The reason is twofold.
  It now spends about 15% of its time in the hash table mapping summaries to symbol nodes (we used to have an array, which was removed by Martin), and we spend a lot of time in sreal computation. This can be micro-optimized, and I have some patches that speed it up noticeably by handling function contexts better.
- I have noticed that ltrans spends an absurd amount of time in lookup_external_ref (up to 20% in large partitions), which may skew the table above in favour of more partitioning.

Still, we could get important wins by reducing the amount of decl streaming (I will do some tests on simplifying function types, arrays and enums to see if there is low-hanging fruit left), but we do a lot better than ever before.

Bootstrapped/regtested x86_64-linux, committed.

Honza

	* params.def (lto-partitions): Set to 128 (instead of 32).

Index: params.def
===================================================================
--- params.def	(revision 265573)
+++ params.def	(working copy)
@@ -1103,7 +1103,7 @@ DEFPARAM (PARAM_IPA_MAX_AA_STEPS,
 DEFPARAM (PARAM_LTO_PARTITIONS,
 	  "lto-partitions",
 	  "Number of partitions the program should be split to.",
-	  32, 1, 0)
+	  128, 1, 0)
 
 DEFPARAM (MIN_PARTITION_SIZE,
 	  "lto-min-partition",