On 10/30/18 10:46 PM, Jan Hubicka wrote:
> Hi,
> this patch increases lto-partitions to 128.  This makes ltrans.o file sizes
> grow from 458MB to 651MB, which is still not perfect but a lot better than
> previously.  On firefox the growth is smaller (only about 10%), which is
> probably caused by the "unified build" they use, where they merge multiple
> sources via #include to reduce the number of objects "only" to about 8000.
> I will do testing w/o unified build this week as well.

Hi.

That sounds promising!

>
> What is interesting, however, is that even on my 8-core, 16-hyperthread
> Bulldozer machine this reduces both overall time and user time:
>
> partitions  real       user        sys
> 16:         4m25.586s  30m0.760s   0m21.772s
> 32:         4m16.163s  28m58.992s  0m28.996s
> 32:         3m17.889s  28m57.012s  0m29.084s
> 64:         2m55.663s  27m46.344s  0m39.568s
> 64:         2m57.010s  27m48.812s  0m39.192s
> 128:        2m52.978s  27m43.616s  0m47.964s
> 256:        2m54.915s  27m56.324s  1m2.272s
> 512:        3m2.762s   28m20.696s  1m25.616s
> 512:        3m1.851s   28m20.124s  1m23.812s
>
> 1to1:       4m34.263s  31m49.760s  1m56.804s
>
> Firefox actually prefers even more partitions: it seems that the ideal size
> for partition memory use is about 80MB, which is probably hard to achieve
> generally.
>
> I plan to fine-tune this at the beginning of stage3, but I want to increase
> partitioning now so we hit possible negative performance effects earlier.
>
> The WPA stage has some obvious bottlenecks:
> Time variable                   usr            sys           wall            GGC
> phase opt and generate   :  39.34 ( 75%)   0.62 (  6%)  39.98 ( 65%)  360751 kB ( 26%)
> phase stream in          :  11.88 ( 23%)   0.46 (  5%)  12.36 ( 20%) 1050929 kB ( 74%)
> ipa function summary     :   0.17 (  0%)   0.03 (  0%)   0.23 (  0%)   68036 kB (  5%)
> ipa cp                   :   0.83 (  2%)   0.07 (  1%)   0.98 (  2%)  127680 kB (  9%)
> ipa inlining heuristics  :  30.90 ( 59%)   0.05 (  1%)  30.96 ( 50%)  118731 kB (  8%)
> lto stream inflate       :   2.94 (  6%)   0.15 (  2%)   2.95 (  5%)       0 kB (  0%)
> ipa lto gimple in        :   1.10 (  2%)   0.32 (  3%)   1.32 (  2%)  162967 kB ( 12%)
> ipa lto decl in          :   7.51 ( 14%)   0.18 (  2%)   7.77 ( 13%)  748707 kB ( 53%)
> whopr partitioning       :   1.45 (  3%)   0.02 (  0%)   1.48 (  2%)    5451 kB (  0%)
> ipa icf                  :   2.71 (  5%)   0.07 (  1%)   2.76 (  4%)   12571 kB (  1%)
> TOTAL                    :  52.15          9.62         61.86        1413731 kB
>
>  - we may be in a position to look for a faster compression library (to save
>    6% of WPA)
>  - icf and profile merging still bring in too many function bodies (to save
>    12% of GGC memory)

Will take a look at ICF, maybe we can make the hash function finer-grained.

>  - inliner got slower.  The reason is twofold.  It now spends about 15% in
>    the hashtable mapping summaries to symbol nodes (we used to have an array,
>    which was removed by Martin)

Is the problematic one ipa_call_summaries?

>    and we do spend a lot of time in sreal computation.
>    This can be micro-optimized + I have some patches to speed it up noticeably
>    by getting function contexts handled better.

I can also help with that if you guide me.

Martin

>  - I have noticed that ltrans spends an absurd amount of time in
>    lookup_external_ref (up to 20% in large partitions), which may affect the
>    table above in favour of more partitioning.
>
> Still, we could get important wins by reducing the amount of decl streaming
> (I will do some tests on simplifying function types, arrays and enums to see
> if there is low-hanging fruit left), but we do a lot better than ever before.
>
> Bootstrapped/regtested x86_64-linux, committed.
>
> Honza
>
>
>         * params.def (lto-partitions): Set to 128 (instead of 32).
> Index: params.def
> ===================================================================
> --- params.def  (revision 265573)
> +++ params.def  (working copy)
> @@ -1103,7 +1103,7 @@ DEFPARAM (PARAM_IPA_MAX_AA_STEPS,
>  DEFPARAM (PARAM_LTO_PARTITIONS,
>           "lto-partitions",
>           "Number of partitions the program should be split to.",
> -         32, 1, 0)
> +         128, 1, 0)
>
>  DEFPARAM (MIN_PARTITION_SIZE,
>           "lto-min-partition",
>
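
For comparing the wall-clock numbers in the partition table quoted above, here is a minimal Python sketch that is not part of the patch. The times are copied from that table; the helper names (`to_seconds`, `best_real_seconds`, `fastest_setting`) are made up for illustration, and taking the faster of two runs for a given partition count is an assumption about how to read the duplicated rows.

```python
import re

# "real" times copied from the quoted table; where a partition count was
# measured twice, both runs are kept.  The 1to1 row is left out because it
# is not a partition count.
REAL_TIMES = {
    16: ["4m25.586s"],
    32: ["4m16.163s", "3m17.889s"],
    64: ["2m55.663s", "2m57.010s"],
    128: ["2m52.978s"],
    256: ["2m54.915s"],
    512: ["3m2.762s", "3m1.851s"],
}


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '4m25.586s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def best_real_seconds(times=REAL_TIMES):
    """Return {partitions: fastest measured real time in seconds}."""
    return {n: min(to_seconds(t) for t in runs) for n, runs in times.items()}


def fastest_setting(times=REAL_TIMES):
    """Return the partition count with the lowest real time."""
    best = best_real_seconds(times)
    return min(best, key=best.get)


if __name__ == "__main__":
    best = best_real_seconds()
    for n in sorted(best):
        print(f"{n:4d} partitions: {best[n]:8.3f}s real")
    print("fastest setting:", fastest_setting())
```

This only looks at real time on one machine; user and sys time move differently in the table, so it says nothing about the memory or file-size trade-offs discussed in the mail.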