LTO partitioning performance & increase default number of partitions

Jan Hubicka Tue, 30 Oct 2018 14:47:25 -0700

Hi,
this patch increases the default of lto-partitions from 32 to 128.  This makes
ltrans.o file sizes grow from 458MB to 651MB, which is still not perfect but a
lot better than previously.  On firefox the growth is smaller (only about 10%),
which is probably caused by the "unified build" they use, where they merge
multiple sources via #include to reduce the number of objects to "only" about
8000.  I will do testing without the unified build this week as well.


What is interesting, however, is that even on my 8-core, 16-thread Bulldozer
machine this reduces both overall time and user time:

partitions    real              user               sys      
16:           4m25.586s         30m0.760s          0m21.772s
32:           4m16.163s         28m58.992s         0m28.996s
32:           3m17.889s         28m57.012s         0m29.084s
64:           2m55.663s         27m46.344s         0m39.568s
64:           2m57.010s         27m48.812s         0m39.192s
128:          2m52.978s         27m43.616s         0m47.964s
256:          2m54.915s         27m56.324s         1m2.272s 
512:          3m2.762s          28m20.696s         1m25.616s
512:          3m1.851s          28m20.124s         1m23.812s

1to1:         4m34.263s         31m49.760s         1m56.804s
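
The timings above are easier to compare once converted to seconds.  The
following is a minimal Python sketch, not part of the patch; the names RUNS,
to_seconds and best_real_by_config are made up for illustration.  It assumes
that rows repeating a partition count are separate runs of the same
configuration, and it keeps the fastest wall-clock time for each setting.

```python
import re

# Real/user/sys times copied verbatim from the table above.
# Assumption: rows that repeat a partition count are separate runs.
RUNS = [
    ("16", "4m25.586s", "30m0.760s", "0m21.772s"),
    ("32", "4m16.163s", "28m58.992s", "0m28.996s"),
    ("32", "3m17.889s", "28m57.012s", "0m29.084s"),
    ("64", "2m55.663s", "27m46.344s", "0m39.568s"),
    ("64", "2m57.010s", "27m48.812s", "0m39.192s"),
    ("128", "2m52.978s", "27m43.616s", "0m47.964s"),
    ("256", "2m54.915s", "27m56.324s", "1m2.272s"),
    ("512", "3m2.762s", "28m20.696s", "1m25.616s"),
    ("512", "3m1.851s", "28m20.124s", "1m23.812s"),
    ("1to1", "4m34.263s", "31m49.760s", "1m56.804s"),
]


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '4m25.586s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def best_real_by_config(runs):
    """Map each partition setting to its fastest wall-clock time in seconds."""
    best = {}
    for config, real, _user, _sys in runs:
        seconds = to_seconds(real)
        best[config] = min(seconds, best.get(config, seconds))
    return best


if __name__ == "__main__":
    best = best_real_by_config(RUNS)
    fastest = min(best, key=best.get)
    for config, seconds in best.items():
        # Ratio > 1.00 means slower than the fastest setting.
        ratio = seconds / best[fastest]
        print(f"{config:>5}: {seconds:8.3f}s  x{ratio:.2f}")
```

With this reading of the table, 128 partitions has the lowest wall-clock time,
with 64 and 256 within a few seconds of it.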

Firefox actually prefers even more partitions: it seems that the ideal
per-partition memory use is about 80MB, which is probably hard to achieve in
general.  I plan to fine-tune this at the beginning of stage3, but I want to
increase partitioning now so we hit possible negative performance effects
earlier.

The WPA stage has some obvious bottlenecks:
Time variable                           usr            sys           wall           GGC
 phase opt and generate             :  39.34 ( 75%)   0.62 (  6%)  39.98 ( 65%)  360751 kB ( 26%)
 phase stream in                    :  11.88 ( 23%)   0.46 (  5%)  12.36 ( 20%) 1050929 kB ( 74%)
 ipa function summary               :   0.17 (  0%)   0.03 (  0%)   0.23 (  0%)   68036 kB (  5%)
 ipa cp                             :   0.83 (  2%)   0.07 (  1%)   0.98 (  2%)  127680 kB (  9%)
 ipa inlining heuristics            :  30.90 ( 59%)   0.05 (  1%)  30.96 ( 50%)  118731 kB (  8%)
 lto stream inflate                 :   2.94 (  6%)   0.15 (  2%)   2.95 (  5%)       0 kB (  0%)
 ipa lto gimple in                  :   1.10 (  2%)   0.32 (  3%)   1.32 (  2%)  162967 kB ( 12%)
 ipa lto decl in                    :   7.51 ( 14%)   0.18 (  2%)   7.77 ( 13%)  748707 kB ( 53%)
 whopr partitioning                 :   1.45 (  3%)   0.02 (  0%)   1.48 (  2%)    5451 kB (  0%)
 ipa icf                            :   2.71 (  5%)   0.07 (  1%)   2.76 (  4%)   12571 kB (  1%)
 TOTAL                              :  52.15          9.62         61.86        1413731 kB

 - we may be in a position to look for a faster compression library (to save
   6% of WPA time)
 - icf and profile merging still bring in too many function bodies (to save
   12% of GGC memory)
 - the inliner got slower.  The reason is twofold.  It now spends about 15% in
   the hashtable mapping summaries to symbol nodes (we used to have an array,
   which was removed by Martin), and we spend a lot of time in sreal
   computation.  This can be micro-optimized, and I have some patches that
   speed it up noticeably by handling function contexts better.
 - I have noticed that ltrans spends an absurd amount of time in
   lookup_external_ref (up to 20% in large partitions), which may shift the
   timing table above in favour of more partitioning.
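
The percentages quoted in the list can be cross-checked against the WPA time
report.  The Python sketch below is illustrative only; the helper name share
is made up.  It assumes the 6% figure corresponds to the "lto stream inflate"
row and the 12% figure to the "ipa lto gimple in" row.  That mapping is my
reading of the report, not something stated above.

```python
# Figures copied from the WPA time report above.
TOTAL_USR_S = 52.15      # TOTAL row, usr column (seconds)
TOTAL_GGC_KB = 1413731   # TOTAL row, GGC column (kB)


def share(part, total):
    """Return part/total as a whole-number percentage."""
    return round(100.0 * part / total)


# Assumed row-to-bullet mapping; see the note above the code.
inflate_share = share(2.94, TOTAL_USR_S)         # lto stream inflate, usr
gimple_in_share = share(162967, TOTAL_GGC_KB)    # ipa lto gimple in, GGC
inliner_share = share(30.90, TOTAL_USR_S)        # ipa inlining heuristics, usr

if __name__ == "__main__":
    print(f"lto stream inflate:      {inflate_share}% of usr time")
    print(f"ipa lto gimple in:       {gimple_in_share}% of GGC memory")
    print(f"ipa inlining heuristics: {inliner_share}% of usr time")
```

Under those assumptions the recomputed shares agree with the percentages
printed in the report.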

Still, we could get important wins by reducing the amount of decl streaming
(I will do some tests on simplifying function types, arrays and enums to see
if there is low-hanging fruit left), but we do a lot better than ever before.

Bootstrapped/regtested x86_64-linux, committed.

Honza


        * params.def (lto-partitions): Set to 128 (instead of 32).
Index: params.def
===================================================================
--- params.def  (revision 265573)
+++ params.def  (working copy)
@@ -1103,7 +1103,7 @@ DEFPARAM (PARAM_IPA_MAX_AA_STEPS,
 DEFPARAM (PARAM_LTO_PARTITIONS,
          "lto-partitions",
          "Number of partitions the program should be split to.",
-         32, 1, 0)
+         128, 1, 0)
 
 DEFPARAM (MIN_PARTITION_SIZE,
          "lto-min-partition",
