On Wed, Apr 16, 2025 at 11:08 AM xionghuluo <yinyuefen...@gmail.com> wrote:
>
> Hi, the bootstrap-lto-locality build is much longer compared to bootstrap-lto
> and bootstrap, and
>
> it seems that stage2 and stage3 only produced 5 partitions in LTO; is
> this reasonable...
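(Editor's aside: a back-of-the-envelope sketch of how a partition count this low falls out of a size-based split, as discussed in the reply below. The 5M-instruction program size is a hypothetical figure for illustration, not a number measured from the reported build.)

```cpp
#include <cassert>

// Number of size-capped LTO partitions produced when splitting a
// program of total_insns estimated instructions into chunks of at
// most max_part instructions (ceiling division).
unsigned
partitions_for (unsigned long total_insns, unsigned long max_part)
{
  return (total_insns + max_part - 1) / max_part;
}

// With the default --param lto-max-locality-partition=1000000, a
// hypothetical 5M-instruction program yields only 5 partitions,
// far below the usual lto-partitions target of 128.
```
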
Likely due to the high default of -param=lto-max-locality-partition=

  Common Joined UInteger Var(param_max_locality_partition_size) Init(1000000) Param
  Maximal size of a locality partition for LTO (in estimated instructions).
  Value of 0 results in the default value being used.

or the failure to apply the smaller param_min_partition_size to fill up
param_lto_partitions (128).

> Also, could you please tell us how much the exact performance gain is?
>
>
> make bootstrap: 27m56.054s
> make BUILD_CONFIG=bootstrap-lto: 38m25.048s
> make BUILD_CONFIG=bootstrap-lto-locality: 71m1.882s
>
>
> On 2025/4/15 22:38, Kyrylo Tkachov wrote:
> >
> >> On 15 Apr 2025, at 15:42, Richard Biener <richard.guent...@gmail.com> wrote:
> >>
> >> On Mon, Apr 14, 2025 at 3:11 PM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> >>> Hi Honza,
> >>>
> >>>> On 13 Apr 2025, at 23:19, Jan Hubicka <hubi...@ucw.cz> wrote:
> >>>>
> >>>>> +@opindex fipa-reorder-for-locality
> >>>>> +@item -fipa-reorder-for-locality
> >>>>> +Group call chains close together in the binary layout to improve code
> >>>>> +locality.  This option is incompatible with an explicit
> >>>>> +@option{-flto-partition=} option since it enforces a custom partitioning
> >>>>> +scheme.
> >>>> Please also cross-link this with -fprofile-reorder-functions and
> >>>> -freorder-functions, which do a similar thing,
> >>>> if you see how to clean up the description of the other two so the user is
> >>>> not confused.
> >>>>
> >>>> Perhaps say that -freorder-functions only partitions functions into
> >>>> never-executed/cold/normal/hot, and that -fprofile-reorder-functions aims
> >>>> at program startup optimization (it reorders by the measured first time the
> >>>> function is executed).  By accident it seems to kind of work for
> >>>> locality.
> >>> Yeah, the option names are quite similar, aren't they?
> >>> I’ve attempted to disambiguate them a bit in their description.
> >>> I’m attaching a diff from the previous version (as well as the full updated
> >>> patch) to make it easier to see what’s adjusted.
> >>>
> >>>
> >>>>> +
> >>>>> +/* Helper function to accumulate call counts.  */
> >>>>> +static bool
> >>>>> +accumulate_profile_counts_after_cloning (cgraph_node *node, void *data)
> >>>>> +{
> >>>>> +  struct profile_stats *stats = (struct profile_stats *) data;
> >>>>> +  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
> >>>>> +    {
> >>>>> +      if (e->caller == stats->target)
> >>>>> +        {
> >>>>> +          if (stats->rec_count.compatible_p (e->count.ipa ()))
> >>>>> +            stats->rec_count += e->count.ipa ();
> >>>>> +        }
> >>>>> +      else
> >>>>> +        {
> >>>>> +          if (stats->nonrec_count.compatible_p (e->count.ipa ()))
> >>>>> +            stats->nonrec_count += e->count.ipa ();
> >>>>> +        }
> >>>> In case part of the profile is missing (which may happen if one unit has -O0
> >>>> or so), we may have uninitialized counts.  Uninitialized counts are
> >>>> compatible with everything, but any arithmetic with them will produce an
> >>>> uninitialized result, which will likely confuse code later.  So I would
> >>>> skip edges with uninitialized counts.
> >>>>
> >>>> On the other hand, IPA counts are always compatible, so compatible_p
> >>>> should be redundant.  The main reason for the existence of compatible_p is
> >>>> that we can have local profiles that are 0 or unknown at the IPA level.
> >>>> The ipa () conversion turns all counts into IPA counts, and those are
> >>>> compatible with each other.
> >>>>
> >>>> I suppose the compatible_p test is there because the code ICEd in the past,
> >>>> but I think that was because of a missing ipa () conversion.
> >>>>
> >>>>
> >>>>> +    }
> >>>>> +  return false;
> >>>>> +}
> >>>>> +
> >>>>> +/* NEW_NODE is a previously created clone of ORIG_NODE already present in
> >>>>> +   the current partition.  EDGES contains newly redirected edges to NEW_NODE.
> >>>>> +   Adjust profile information for both nodes and the edge.
> >>>>> +   */
> >>>>> +
> >>>>> +static void
> >>>>> +adjust_profile_info_for_non_self_rec_edges (auto_vec<cgraph_edge *> &edges,
> >>>>> +                                            cgraph_node *new_node,
> >>>>> +                                            cgraph_node *orig_node)
> >>>>> +{
> >>>>> +  profile_count orig_node_count = orig_node->count.ipa ();
> >>>>> +  profile_count edge_count = profile_count::zero ();
> >>>>> +  profile_count final_new_count = profile_count::zero ();
> >>>>> +  profile_count final_orig_count = profile_count::zero ();
> >>>>> +
> >>>>> +  for (unsigned i = 0; i < edges.length (); ++i)
> >>>>> +    edge_count += edges[i]->count.ipa ();
> >>>> Here I would again skip uninitialized counts.  It is probably legal for a
> >>>> -O0 function to end up in the partition.
> >>>>> +
> >>>>> +  final_orig_count = orig_node_count - edge_count;
> >>>>> +
> >>>>> +  /* NEW_NODE->count was adjusted for other callers when the clone was
> >>>>> +     first created.  Just add the new edge count.  */
> >>>>> +  if (new_node->count.compatible_p (edge_count))
> >>>>> +    final_new_count = new_node->count + edge_count;
> >>>> And here compatible_p should be unnecessary.
> >>>>> +/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee.  */
> >>>>> +
> >>>>> +static sreal
> >>>>> +accumulate_incoming_edge_frequency (cgraph_edge *edge)
> >>>>> +{
> >>>>> +  sreal count = 0;
> >>>>> +  struct cgraph_edge *e;
> >>>>> +  for (e = edge->callee->callers; e; e = e->next_caller)
> >>>>> +    {
> >>>>> +      /* Make a local decision about all edges for EDGE->caller but not the
> >>>>> +         other nodes already in the partition.  Their edges will be visited
> >>>>> +         later or may have been visited before and not fit the
> >>>>> +         cut-off criteria.  */
> >>>>> +      if (e->caller == edge->caller)
> >>>>> +        {
> >>>>> +          profile_count caller_count = e->caller->inlined_to
> >>>>> +                                       ?
> >>>>> +                                         e->caller->inlined_to->count
> >>>>> +                                       : e->caller->count;
> >>>>> +          if (e->count.compatible_p (caller_count))
> >>>> Here again the compatibility check should not be necessary, since the counts
> >>>> belong to one function body (after inlining) and should be compatible.
> >>>> The inliner calls e->sreal_frequency all the time without further checks.
> >>>>
> >>> Yeah, I’ve adjusted these uses and added checks for initialized_p where
> >>> you highlighted.
> >>> Indeed it seems to work with a profiled bootstrap, and the ICEs we were
> >>> seeing earlier in development aren’t appearing.
> >>>
> >>>> The patch is OK with these changes.  I apologize for letting the review slip
> >>>> for so long.  I was setting up Firefox testing and LLVM builds to gather
> >>>> some data, which took longer than I hoped.  On Firefox and LLVM on Zen
> >>>> I can measure some improvements via instruction cache performance
> >>>> counters, but the overall benefit seems to be close to noise; this is
> >>>> likely very CPU-specific.  Overall, code locality is one of the main
> >>>> missing parts of the LTO framework.  As discussed at the Cauldron, I think
> >>>> next stage1 we can untie this from the partitioning algorithm, but that
> >>>> needs more work on the linker side as well as on gas fragments, so I think
> >>>> it is a good idea to move ahead with this patch as it is and improve from
> >>>> there.
> >>> Thank you very much for the evaluation and the feedback!  I agree the
> >>> effect can be very CPU-specific.
> >>>
> >>>> I think the patch modifies almost no code that is run without
> >>>> -fipa-reorder-for-locality, so I hope it is safe for 15.1, but I would
> >>>> like to let Richi and Jakub comment on this.
> >>> Thanks a lot for your reviews yet again.  They were very helpful.
> >>> I’ve updated the patch as per your suggestions and did a profiled LTO
> >>> bootstrap with the new bootstrap-lto-locality.mk that exercises this.
> >>> All looks good on aarch64-none-linux-gnu.
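(Editor's aside: a minimal toy model of the uninitialized-count hazard Honza describes above, not GCC's real profile_count API. The names toy_count and accumulate_counts are invented for illustration. Arithmetic with an uninitialized operand poisons the result, which is why the fixed patch guards edges with initialized_p-style checks instead of compatible_p.)

```cpp
#include <cassert>

// Toy stand-in for GCC's profile_count, only to illustrate why edges
// with uninitialized counts must be skipped: any arithmetic with an
// uninitialized operand yields an uninitialized result.
struct toy_count
{
  long val;
  bool init;
  static toy_count uninitialized () { return { 0, false }; }
  static toy_count known (long v) { return { v, true }; }
  bool initialized_p () const { return init; }
  toy_count operator+ (const toy_count &o) const
  {
    if (!init || !o.init)
      return uninitialized ();  // poisoned result
    return known (val + o.val);
  }
};

// Accumulate edge counts, skipping uninitialized ones (e.g. edges
// coming from a unit compiled with -O0 and hence without a profile).
toy_count
accumulate_counts (const toy_count *edges, int n)
{
  toy_count sum = toy_count::known (0);
  for (int i = 0; i < n; ++i)
    if (edges[i].initialized_p ())  // the guard Honza asks for
      sum = sum + edges[i];
  return sum;
}
```

Without the guard, one -O0 translation unit in the partition would silently turn the whole accumulated count into an uninitialized value.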
> >>> Richi, Jakub, is it okay for trunk now given Honza’s comments?
> >> OK, but prepare to back out if there are any issues.
> >>
> >> Quickly skimming over the patch shows:
> >>
> >> +/* Locality partitions, assigns nodes to partitions.  These are used later in
> >> +   WPA partitioning.  */
> >> +vec<locality_partition> locality_partitions;
> >> +
> >> +/* Map from original node to its latest clone.  Gets overwritten whenever a new
> >> +   clone is created from the same node.  */
> >> +hash_map<cgraph_node *, cgraph_node *> node_to_clone;
> >> +/* Map from clone to its original node.  */
> >> +hash_map<cgraph_node *, cgraph_node *> clone_to_node;
> >>
> >> Those global CTORs are frowned upon (why are those not static?); we prefer
> >> those to be pointers.  They are also not marked for garbage collection,
> >> but cgraph_nodes generally are, so I assume they are only live through the
> >> IPA pass itself.  They should ideally be created in a more local scope.
> >>
> > Thanks!  As discussed on IRC, we’ll work on addressing these.  I’ve pushed
> > the patch as 6d9fdf4bf57 after one more bootstrap on aarch64-none-linux-gnu.
> > Kyrill
> >
> >
> >> Richard.
> >>
> >>>
> >>>> Honza
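(Editor's aside: one way to address Richard's global-CTOR point, sketched with std::unordered_map standing in for GCC's hash_map; the pass driver name run_locality_partitioning is invented for illustration. The map is heap-allocated only for the lifetime of the IPA pass instead of being a file-scope object with a dynamic constructor.)

```cpp
#include <cassert>
#include <unordered_map>

struct cgraph_node;  // opaque here, as in the sketch

// Instead of
//   hash_map<cgraph_node *, cgraph_node *> node_to_clone;
// (a global with a dynamic constructor), keep only a static pointer
// and allocate the map inside the pass, giving it pass-local lifetime.
static std::unordered_map<cgraph_node *, cgraph_node *> *node_to_clone;

// Hypothetical pass driver: pairs up (original, clone) nodes, records
// them in the map, and destroys the map before returning.
unsigned
run_locality_partitioning (cgraph_node **nodes, unsigned n)
{
  node_to_clone = new std::unordered_map<cgraph_node *, cgraph_node *> ();
  for (unsigned i = 0; i + 1 < n; i += 2)
    node_to_clone->emplace (nodes[i], nodes[i + 1]);  // record clones
  unsigned clones = node_to_clone->size ();
  delete node_to_clone;  // the map dies with the pass
  node_to_clone = nullptr;
  return clones;
}
```

This avoids the dynamic global constructor and makes explicit that the clone maps are only live while the pass runs, matching Richard's observation that they are not garbage-collection roots.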