> On 15 Apr 2025, at 15:42, Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Mon, Apr 14, 2025 at 3:11 PM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>
>> Hi Honza,
>>
>>> On 13 Apr 2025, at 23:19, Jan Hubicka <hubi...@ucw.cz> wrote:
>>>
>>>> +@opindex fipa-reorder-for-locality
>>>> +@item -fipa-reorder-for-locality
>>>> +Group call chains close together in the binary layout to improve code
>>>> +locality.  This option is incompatible with an explicit
>>>> +@option{-flto-partition=} option since it enforces a custom partitioning
>>>> +scheme.
>>>
>>> Please also cross-link this with -fprofile-reorder-functions and
>>> -freorder-functions, which do a similar thing.
>>> If you see how to clean up the descriptions of the other two so the user
>>> is not confused.
>>>
>>> Perhaps say that -freorder-functions only partitions functions into
>>> never-executed/cold/normal/hot, while -fprofile-reorder-functions aims
>>> at program startup optimization (it reorders by the measured first time
>>> the function is executed).  By accident it seems to kind of work for
>>> locality.
>>
>> Yeah, the option names are quite similar, aren't they?
>> I've attempted to disambiguate them a bit in their descriptions.
>> I'm attaching a diff from the previous version (as well as the full updated
>> patch) to make it easier to see what's adjusted.
>>
>>
>>>
>>>> +
>>>> +/* Helper function to accumulate call counts.  */
>>>> +static bool
>>>> +accumulate_profile_counts_after_cloning (cgraph_node *node, void *data)
>>>> +{
>>>> +  struct profile_stats *stats = (struct profile_stats *) data;
>>>> +  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
>>>> +    {
>>>> +      if (e->caller == stats->target)
>>>> +        {
>>>> +          if (stats->rec_count.compatible_p (e->count.ipa ()))
>>>> +            stats->rec_count += e->count.ipa ();
>>>> +        }
>>>> +      else
>>>> +        {
>>>> +          if (stats->nonrec_count.compatible_p (e->count.ipa ()))
>>>> +            stats->nonrec_count += e->count.ipa ();
>>>> +        }
>>>
>>> In case part of the profile is missing (which may happen if one unit has
>>> -O0 or so), we may have uninitialized counts.  Uninitialized counts are
>>> compatible with everything, but any arithmetic with them will produce an
>>> uninitialized result, which will likely confuse code later.  So I would
>>> skip edges with uninitialized counts.
>>>
>>> On the other hand, IPA counts are always compatible, so compatible_p
>>> should be redundant.  The main reason for the existence of compatible_p
>>> is that we can have local profiles that are 0 or unknown at the IPA
>>> level.  The ipa () conversion turns all counts into IPA counts and those
>>> are compatible with each other.
>>>
>>> I suppose the compatible_p test is there because the code ICEd in the
>>> past, but I think that was because of a missing ipa () conversion.
>>>
>>>
>>>> +    }
>>>> +  return false;
>>>> +}
>>>> +
>>>> +/* NEW_NODE is a previously created clone of ORIG_NODE already present in
>>>> +   current partition.  EDGES contains newly redirected edges to NEW_NODE.
>>>> +   Adjust profile information for both nodes and the edge.  */
>>>> +
>>>> +static void
>>>> +adjust_profile_info_for_non_self_rec_edges (auto_vec<cgraph_edge *> &edges,
>>>> +                                            cgraph_node *new_node,
>>>> +                                            cgraph_node *orig_node)
>>>> +{
>>>> +  profile_count orig_node_count = orig_node->count.ipa ();
>>>> +  profile_count edge_count = profile_count::zero ();
>>>> +  profile_count final_new_count = profile_count::zero ();
>>>> +  profile_count final_orig_count = profile_count::zero ();
>>>> +
>>>> +  for (unsigned i = 0; i < edges.length (); ++i)
>>>> +    edge_count += edges[i]->count.ipa ();
>>>
>>> Here I would again skip uninitialized counts.  It is probably legal for
>>> a -O0 function to end up in a partition.
>>>
>>>> +
>>>> +  final_orig_count = orig_node_count - edge_count;
>>>> +
>>>> +  /* NEW_NODE->count was adjusted for other callers when the clone was
>>>> +     first created.  Just add the new edge count.  */
>>>> +  if (new_node->count.compatible_p (edge_count))
>>>> +    final_new_count = new_node->count + edge_count;
>>>
>>> And here compatible_p should be unnecessary.
>>>
>>>> +/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee.  */
>>>> +
>>>> +static sreal
>>>> +accumulate_incoming_edge_frequency (cgraph_edge *edge)
>>>> +{
>>>> +  sreal count = 0;
>>>> +  struct cgraph_edge *e;
>>>> +  for (e = edge->callee->callers; e; e = e->next_caller)
>>>> +    {
>>>> +      /* Make a local decision about all edges for EDGE->caller but not
>>>> +         the other nodes already in the partition.  Their edges will be
>>>> +         visited later or may have been visited before and not fit the
>>>> +         cut-off criteria.  */
>>>> +      if (e->caller == edge->caller)
>>>> +        {
>>>> +          profile_count caller_count = e->caller->inlined_to
>>>> +                                         ? e->caller->inlined_to->count
>>>> +                                         : e->caller->count;
>>>> +          if (e->count.compatible_p (caller_count))
>>>
>>> Here again the compatibility check should not be necessary, since the
>>> counts belong to one function body (after inlining) and should be
>>> compatible.  The inliner calls e->sreal_frequency all the time without
>>> further checks.
>>>
>>
>> Yeah, I've adjusted these uses and added checks for initialized_p where
>> you highlighted.
>> Indeed it seems to work with a profiled bootstrap, and the ICEs we were
>> seeing earlier in development aren't appearing.
>>
>>> The patch is OK with these changes.  I apologize for letting the review
>>> slip for so long.  I was setting up Firefox testing and LLVM builds to
>>> gather some data, which took longer than I hoped for.  On Firefox and
>>> LLVM on zen I can measure some improvements via instruction cache
>>> performance counters, but the overall benefit seems to be close to
>>> noise; this is likely very CPU-specific.  Overall, code locality is one
>>> of the main missing parts of the LTO framework.  As discussed at the
>>> Cauldron, I think next stage1 we can untie this from the partitioning
>>> algorithm, but that needs more work on the linker side as well as on gas
>>> fragments, so I think it is a good idea to move forward with this patch
>>> as it is and improve from there.
>>
>> Thank you very much for the evaluation and the feedback!  I agree the
>> effect can be very CPU-specific.
>>
>>>
>>> I think the patch modifies almost no code that is run without
>>> -fipa-reorder-for-locality, so I hope it is safe for 15.1, but I would
>>> like to let Richi and Jakub comment on this.
>>
>> Thanks a lot for your reviews yet again.  They were very helpful.
>> I've updated the patch as per your suggestions and did a profiled LTO
>> bootstrap with the new bootstrap-lto-locality.mk that exercises this.
>> All looks good on aarch64-none-linux-gnu.
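
[For reference, a minimal sketch of what the adjustment discussed above could
look like; it is illustrative only and not necessarily the committed code.
It applies the two suggestions to the quoted helper: skip edges whose IPA
count is uninitialized, and drop the compatible_p checks once ipa () has made
the remaining counts mutually compatible.  The profile_stats fields are the
ones from the quoted hunk.]

/* Illustrative sketch only: accumulate caller counts, skipping edges whose
   IPA count is uninitialized (e.g. callers compiled at -O0).  After the
   ipa () conversion the remaining counts are compatible, so no compatible_p
   check is needed.  */
static bool
accumulate_profile_counts_after_cloning (cgraph_node *node, void *data)
{
  struct profile_stats *stats = (struct profile_stats *) data;
  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
    {
      if (!e->count.ipa ().initialized_p ())
        continue;
      if (e->caller == stats->target)
        stats->rec_count += e->count.ipa ();
      else
        stats->nonrec_count += e->count.ipa ();
    }
  return false;
}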
>> Richi, Jakub, is it okay for trunk now given Honza's comments?
>
> OK, prepare to back out if there are any issues though.
>
> Quickly skimming over the patch shows
>
> +/* Locality partitions; assigns nodes to partitions.  These are used later
> +   in WPA partitioning.  */
> +vec<locality_partition> locality_partitions;
> +
> +/* Map from original node to its latest clone.  Gets overwritten whenever a
> +   new clone is created from the same node.  */
> +hash_map<cgraph_node *, cgraph_node *> node_to_clone;
> +/* Map from clone to its original node.  */
> +hash_map<cgraph_node *, cgraph_node *> clone_to_node;
>
> Those global CTORs are frowned upon (why are those not static?); we prefer
> those to be pointers.  They are also not marked for garbage collection,
> but cgraph_nodes generally are, so I assume they are only live through the
> IPA pass itself.  So they should ideally be created in a more local scope.
>
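
[For reference, one way the suggested cleanup could look; an illustrative
sketch only, not the committed follow-up.  run_locality_cloning is a
hypothetical pass entry point, and the sketch assumes the usual GCC internal
headers for hash_map and cgraph_node.]

/* Illustrative sketch only.  Keeping the maps behind pointers avoids the
   global constructors; allocating and freeing them inside the pass keeps
   their lifetime local to it.  */
static hash_map<cgraph_node *, cgraph_node *> *node_to_clone;
static hash_map<cgraph_node *, cgraph_node *> *clone_to_node;

static void
run_locality_cloning (void)  /* Hypothetical pass entry point.  */
{
  node_to_clone = new hash_map<cgraph_node *, cgraph_node *>;
  clone_to_node = new hash_map<cgraph_node *, cgraph_node *>;

  /* ... build the locality partitions, creating clones as needed ...  */

  delete node_to_clone;
  node_to_clone = NULL;
  delete clone_to_node;
  clone_to_node = NULL;
}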
Thanks!  As discussed on IRC we'll work on addressing these.
I've pushed the patch with 6d9fdf4bf57 after one more bootstrap on
aarch64-none-linux-gnu.
Kyrill

> Richard.
>
>>
>>
>>>
>>> Honza
>>