On Wed, Apr 16, 2025 at 11:08 AM xionghuluo <yinyuefen...@gmail.com> wrote:
>
> Hi, a bootstrap-lto-locality build takes much longer than bootstrap-lto
> or a plain bootstrap.
>
> It also seems that stage2 and stage3 produced only 5 LTO partitions; is
> that reasonable?

Likely due to the high default of

-param=lto-max-locality-partition=
Common Joined UInteger Var(param_max_locality_partition_size)
Init(1000000) Param
Maximal size of a locality partition for LTO (in estimated
instructions). Value of 0 results in default value being used.

or to the smaller param_min_partition_size not being applied to fill
up the param_lto_partitions (128) target.
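As a back-of-the-envelope illustration (names and logic are mine, not GCC's actual partitioning code), a high size cap can dominate the desired partition count:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model only: each locality partition is filled up to the
// size cap, so a high --param=lto-max-locality-partition= (default
// 1000000 estimated instructions) can leave far fewer partitions than
// the usual lto-partitions target of 128.
static long
estimated_partition_count (long total_insns, long max_partition_size,
                           long desired_partitions)
{
  // Number of max-size partitions the program fits into, rounded up.
  long by_size = (total_insns + max_partition_size - 1) / max_partition_size;
  // The size cap dominates when it yields fewer partitions than desired.
  return std::min (by_size, desired_partitions);
}
```

Under this model a program of roughly five million estimated instructions would end up in about 5 partitions, which would be consistent with the observation above.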

> Also, could you share the exact performance gain, please?
>
>
> make bootstrap:                             27m56.054s
> make BUILD_CONFIG=bootstrap-lto:            38m25.048s
> make BUILD_CONFIG=bootstrap-lto-locality:   71m1.882s
>
>
> On 2025/4/15 22:38, Kyrylo Tkachov wrote:
> >
> >> On 15 Apr 2025, at 15:42, Richard Biener <richard.guent...@gmail.com> 
> >> wrote:
> >>
> >> On Mon, Apr 14, 2025 at 3:11 PM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> >>> Hi Honza,
> >>>
> >>>> On 13 Apr 2025, at 23:19, Jan Hubicka <hubi...@ucw.cz> wrote:
> >>>>
> >>>>> +@opindex fipa-reorder-for-locality
> >>>>> +@item -fipa-reorder-for-locality
> >>>>> +Group call chains close together in the binary layout to improve code
> >>>>> +locality.  This option is incompatible with an explicit
> >>>>> +@option{-flto-partition=} option since it enforces a custom partitioning
> >>>>> +scheme.
> >>>> Please also cross-link this with -fprofile-reorder-functions and
> >>>> -freorder-functions, which do a similar thing.
> >>>> Please see if you can clean up the description of the other two so the
> >>>> user is not confused.
> >>>>
> >>>> Perhaps say that -freorder-functions only partitions functions into
> >>>> never-executed/cold/normal/hot, while -fprofile-reorder-functions aims
> >>>> at program startup optimization (it reorders by the measured first time
> >>>> each function is executed).  By accident it seems to kind of work for
> >>>> locality.
> >>> Yeah, the option names are quite similar, aren't they?
> >>> I’ve attempted to disambiguate them a bit in their description.
> >>> I’m attaching a diff from the previous version (as the full updated 
> >>> patch) to make it easier to see what’s adjusted.
> >>>
> >>>
> >>>>> +
> >>>>> +/* Helper function to accumulate call counts.  */
> >>>>> +static bool
> >>>>> +accumulate_profile_counts_after_cloning (cgraph_node *node, void *data)
> >>>>> +{
> >>>>> +  struct profile_stats *stats = (struct profile_stats *) data;
> >>>>> +  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
> >>>>> +    {
> >>>>> +      if (e->caller == stats->target)
> >>>>> +        {
> >>>>> +          if (stats->rec_count.compatible_p (e->count.ipa ()))
> >>>>> +            stats->rec_count += e->count.ipa ();
> >>>>> +        }
> >>>>> +      else
> >>>>> +        {
> >>>>> +          if (stats->nonrec_count.compatible_p (e->count.ipa ()))
> >>>>> +            stats->nonrec_count += e->count.ipa ();
> >>>>> +        }
> >>>> In case part of the profile is missing (which may happen if one unit
> >>>> uses -O0 or so), we may have uninitialized counts.  Uninitialized
> >>>> counts are compatible with everything, but any arithmetic with them
> >>>> will produce an uninitialized result which will likely confuse code
> >>>> later.  So I would skip edges with uninitialized counts.
> >>>>
> >>>> On the other hand, ipa counts are always compatible, so compatible_p
> >>>> should be redundant.  The main reason for the existence of compatible_p
> >>>> is that we can have local profiles that are 0 or unknown at the IPA
> >>>> level.  The ipa () conversion turns all counts into IPA counts and
> >>>> those are compatible with each other.
> >>>>
> >>>> I suppose the compatible_p test is there since the code ICEd in the
> >>>> past, but I think that was because of a missing ipa () conversion.
> >>>>
> >>>>
> >>>>> +    }
> >>>>> +  return false;
> >>>>> +}
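The skip-uninitialized pattern Honza suggests can be sketched standalone with mock types (the types and helper below are stand-ins I made up, not GCC's profile_count or cgraph_edge):

```cpp
#include <cassert>
#include <vector>

// Mock of a profile count that may be uninitialized, e.g. when the
// defining unit was compiled at -O0 without profile data.
struct profile_count_mock
{
  bool init;
  long value;
  bool initialized_p () const { return init; }
};

// Mock caller edge: does it come from the target caller, and with
// what count.
struct edge_mock
{
  bool from_target;
  profile_count_mock count;
};

// Accumulate recursive vs. non-recursive caller counts, skipping
// edges whose count is uninitialized so they cannot poison the sums.
static void
accumulate_counts (const std::vector<edge_mock> &callers,
                   long &rec_count, long &nonrec_count)
{
  for (const edge_mock &e : callers)
    {
      if (!e.count.initialized_p ())
        continue;  /* Missing profile: ignore this edge entirely.  */
      if (e.from_target)
        rec_count += e.count.value;
      else
        nonrec_count += e.count.value;
    }
}
```

The point of the pattern is that one profile-less unit contributes nothing, rather than turning the whole accumulated total uninitialized.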
> >>>>> +
> >>>>> +/* NEW_NODE is a previously created clone of ORIG_NODE already present in
> >>>>> +   current partition.  EDGES contains newly redirected edges to NEW_NODE.
> >>>>> +   Adjust profile information for both nodes and the edge.  */
> >>>>> +
> >>>>> +static void
> >>>>> +adjust_profile_info_for_non_self_rec_edges (auto_vec<cgraph_edge *> &edges,
> >>>>> +                                            cgraph_node *new_node,
> >>>>> +                                            cgraph_node *orig_node)
> >>>>> +{
> >>>>> +  profile_count orig_node_count = orig_node->count.ipa ();
> >>>>> +  profile_count edge_count = profile_count::zero ();
> >>>>> +  profile_count final_new_count = profile_count::zero ();
> >>>>> +  profile_count final_orig_count = profile_count::zero ();
> >>>>> +
> >>>>> +  for (unsigned i = 0; i < edges.length (); ++i)
> >>>>> +    edge_count += edges[i]->count.ipa ();
> >>>> Here I would again skip uninitialized counts.  It is probably legal
> >>>> for a -O0 function to end up in a partition.
> >>>>> +
> >>>>> +  final_orig_count = orig_node_count - edge_count;
> >>>>> +
> >>>>> +  /* NEW_NODE->count was adjusted for other callers when the clone was
> >>>>> +     first created.  Just add the new edge count.  */
> >>>>> +  if (new_node->count.compatible_p (edge_count))
> >>>>> +    final_new_count = new_node->count + edge_count;
> >>>> And here compatible_p should be unnecessary.
> >>>>> +/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee.  */
> >>>>> +
> >>>>> +static sreal
> >>>>> +accumulate_incoming_edge_frequency (cgraph_edge *edge)
> >>>>> +{
> >>>>> +  sreal count = 0;
> >>>>> +  struct cgraph_edge *e;
> >>>>> +  for (e = edge->callee->callers; e; e = e->next_caller)
> >>>>> +    {
> >>>>> +      /* Make a local decision about all edges for EDGE->caller but not the
> >>>>> +         other nodes already in the partition.  Their edges will be visited
> >>>>> +         later or may have been visited before and not fit the
> >>>>> +         cut-off criteria.  */
> >>>>> +      if (e->caller == edge->caller)
> >>>>> +        {
> >>>>> +          profile_count caller_count = e->caller->inlined_to
> >>>>> +            ? e->caller->inlined_to->count
> >>>>> +            : e->caller->count;
> >>>>> +          if (e->count.compatible_p (caller_count))
> >>>> Here again the compatibility check should not be necessary, since the
> >>>> counts belong to one function body (after inlining) and should be
> >>>> compatible.  The inliner calls e->sreal_frequency all the time without
> >>>> further checks.
> >>>>
> >>> Yeah, I’ve adjusted these uses and added initialized_p checks where you
> >>> highlighted them.
> >>> Indeed it seems to work with a profiled bootstrap, and the ICEs we were
> >>> seeing earlier in development aren’t appearing.
> >>>
> >>>> Patch is OK with these changes.  I apologize for letting the review
> >>>> slip for so long.  I was setting up Firefox testing and LLVM builds to
> >>>> gather some data, which took longer than I hoped.  On Firefox and LLVM
> >>>> on Zen I can measure some improvements via instruction cache
> >>>> performance counters, but the overall benefit seems to be close to
> >>>> noise; this is likely very CPU specific.  Overall, code locality is one
> >>>> of the main missing parts of the LTO framework.  As discussed at
> >>>> Cauldron, I think next stage1 we can untie this from the partitioning
> >>>> algorithm, but that needs more work on the linker side as well as on
> >>>> gas fragments, so I think it is a good idea to move ahead with this
> >>>> patch as it is and improve from there.
> >>> Thank you very much for the evaluation and the feedback! I agree the 
> >>> effect can be very CPU-specific.
> >>>
> >>>> I think the patch modifies almost no code that is run without
> >>>> -fipa-reorder-for-locality, so I hope it is safe for 15.1, but I would
> >>>> like to let Richi and Jakub comment on this.
> >>> Thanks a lot for your reviews yet again. They were very helpful.
> >>> I’ve updated the patch as per your suggestion and did a profiled LTO
> >>> bootstrap with the new bootstrap-lto-locality.mk that exercises this.
> >>> All looks good on aarch64-none-linux-gnu.
> >>> Richi, Jakub, is it okay for trunk now given Honza’s comments?
> >> OK, prepare to back out if there are any issues though.
> >>
> >> Quickly skimming over the patch shows
> >>
> >> +/* Locality partitions, assigns nodes to partitions.  These are used later in
> >> +   WPA partitioning.  */
> >> +vec<locality_partition> locality_partitions;
> >> +
> >> +/* Map from original node to its latest clone.  Gets overwritten whenever a new
> >> +   clone is created from the same node.  */
> >> +hash_map<cgraph_node *, cgraph_node *> node_to_clone;
> >> +/* Map from clone to its original node.  */
> >> +hash_map<cgraph_node *, cgraph_node *> clone_to_node;
> >>
> >> Those global CTORs are frowned upon (why are these not static?); we
> >> prefer them to be pointers.  They are also not marked for garbage
> >> collection, but cgraph_node generally is, so I assume they are only live
> >> through the IPA pass itself.  They should ideally be created in a more
> >> local scope.
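One way to avoid the global constructors Richard mentions is static pointers created only for the lifetime of the pass. This sketch uses std::unordered_map as a stand-in for GCC's hash_map, and the begin/end helpers are hypothetical names, not the committed fix:

```cpp
#include <cassert>
#include <unordered_map>

// Stand-in for hash_map<cgraph_node *, cgraph_node *>; void * keys and
// values keep the sketch self-contained.
using node_map = std::unordered_map<void *, void *>;

// Static pointers: no global constructor runs, and the maps (with
// their node references) do not outlive the pass.
static node_map *node_to_clone;
static node_map *clone_to_node;

// Allocate the maps when the IPA pass starts...
static void
locality_pass_begin ()
{
  node_to_clone = new node_map;
  clone_to_node = new node_map;
}

// ...and free them when it finishes, so nothing stale survives.
static void
locality_pass_end ()
{
  delete node_to_clone;
  delete clone_to_node;
  node_to_clone = nullptr;
  clone_to_node = nullptr;
}
```

This keeps the containers out of static initialization entirely and makes their lifetime match the pass, which also sidesteps the garbage-collection concern for the cgraph_node pointers they hold.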
> >>
> > Thanks! As discussed on IRC we’ll work on addressing these. I’ve pushed the 
> > patch with 6d9fdf4bf57 after one more bootstrap on aarch64-none-linux-gnu.
> > Kyrill
> >
> >
> >> Richard.
> >>
> >>>
> >>>> Honza
