> +@opindex fipa-reorder-for-locality
> +@item -fipa-reorder-for-locality
> +Group call chains close together in the binary layout to improve code
> +locality.  This option is incompatible with an explicit
> +@option{-flto-partition=} option since it enforces a custom partitioning
> +scheme.

Please also cross-link this with -fprofile-reorder-functions and
-freorder-functions, which do similar things.
If you see how to clean up the descriptions of the other two so the
user is not confused, that would be welcome.

Perhaps say that -freorder-functions only partitions functions into
never-executed/cold/normal/hot, while -fprofile-reorder-functions aims
at program startup optimization (it reorders functions by the measured
time of their first execution).  By accident it seems to kind of work
for locality, too.

> +
> +/* Helper function to accumulate call counts.  */
> +static bool
> +accumulate_profile_counts_after_cloning (cgraph_node *node, void *data)
> +{
> +  struct profile_stats *stats = (struct profile_stats *) data;
> +  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
> +    {
> +      if (e->caller == stats->target)
> +     {
> +       if (stats->rec_count.compatible_p (e->count.ipa ()))
> +         stats->rec_count += e->count.ipa ();
> +     }
> +      else
> +     {
> +       if (stats->nonrec_count.compatible_p (e->count.ipa ()))
> +         stats->nonrec_count += e->count.ipa ();
> +     }
In case part of the profile is missing (which may happen if one unit is
built with -O0 or so), some counts may be uninitialized.  Uninitialized
counts are compatible with everything, but any arithmetic with them
will produce an uninitialized result, which will likely confuse code
later.  So I would skip edges with uninitialized counts.

On the other hand, IPA counts are always compatible, so compatible_p
should be redundant.  The main reason for the existence of compatible_p
is that we can have local profiles that are 0 or unknown at the IPA
level.  The ipa () conversion turns all counts into IPA counts and
those are compatible with each other.

I suppose the compatible_p test is there because the code ICEd in the
past, but I think that was because of a missing ipa () conversion.
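
So roughly something like this (an untested sketch, reusing the names
from your patch):

      profile_count cnt = e->count.ipa ();
      /* Skip edges whose profile is missing, e.g. from -O0 units.  */
      if (!cnt.initialized_p ())
	continue;
      if (e->caller == stats->target)
	stats->rec_count += cnt;
      else
	stats->nonrec_count += cnt;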


> +    }
> +  return false;
> +}
> +
> +/* NEW_NODE is a previously created clone of ORIG_NODE already present in
> +   current partition.  EDGES contains newly redirected edges to NEW_NODE.
> +   Adjust profile information for both nodes and the edge.  */
> +
> +static void
> +adjust_profile_info_for_non_self_rec_edges (auto_vec<cgraph_edge *> &edges,
> +                                         cgraph_node *new_node,
> +                                         cgraph_node *orig_node)
> +{
> +  profile_count orig_node_count = orig_node->count.ipa ();
> +  profile_count edge_count = profile_count::zero ();
> +  profile_count final_new_count = profile_count::zero ();
> +  profile_count final_orig_count = profile_count::zero ();
> +
> +  for (unsigned i = 0; i < edges.length (); ++i)
> +    edge_count += edges[i]->count.ipa ();
Here I would again skip uninitialized counts.  It is probably legal for
an -O0 function to end up in a partition.
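That is, something like (again just a sketch):

  for (unsigned i = 0; i < edges.length (); ++i)
    if (edges[i]->count.ipa ().initialized_p ())
      edge_count += edges[i]->count.ipa ();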
> +
> +  final_orig_count = orig_node_count - edge_count;
> +
> +  /* NEW_NODE->count was adjusted for other callers when the clone was
> +     first created.  Just add the new edge count.  */
> +  if (new_node->count.compatible_p (edge_count))
> +    final_new_count = new_node->count + edge_count;
And here compatible_p should be unnecessary.
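So I think just

  final_new_count = new_node->count + edge_count;

should do.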
> +/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee.  */
> +
> +static sreal
> +accumulate_incoming_edge_frequency (cgraph_edge *edge)
> +{
> +  sreal count = 0;
> +  struct cgraph_edge *e;
> +  for (e = edge->callee->callers; e; e = e->next_caller)
> +    {
> +      /* Make a local decision about all edges for EDGE->caller but not the
> +      other nodes already in the partition.  Their edges will be visited
> +      later or may have been visited before and not fit the
> +      cut-off criteria.  */
> +      if (e->caller == edge->caller)
> +     {
> +       profile_count caller_count = e->caller->inlined_to
> +                                     ? e->caller->inlined_to->count
> +                                     : e->caller->count;
> +       if (e->count.compatible_p (caller_count))
Here again the compatibility check should not be necessary, since the
counts belong to one function body (after inlining) and should be
compatible.  The inliner calls e->sreal_frequency () all the time
without further checks.
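I.e. I would expect something along the lines of (sketch; I have not
checked the rest of the loop body):

      if (e->caller == edge->caller)
	count += e->sreal_frequency ();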

The patch is OK with these changes.  I apologize for letting the review
slip for so long.  I was setting up Firefox testing and LLVM builds to
gather some data, which took longer than I hoped.  On Firefox and LLVM
on Zen I can measure some improvements via instruction cache
performance counters, but the overall benefit seems to be close to
noise; this is likely very CPU specific.  Overall, code locality is one
of the main missing parts of the LTO framework.  As discussed at
Cauldron, I think next stage1 we can untie this from the partitioning
algorithm, but that needs more work on the linker side as well as on
gas fragments, so I think it is a good idea to move ahead with this
patch as it is and improve from there.

I think the patch modifies almost no code that is run without
-fipa-reorder-for-locality, so I hope it is safe for 15.1, but I would
like to let Richi and Jakub comment on this.

Honza
