Re: [PATCH 0/1] [RFC][AutoFDO] Propagate inline information to outline definitions if not inlined

Dhruv Chawla Fri, 13 Jun 2025 02:47:42 -0700

On 13/06/25 14:51, Jan Hubicka wrote:

From: Dhruv Chawla <dhr...@nvidia.com>

Hi,


For reasons explained in the patch, this change prevents the loss of
profile information when a call is inlined in the profiled binary but the
auto-profile pass decides not to inline it. As an example, for this code:


I was wondering about this problem too

- Annotation, merging and inlining form a messy set of dependencies in
   the auto-profile pass. The order that functions get annotated in
   affects the decisions that the inliner makes, but the order of
   visiting them is effectively random due to the use of
   FOR_EACH_FUNCTION.

- The main issue is that annotation is performed after inlining. This is
   meant to more accurately mirror the hot path in the profiled binary,
   however there is no guarantee of this because of the randomness in the
   order of visitation.

I thought the extra early-inlining invocation just queries the AFDO data,
not the annotated function body (i.e. it is all done inter-procedurally
before the annotation starts).

I.e. we do
1) read afdo gcov file
2) do regular early optimizations
3) do the extra early-inliner invocation of the afdo pass
4) annotate CFG


Unfortunately not:

auto_profile (void)
{
  <...>
  FOR_EACH_FUNCTION (node)
  {
    <...>
    unsigned int todo = 0;
    for (int i = 0; i < 10; i++)
      {
        if (!flag_value_profile_transformations
            || !autofdo::afdo_vpt_for_early_inline (&promoted_stmts))
          break;
        todo |= early_inline ();
      }

    todo |= early_inline ();
    autofdo::afdo_annotate_cfg (promoted_stmts);
    compute_function_frequency ();

The early inliner is invoked on each function before it is annotated. It
also looks like the pass aggressively tries to do VPT before annotation.


Early inliner uses afdo_callsite_hot_enough_for_early_inline which uses
autofdo::afdo_source_profile->get_callsite_total_count which does:

gcov_type
autofdo_source_profile::get_callsite_total_count (
     struct cgraph_edge *edge) const
{
   inline_stack stack;
   stack.safe_push (std::make_pair (edge->callee->decl, 0));
   get_inline_stack (gimple_location (edge->call_stmt), &stack);

   function_instance *s = get_function_instance_by_inline_stack (stack);
   if (s == NULL
       ||(afdo_string_table->get_index_by_decl (edge->callee->decl)
          != s->name()))
     return 0;

   return s->total_count ();
}

I think this should return the sum of all counts in the profile of the
inline instance and, transitively, of everything inlined into it, without
actually looking at the CFG profile computed by annotation later.

Where do we have the dependency?
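
To make the transitive sum described above concrete, here is a minimal
sketch over a toy profile tree. ToyInstance, toy_total_count and
toy_example are hypothetical names made up for illustration; this is not
the real function_instance from gcc/auto-profile.cc. The only point it
shows is that such a sum depends on the profile data alone and never
consults an annotated CFG.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an AutoFDO inline instance
// (an assumption for illustration, not GCC's function_instance).
struct ToyInstance {
  std::string name;
  std::int64_t own_count;            // samples attributed directly to this instance
  std::vector<ToyInstance> inlined;  // instances inlined into it in the profiled binary
};

// Sum of this instance's samples plus, transitively, everything inlined into it.
std::int64_t toy_total_count(const ToyInstance &inst) {
  assert(inst.own_count >= 0);  // sample counts are never negative
  std::int64_t sum = inst.own_count;
  for (const ToyInstance &child : inst.inlined)
    sum += toy_total_count(child);
  return sum;
}

// Example tree: bar_1 with foo inlined into it, and baz inlined into foo.
ToyInstance toy_example() {
  ToyInstance baz{"baz", 5, {}};
  ToyInstance foo{"foo", 30, {baz}};
  return ToyInstance{"bar_1", 100, {foo}};
}
```

With this toy tree, the total for the foo instance covers foo and baz,
and the total for bar_1 covers all three.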


Because the early inliner is invoked while annotation is being done,
it's possible that not all known information has been propagated to the
total_count of the function_instance when the early inliner is invoked.

Another problem here is that get_inline_stack returns an empty stack if
no inlining occurred in the corresponding GIMPLE statement. So if an
inline callsite does exist in the profile at the current GIMPLE
statement but no inlining actually occurs during auto-profile, the
information is just dropped.
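
The effect described above can be sketched with a toy lookup. This is a
simplified model under stated assumptions, not the GCC code: ToyNode,
toy_lookup, toy_profile and toy_outline_foo_is_dropped are hypothetical
names, and inline stacks are modelled as plain lists of function names
(outermost first) rather than decl/location pairs derived from GIMPLE.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a profile entry; not GCC's function_instance.
struct ToyNode {
  std::string name;
  std::int64_t total;            // total samples recorded for this instance
  std::vector<ToyNode> inlined;  // callees that were inlined in the profiled binary
};

// Follow an inline stack (outermost function first) through the profile.
// Returns nullptr when any step of the stack has no matching entry,
// which includes the case of an empty stack.
const ToyNode *toy_lookup(const std::vector<ToyNode> &top_level,
                          const std::vector<std::string> &stack) {
  const std::vector<ToyNode> *level = &top_level;
  const ToyNode *found = nullptr;
  for (const std::string &name : stack) {
    found = nullptr;
    for (const ToyNode &node : *level)
      if (node.name == name) {
        found = &node;
        break;
      }
    if (found == nullptr)
      return nullptr;
    level = &found->inlined;
  }
  return found;
}

// Profile in which foo only ever appears inlined into bar_1.
std::vector<ToyNode> toy_profile() {
  ToyNode foo{"foo", 30, {}};
  return {ToyNode{"bar_1", 130, {foo}}};
}

// True when the outline copy of foo finds no profile data at all.
bool toy_outline_foo_is_dropped() {
  const std::vector<ToyNode> profile = toy_profile();
  // If the compiler also inlines foo into bar_1, the stack matches the profile.
  const ToyNode *nested = toy_lookup(profile, {"bar_1", "foo"});
  assert(nested != nullptr && nested->total == 30);
  // If it does not, foo is looked up on its own and nothing is found.
  return toy_lookup(profile, {"foo"}) == nullptr;
}
```

In this model the 30 samples recorded for foo are only reachable through
the bar_1 stack, so an un-inlined foo sees no counts unless something
copies them to an outline entry.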


However, a bigger problem is with LTO, where we can have an inline
instance coming from a different unit.  In this case early inlining
cannot succeed.

Also, we have two early inliners with AFDO: the usual one that runs
during early optimization, and a re-run of it within the autofdo pass.
I think the second is there to handle indirect calls, but I wonder if
that can be done during early opts as well, or if we want to redo early
opts after this inlining, possibly to get a better match between the
optimized CFG we have AFDO data for and the CFG we see.


- Consider the following example:

   int foo () { <...> }
   int bar_1 () { <...> foo (); <..> }
   int bar_2 () { <...> foo (); <..> }
   int bar_3 () { <...> foo (); <..> }

   If foo was always inlined in all three bar_<n> functions, the profile
   information will contain inline callsites for all bar_<n> functions.
   There will be no separate profile information for foo in the GCOV file.
   If auto-profile visits them in the order bar_1 -> foo -> bar_2 ->
   bar_3, it is possible that inlining could fail in bar_1 because foo
   would not have any profile counts associated with it. If foo was


I thought this should be covered by afdo_callsite_hot_enough_for_early_inline,
at least when everything is built with -fno-lto?


The problem here is that foo would have no function_instance associated with it
when bar_1 is being processed, because if it doesn't get inlined then the call
to get_inline_stack would return an empty value as noted above and
afdo_callsite_hot_enough_for_early_inline would bail.


We definitely have a problem in cases where we decide not to early inline,
or with LTO, so your changes make sense; I am just trying to understand
what exactly we are seeing here.

   visited first, then that decision could change. This non-determinism
   raises the question of splitting out:

   1. Merging inline callsites into outline copies
   2. Annotating functions
   3. Inlining callsites


I think it can be
     0. do the afdo guided auto inline (ideally during early optimization)
     1. Merging inline callsites which were not inlined in 0 into outline copies
     2. Annotating functions
     3. Inlining callsites


Yeah, that does sound better. So the profile would be read initially, used
to set the total_count for each function, then the detailed annotation
would be done in 2?
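
A rough sketch of what step 1 could look like on a toy profile follows.
Everything in it is an assumption for illustration: ToyCaller, ToyCallee,
ToyCallsite, toy_merge_uninlined and toy_three_callers are hypothetical
names, only one level of inlining is modelled, and whether the caller's
own counts need adjusting is left open. It is not the code from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical inline instance recorded under a caller in the profile.
struct ToyCallee {
  std::string name;
  std::int64_t total;  // total samples of this inline instance
};

// Hypothetical top-level profile entry with its inline callsites.
struct ToyCaller {
  std::string name;
  std::vector<ToyCallee> inlined;
};

using ToyCallsite = std::pair<std::string, std::string>;  // (caller, callee)

// Step 1 sketch: every inline instance that step 0 did NOT inline has its
// total added to the outline copy of the callee.
std::map<std::string, std::int64_t>
toy_merge_uninlined(const std::vector<ToyCaller> &profile,
                    const std::set<ToyCallsite> &inlined_in_step0) {
  std::map<std::string, std::int64_t> outline;
  for (const ToyCaller &caller : profile)
    for (const ToyCallee &callee : caller.inlined) {
      assert(callee.total >= 0);  // sample counts are never negative
      if (inlined_in_step0.count(ToyCallsite(caller.name, callee.name)) == 0)
        outline[callee.name] += callee.total;
    }
  return outline;
}

// foo was inlined into all three bar_<n> functions in the profiled binary.
std::vector<ToyCaller> toy_three_callers() {
  return {
      ToyCaller{"bar_1", {ToyCallee{"foo", 30}}},
      ToyCaller{"bar_2", {ToyCallee{"foo", 20}}},
      ToyCaller{"bar_3", {ToyCallee{"foo", 10}}},
  };
}
```

Because the merge only adds counts and never changes which callsites
exist, this toy version gives the same outline totals for any visiting
order of the callers.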


Note that the LNT tester is finally up and running:
https://lnt.opensuse.org/db_default/v4/SPEC/67393
thanks to Petr Hodac :)


Yay, that's really nice :)


   As separate phases in auto-profile, where each effectively executes as a
   sub-pass. As modification of the cgraph is only done in 3., the order of
   visiting functions, at least in 1. and 2., should not matter. Does this
   sound okay?

Splitting out inlining as its own phase also means that it can
eventually be handed off to ipa-inline to handle, thus making
auto-profile independent of early inline. This will simplify the code a
fair bit. Is this a good direction to go in?


I think dropping the logic of inlining early and applying the profile to
inlined instances is actually going to lose quite a bit of precision,
since inline instances are quite specialized and have different CFGs.
But we definitely ought to handle the case where inlining failed.

Also notice that head_count is missing for inline instances...


The point is mainly about removing calls to the inliner from the auto-profile
pass so that it can be deferred to the ipa-inline pass, i.e. point 3. I think
it would still work fine if early-inline could use the afdo data before
auto-profile runs.


Honza


Bootstrapped and regtested on aarch64-linux-gnu.

Dhruv Chawla (1):
   [RFC][AutoFDO] Propagate information to outline copies if not inlined

  gcc/auto-profile.cc | 72 +++++++++++++++++++++++++++++++++++++++------
  1 file changed, 63 insertions(+), 9 deletions(-)

--
2.44.0


--
Regards,
Dhruv
