Hi, > On 13 Jun 2025, at 7:21 pm, Jan Hubicka <hubi...@ucw.cz> wrote: > > External email: Use caution opening links or attachments > > >> From: Dhruv Chawla <dhr...@nvidia.com> > Hi, >> >> For reasons explained in the patch, this patch prevents the loss of profile >> information when inlining occurs in the profiled binary but not in the >> auto-profile pass as a decision. As an example, for this code: > > I was wondering about this problem too >> - Annotation, merging and inlining form a messy set of dependencies in >> the auto-profile pass. The order that functions get annotated in >> affects the decisions that the inliner makes, but the order of >> visiting them is effectively random due to the use of >> FOR_EACH_FUNCTION. >> >> - The main issue is that annotation is performed after inlining. This is >> meant to more accurately mirror the hot path in the profiled binary, >> however there is no guarantee of this because of the randomness in the >> order of visitation. > I tought the extra early inlining invocation just queries the AFDO data, > not annotated function body (i.e. is done inter-procedurally all before > the annotation starts). I think what Dhruv is talking here is abut the case say, foo calls bar and it is inlined in the profiled binary. However, AFDO decided it is not hot enough to inline. In this case we want to merge the inlined profile back and annotate to bar. If we call annotate_cfg to bar before we do to foo, we would have annotated foo before we merged it back. This is the ordering issue here.
I am not sure we want to keep track of functions that had profile merges and do a second [ass of ammaoation for them at the end, > > I.e. we do > 1) read afdo gcov file > 2) do regular early optimizations > 3) do the extra early-inliner invocatoin of afdo pass > 4) annotate CFG > > Early inliner uses afdo_callsite_hot_enough_for_early_inline which uses > autofdo::afdo_source_profile->get_callsite_total_count which does: > > gcov_type > autofdo_source_profile::get_callsite_total_count ( > struct cgraph_edge *edge) const > { > inline_stack stack; > stack.safe_push (std::make_pair (edge->callee->decl, 0)); > get_inline_stack (gimple_location (edge->call_stmt), &stack); > > function_instance *s = get_function_instance_by_inline_stack (stack); > if (s == NULL > ||(afdo_string_table->get_index_by_decl (edge->callee->decl) > != s->name())) > return 0; > > return s->total_count (); > } > > I think this should return the sum of all counts in profile of the > inline instance and transitively everything inlined in it without > actually looking at the CFG profile computed by annotation later. > > Where we have the dependency? > > However bigger problem is with LTO where we can have inline instance > coming from different unit. In this case early inlining can not > succeed. > > Also we have two early inliners with AFDO. The usual one that is > done during early optimization and we re-run it within autofdo pass. > I think the second is to handle indirect calls, but I wonder if that can > be done during early opts as well. Or if we want to re-do early opts > after this inlining possibly to get better match of optimized cfg we > have afdo data for and cfg we see. >> >> - Consider the following example: >> >> int foo () { <...> } >> int bar_1 () { <...> foo (); <..> } >> int bar_2 () { <...> foo (); <..> } >> int bar_3 () { <...> foo (); <..> } >> >> If foo was always inlined in all three bar_<n> functions, the profile >> information will contain inline callsites for all bar_<n> functions. >> There will be no separate profile information for foo in the GCOV file. >> If auto-profile visits them in the order bar_1 -> foo -> bar_2 -> >> bar_3, it is possible that inlining could fail in bar_1 because foo >> would not have any profile counts associated with it. If foo was > > I tought this should be covered by afdo_callsite_hot_enough_for_early_inline > at least when everything is build with -fno-lto? > > We definitely have problem in cases we do not decide to early inlining > or with LTO, so you changes makes sense, just I am trying to understand > what exactly we are seeing here. Main motivation here is to preserve the profile as much as we can. Currently we loose this. > >> visited first, then that decision could change. This non-determinism >> raises the question of splitting out: >> >> 1. Merging inline callsites into outline copies >> 2. Annotating functions >> 3. Inlining callsites > > I think it can be > 0. do the afdo guided auto inline (ideally during early optimization) > 1. Merging inline callsites which was not inlined in 0 into outline copies > 2. Annotating functions > 3. Inlining callsites > Yes, this is what we were thinking of too. > Note that lnt tester is finally up and running > https://lnt.opensuse.org/db_default/v4/SPEC/67393 > thanks to Petr Hodac :) >> >> As separate phases in auto-profile, where each effectively executes as a >> sub-pass. As modification of the cgraph is only done in 3., the order of >> visiting functions, at least in 1. and 2., should not matter. Does this >> sound okay? >> >> Splitting out inlining as its own phase also means that it can >> eventually be handed off to ipa-inline to handle, thus making >> auto-profile independent of early inline. This will simplify the code a >> fair bit. Is this a good direction to go in? > > I think dropping the logic of inlining early and applying profile to > inlined instances is actually going to lose quite some of precision, > since inline instances are quite specialized and have different CFGs. > But we definitely ought to handle the case where inlining failed. I agree. Is that something we can aim for when we have accurate profile representation in IR and inline that can work with AFDO profile? Thanks, Kugan > > Also notice that head_count is missing for inline instances... > > Honza >> >> Bootstrapped and regtested on aarch64-linux-gnu. >> >> Dhruv Chawla (1): >> [RFC][AutoFDO] Propagate information to outline copies if not inlined >> >> gcc/auto-profile.cc | 72 +++++++++++++++++++++++++++++++++++++++------ >> 1 file changed, 63 insertions(+), 9 deletions(-) >> >> -- >> 2.44.0 >>