From: Dhruv Chawla <dhr...@nvidia.com>

This patch prevents the loss of profile information when a callsite is
inlined in the profiled binary but the auto-profile pass decides not to
inline it. The reasoning is explained in the patch itself. As an
example, for this code:

  #define TRIP 1000000000

  #ifdef DO_NOINLINE
  # define INLINE __attribute__((noinline))
  #else
  # define INLINE __attribute__((always_inline))
  #endif

  INLINE int baz(int x, int y, int z) {
    if (x < TRIP / 4) {
      return y + z * 8;
    } else {
      return y * z / 2;
    }
  }

  __attribute__((noinline, noipa, optnone))
  int passthrough(int x, int y, int z) {
    return baz(x, y, z);
  }

  int main() {
    for (int i = 0; i < TRIP; i++) {
      passthrough(i, i + 1, i + 2);
    }
  }

This test case is first compiled without -DDO_NOINLINE, then the
resulting binary is profiled and the profile fed back while compiling
with -DDO_NOINLINE. This results in baz having an inline callsite in
passthrough in the GCOV but no inlining in the FDO binary.

Compiling this with and without the patch gives the following .afdo
dumps:

- With the patch:

  __attribute__((noinline))
  int baz (int x, int y, int z)
  {
    int _1;
    int _2;
    int _3;
    int _7;
    int _8;

    <bb 2> [count: 534583]:
    if (x_4(D) <= 249999999)
      goto <bb 3>; [100.00%]
    else
      goto <bb 4>; [0.00%]

    <bb 3> [count: 534583]:
    _1 = z_6(D) * 8;
    _8 = _1 + y_5(D);
    goto <bb 5>; [100.00%]

    <bb 4> [count: 0]:
    _2 = y_5(D) * z_6(D);
    _7 = _2 / 2;

    <bb 5> [count: 534583]:
    # _3 = PHI <_8(3), _7(4)>
    return _3;

  }

- Without the patch:

  __attribute__((noinline))
  int baz (int x, int y, int z)
  {
    int _1;
    int _2;
    int _3;
    int _7;
    int _8;

    <bb 2> [local count: 1073741824]:
    if (x_4(D) <= 249999999)
      goto <bb 3>; [50.00%]
    else
      goto <bb 4>; [50.00%]

    <bb 3> [local count: 536870912]:
    _1 = z_6(D) * 8;
    _8 = _1 + y_5(D);
    goto <bb 5>; [100.00%]

    <bb 4> [local count: 536870912]:
    _2 = y_5(D) * z_6(D);
    _7 = _2 / 2;

    <bb 5> [local count: 1073741824]:
    # _3 = PHI <_8(3), _7(4)>
    return _3;

  }

Thus the profile counts are lost in this example, without the patch.

While developing this patch, a few other points also came up:

- Annotation, merging and inlining form a messy set of dependencies in
  the auto-profile pass.
  The order in which functions get annotated affects the decisions that
  the inliner makes, but the order of visiting them is effectively
  random due to the use of FOR_EACH_FUNCTION.

- The main issue is that annotation is performed after inlining. This
  is meant to more accurately mirror the hot path in the profiled
  binary; however, there is no guarantee of this because of the
  randomness in the order of visitation.

- Consider the following example:

    int foo () { <...> }
    int bar_1 () { <...> foo (); <...> }
    int bar_2 () { <...> foo (); <...> }
    int bar_3 () { <...> foo (); <...> }

  If foo was always inlined in all three bar_<n> functions, the profile
  information will contain inline callsites for all bar_<n> functions.
  There will be no separate profile information for foo in the GCOV
  file. If auto-profile visits them in the order
  bar_1 -> foo -> bar_2 -> bar_3, it is possible that inlining could
  fail in bar_1 because foo would not have any profile counts
  associated with it. If foo was visited first, then that decision
  could change.

This non-determinism raises the question of splitting out:

1. Merging inline callsites into outline copies
2. Annotating functions
3. Inlining callsites

as separate phases in auto-profile, where each effectively executes as
a sub-pass. As modification of the cgraph is only done in 3., the order
of visiting functions, at least in 1. and 2., should not matter. Does
this sound okay?

Splitting out inlining as its own phase also means that it can
eventually be handed off to ipa-inline to handle, thus making
auto-profile independent of early inline. This will simplify the code a
fair bit. Is this a good direction to go in?

Bootstrapped and regtested on aarch64-linux-gnu.

Dhruv Chawla (1):
  [RFC][AutoFDO] Propagate information to outline copies if not inlined

 gcc/auto-profile.cc | 72 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 63 insertions(+), 9 deletions(-)

-- 
2.44.0