On Wed, Dec 4, 2019 at 10:52 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> with recent fixes to profile updating I noticed that we get more regressions
> compared to gcc 9 in Firefox testing.  This is because the Firefox train run
> does not cover all the benchmarks, and gcc 9, thanks to updating bugs,
> sometimes optimizes code for speed even if it was not trained.
>
> While in general one should have a reasonable train run, in some cases it is
> not practical to do so.  For example, the skia library has vector code
> optimized for different ISAs, and thus Firefox renders quickly only if it is
> trained on the same CPU it later runs on.
>
> This patch adds the flag -fprofile-partial-training, which makes GCC optimize
> untrained functions as if -fprofile-use was not given.  This nullifies the
> code size improvements of FDO, but it can be used in cases where full
> training is not quite possible (and one can use it only on portions of
> programs).
>
> Previously the only good answer was to disable profiling for a given
> function, but that needs to be done quite precisely and in general is hard
> to arrange.
>
> The patch works by
> 1) not setting PROFILE_READ for functions with entry count 0
> 2) making the inliner and ipa-cp drop the profile to a local one when all
>    trained executions are redirected to clones
> 3) reducing the quality of branch probabilities of branches leading to
>    never executed regions to GUESSED.  This is necessary to prevent gcc
>    from propagating things back.
>
> Bootstrapped/regtested x86_64-linux.  I plan to commit it tomorrow if there
> are no complaints.  Feedback is welcome!
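To make the scenario above concrete, here is a minimal sketch (the function
names are hypothetical, not taken from skia or from the patch): a dispatcher
whose ISA-specific path is never reached on the training machine, so all of
its counts read as zero in the profile even though it is the hot path on
newer hardware.  Plain -fprofile-use would optimize render_avx2 for size as
presumed-cold code; with -fprofile-partial-training it would instead be
optimized as if no profile feedback were available.

  /* render_avx2 is never executed during the train run on an older
     machine, so its profile counts are all zero.  cpu_has_avx2,
     render_avx2 and render_generic are placeholders for illustration.  */

  #include <stdbool.h>

  bool cpu_has_avx2 (void);     /* false on the training machine */
  void render_avx2 (void);      /* ISA-specific fast path, untrained */
  void render_generic (void);   /* fallback, covered by the train run */

  void
  render (void)
  {
    if (cpu_has_avx2 ())
      render_avx2 ();
    else
      render_generic ();
  }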
I wonder if the behavior shouldn't be the default?  The only thing we lose
is failing to notice really cold calls (error paths) in programs?

Richard.

>
> Honza
>
>         * cgraphclones.c (localize_profile): New function.
>         (cgraph_node::create_clone): Use it for partial profiles.
>         * common.opt (fprofile-partial-training): New flag.
>         * doc/invoke.texi (-fprofile-partial-training): Document.
>         * ipa-cp.c (update_profiling_info): For partial profiles do not
>         set function profile to zero.
>         * profile.c (compute_branch_probabilities): With partial profile
>         watch if edge count is zero and turn all probabilities to guessed.
>         (compute_branch_probabilities): For partial profiles do not apply
>         profile when entry count is zero.
>         * tree-profile.c (tree_profiling): Only do
>         value_profile_transformations when profile is read.
> Index: cgraphclones.c
> ===================================================================
> --- cgraphclones.c      (revision 278944)
> +++ cgraphclones.c      (working copy)
> @@ -307,6 +307,22 @@ dump_callgraph_transformation (const cgr
>      }
>  }
>
> +/* Turn profile of N to local profile.  */
> +
> +static void
> +localize_profile (cgraph_node *n)
> +{
> +  n->count = n->count.guessed_local ();
> +  for (cgraph_edge *e = n->callees; e; e = e->next_callee)
> +    {
> +      e->count = e->count.guessed_local ();
> +      if (!e->inline_failed)
> +        localize_profile (e->callee);
> +    }
> +  for (cgraph_edge *e = n->indirect_calls; e; e = e->next_callee)
> +    e->count = e->count.guessed_local ();
> +}
> +
>  /* Create node representing clone of N executed COUNT times.  Decrease
>     the execution counts from original node too.
>     The new clone will have decl set to DECL that may or may not be the same
> @@ -340,6 +356,7 @@ cgraph_node::create_clone (tree new_decl
>    cgraph_edge *e;
>    unsigned i;
>    profile_count old_count = count;
> +  bool nonzero = count.ipa ().nonzero_p ();
>
>    if (new_inlined_to)
>      dump_callgraph_transformation (this, new_inlined_to, "inlining to");
> @@ -426,6 +446,15 @@ cgraph_node::create_clone (tree new_decl
>
>    if (call_duplication_hook)
>      symtab->call_cgraph_duplication_hooks (this, new_node);
> +  /* With a partial train run we do not want to assume that the original's
> +     count is zero whenever we redirect all executed edges to the clone.
> +     Simply drop the profile to a local one in this case.  */
> +  if (update_original
> +      && opt_for_fn (decl, flag_partial_profile_training)
> +      && nonzero
> +      && count.ipa_p ()
> +      && !count.ipa ().nonzero_p ())
> +    localize_profile (this);
>
>    if (!new_inlined_to)
>      dump_callgraph_transformation (this, new_node, suffix);
> Index: common.opt
> ===================================================================
> --- common.opt  (revision 278944)
> +++ common.opt  (working copy)
> @@ -2160,6 +2160,10 @@ fprofile-generate=
>  Common Joined RejectNegative
>  Enable common options for generating profile info for profile feedback directed optimizations, and set -fprofile-dir=.
>
> +fprofile-partial-training
> +Common Report Var(flag_partial_profile_training) Optimization
> +Do not assume that functions never executed during the train run are cold.
> +
>  fprofile-use
>  Common Var(flag_profile_use)
>  Enable common options for performing profile feedback directed optimizations.
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi     (revision 278944)
> +++ doc/invoke.texi     (working copy)
> @@ -453,8 +453,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>  -fprefetch-loop-arrays @gol
>  -fprofile-correction @gol
> --fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
> --fprofile-reorder-functions @gol
> +-fprofile-use -fprofile-use=@var{path} -fprofile-partial-training @gol
> +-fprofile-values -fprofile-reorder-functions @gol
>  -freciprocal-math -free -frename-registers -freorder-blocks @gol
>  -freorder-blocks-algorithm=@var{algorithm} @gol
>  -freorder-blocks-and-partition -freorder-functions @gol
> @@ -10634,6 +10634,17 @@ default, GCC emits an error message when
>
>  This option is enabled by @option{-fauto-profile}.
>
> +@item -fprofile-partial-training
> +@opindex fprofile-partial-training
> +With @code{-fprofile-use} all portions of programs not executed during the
> +train run are optimized aggressively for size rather than speed.  In some
> +cases it is not practical to train all possible hot paths in the program.
> +(For example, a program may contain functions specific to given hardware,
> +and training may not cover all hardware configurations the program is run
> +on.)  With @code{-fprofile-partial-training} profile feedback will be
> +ignored for all functions not executed during the train run, leading them
> +to be optimized as if they were compiled without profile feedback.
> +
>  @item -fprofile-use
>  @itemx -fprofile-use=@var{path}
>  @opindex fprofile-use
> Index: ipa-cp.c
> ===================================================================
> --- ipa-cp.c    (revision 278944)
> +++ ipa-cp.c    (working copy)
> @@ -4295,6 +4295,15 @@ update_profiling_info (struct cgraph_nod
>
>    remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa ()
>                                                        - new_sum.ipa ());
> +
> +  /* With a partial train run we do not want to assume that the original's
> +     count is zero whenever we redirect all executed edges to the clone.
> +     Simply drop the profile to a local one in this case.  */
> +  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
> +      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
> +      && flag_partial_profile_training)
> +    remainder = remainder.guessed_local ();
> +
>    new_sum = orig_node_count.combine_with_ipa_count (new_sum);
>    new_node->count = new_sum;
>    orig_node->count = remainder;
> Index: profile.c
> ===================================================================
> --- profile.c   (revision 278944)
> +++ profile.c   (working copy)
> @@ -635,9 +635,20 @@ compute_branch_probabilities (unsigned c
>          }
>        if (bb_gcov_count (bb))
>          {
> +          bool set_to_guessed = false;
>            FOR_EACH_EDGE (e, ei, bb->succs)
> -            e->probability = profile_probability::probability_in_gcov_type
> -                (edge_gcov_count (e), bb_gcov_count (bb));
> +            {
> +              bool prev_never = e->probability == profile_probability::never ();
> +              e->probability = profile_probability::probability_in_gcov_type
> +                  (edge_gcov_count (e), bb_gcov_count (bb));
> +              if (e->probability == profile_probability::never ()
> +                  && !prev_never
> +                  && flag_partial_profile_training)
> +                set_to_guessed = true;
> +            }
> +          if (set_to_guessed)
> +            FOR_EACH_EDGE (e, ei, bb->succs)
> +              e->probability = e->probability.guessed ();
>            if (bb->index >= NUM_FIXED_BLOCKS
>                && block_ends_with_condjump_p (bb)
>                && EDGE_COUNT (bb->succs) >= 2)
> @@ -697,17 +708,23 @@ compute_branch_probabilities (unsigned c
>          }
>      }
>
> -  if (exec_counts)
> +  if (exec_counts
> +      && (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
> +          || !flag_partial_profile_training))
>      profile_status_for_fn (cfun) = PROFILE_READ;
>
>    /* If we have real data, use them!  */
>    if (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
>        || !flag_guess_branch_prob)
>      FOR_ALL_BB_FN (bb, cfun)
> -    bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
> +    if (bb_gcov_count (bb) || !flag_partial_profile_training)
> +      bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
> +    else
> +      bb->count = profile_count::guessed_zero ();
>    /* If function was not trained, preserve local estimates including
>       statically determined zero counts.  */
> -  else if (profile_status_for_fn (cfun) == PROFILE_READ)
> +  else if (profile_status_for_fn (cfun) == PROFILE_READ
> +           && !flag_partial_profile_training)
>      FOR_ALL_BB_FN (bb, cfun)
>        if (!(bb->count == profile_count::zero ()))
>          bb->count = bb->count.global0 ();
> @@ -1417,7 +1434,7 @@ branch_prob (bool thunk)
>    /* At this moment we have precise loop iteration count estimates.
>       Record them to loop structure before the profile gets out of date. */
>    FOR_EACH_LOOP (loop, 0)
> -    if (loop->header->count > 0)
> +    if (loop->header->count > 0 && loop->header->count.reliable_p ())
>        {
>          gcov_type nit = expected_loop_iterations_unbounded (loop);
>          widest_int bound = gcov_type_to_wide_int (nit);
> Index: tree-profile.c
> ===================================================================
> --- tree-profile.c      (revision 278944)
> +++ tree-profile.c      (working copy)
> @@ -785,7 +785,8 @@ tree_profiling (void)
>    if (flag_branch_probabilities
>        && !thunk
>        && flag_profile_values
> -      && flag_value_profile_transformations)
> +      && flag_value_profile_transformations
> +      && profile_status_for_fn (cfun) == PROFILE_READ)
>      gimple_value_profile_transformations ();
>
>    /* The above could hose dominator info.  Currently there is
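For reference alongside the invoke.texi text above, the intended use would
presumably be the ordinary FDO cycle with the new flag added only at the
feedback step (file and input names are made up for illustration):

  gcc -O2 -fprofile-generate foo.c -o foo
  ./foo train-input            # the train run need not reach every path
  gcc -O2 -fprofile-use -fprofile-partial-training foo.c -o foo

Functions the train run never executed then keep their static (guessed)
estimates instead of being treated as cold and optimized for size.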