Hi Honza,
Thanks for the review, please find responses inline below.
> -----Original Message-----
> From: Jan Hubicka <[email protected]>
> Sent: 01 December 2025 04:36
> To: Prathamesh Kulkarni <[email protected]>
> Cc: [email protected]
> Subject: Re: [RFC] Enable time profile function reordering with
> AutoFDO
>
> External email: Use caution opening links or attachments
>
>
> > (1) Changes to auto-profile pass:
> > The patch adds a new field timestamp to function_instance, and
> > populates it in read_function_instance.
> >
> > It maintains a new timestamp_info_map from timestamp -> <name,
> > tp_first_run>, which maps timestamps sorted in ascending order to
> > (1..N), so lowest ordered timestamp is mapped to 1 and so on. The
> > rationale for this is that timestamps are 64-bit integers, and we
> > don't need the full 64-bit range for ordering by tp_first_run.
> >
> > During annotation, the timestamp associated with function_instance
> is
> > looked up in timestamp_info_map, and corresponding mapped value is
> > assigned to node->tp_first_run.
>
> I assume that the autofdo assigns timestamps to offile function
> instances, so this can also fit into function_instance datastructure,
> but I also suppose you want to keep it separate so it is optional?
function_instance does store timestamp, and uses it as a key into
timestamp_info_map.
My intent of keeping a separate data-structure timestamp_info_map was to sort
64-bit timestamps into ascending order,
and then map the sorted timestamp values to (1..N), and assign the mapped value
to node->tp_first_run.
So we don't need to up size of node->tp_first_run to 64bit (and I guess we
won't need the full 64-bit range here?)
> >
> > (2) Handling clones:
> > Currently, for clones not registered in call graph before auto-
> profile
> > pass, the tp_first_run field is copied from original function, when
> the clone is created.
> > However that may not correspond to the actual order of functions.
> >
> > For eg, if we have two profiled clones of foo:
> > foo.constprop.1, foo.constprop.2
> >
> > both will get same value for tp_first_run as foo->tp_first_run,
> which
> > might not correspond to time profile order.
> >
> > To address this, the patch introduces a new IPA pass
> > ipa_adjust_tp_first_run, that streams <clone name, tp_first_run>
> from
> > timestamp_info_map during LGEN, and during WPA reads it, and sets
> clone's tp_first_run field accordingly.
> > The pass is placed pretty late (just before locality_cloning), by
> that
> > point clones would be registered in the call graph.
>
> This is also a problem we have with -fprofile-use. Since for clones
> we
> know who is calling htem, perhaps we can just assign the time based on
> minimal first run of calling functions?
> (this may be too early in case the caller has a path avoiding callee,
> but may be reasonable heurstics)
>
> Can you separate the patch so this goes separately? Copying the
> tp_first_run value is not terrible and we can discuss it more after
> basic ifrastructure goes in.
Sure. The attached patch only keeps the base infra, I will send the part adding
ipa_adjust_tp_first_run in a separate patch.
> >
> > Dhruv's sourcefile tracking patch already handles LTO privatized
> functions.
> > The patch adds a (temporary) workaround for functions with
> > mismatched/empty filenames from gcov, to avoid getting dropped in
> > afdo_annotate_cfg by iterating thru all filenames in
> afdo_string_table if get_function_instance_by_decl fails to find
> function_instance with lbasename (DECL_SOURCE_FILE (decl)).
> >
> > (3) Grouping profiled functions together in as few partitions as
> possible (preferably single).
> > The patch places profiled functions in time profile order together
> in
> > as few paritions as possible to get better advantage of code
> locality.
> > Unlike PGO, where every instrumented function gets a time profile
> > counter, with AutoFDO, the sampled functions are a fraction of the
> total executed ones.
> > Similarly, in default_function_section, it overrides hot/cold
> > partitioning so that grouping of profiled functions isn't disrupted.
> >
> > (4) Option to disable profile driven opts.
> > The patch adds option -fauto-profile-reorder-only which only enables
> > time-profile reordering with AutoFDO (and disables profile driven
> opts):
> > (a) Useful as a debugging aid to isolate regression to either
> function reordering or profile driven opts.
> > (b) For our use case, it's also seemingly useful as a stopgap
> measure
> > to avoid regressions with AutoFDO profile driven opts, due to issues
> with profile quality obtained with merging of SPE and non SPE
> profiles.
> > We're actively working on resolving this.
> > (c) Possibly useful for architectures which do not support branch
> sampling.
> > The option is disabled by default.
> >
> > Ideally, I would like to make it a param (and not user facing
> option),
> > but I am not able to control enabling/disabling options in
> opts.cc:common_handle_option based on param value, will investigate
> this further.
>
> Yes, I think this should be a param. I already added option to not
> read function profile (only scale existing guessed profile), so
> perhaps we can have --param auto-profile-* to control individual auto-
> profile features that can be used for debugging until things gets more
> reliable.
Done in the attached patch.
>
> Also does the fature work with normal partitioning? It should make
> those function with tp_first_run nonzero go into early partitions, but
> then we will likely get relatively lot of calls to functions with zero
> counts just because we lost track of them?
The results I reported have been with balanced partitioning. I am seeing
roughly the same numbers on top of locality partitioning too.
>
> I also wonder if offlining can behave meaningfully here. I.e. when
> function profile is non-zero just copy first run from the caller.
IIUC, the purpose of afdo_offline pass is to remove cross module inlining from
function_instances so AutoFDO annotation
gets a better mapping from source locations to CFG ?
Currently, the patch only assigns timestamps to functions whose outline copy is
executed during train run (toplevel).
Would afdo offlining help for instance if a function is inlined into all it's
callers during train run (thus the outline copy has no timestamp info),
but during AutoFDO, it may not get inlined into all it's callers ? Currently in
this case, I guess we lose timestamp info (and thus time profile reordering for
such a function).
Currently the transform is gated explicitly on -fauto-profile
-fprofile-reorder-functions (so users can opt-in instead of being enabled by
default with -fauto-profile).
Should that be OK for gcc-16 ?
Does the patch look OK if bootstrap+test passes (with and without AutoFDO) ?
Signed-off-by: Prathamesh Kulkarni <[email protected]>
Thanks,
Prathamesh
> > diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc index
> > ae604762165..58e583d5b87 100644
> > --- a/gcc/auto-profile.cc
> > +++ b/gcc/auto-profile.cc
> > @@ -54,6 +54,8 @@ along with GCC; see the file COPYING3. If not see
> > #include "tree-pretty-print.h"
> > #include "gimple-pretty-print.h"
> > #include "output.h"
> > +#include "data-streamer.h"
> > +#include "lto-streamer.h"
> >
> > /* The following routines implements AutoFDO optimization.
> >
> > @@ -263,6 +265,8 @@ public:
> >
> > /* Return cgraph node corresponding to given name index. */
> > cgraph_node *get_cgraph_node (int);
> > +
> > + const string_vector& filenames () { return filenames_; }
> > private:
> > typedef std::map<const char *, unsigned, string_compare>
> string_index_map;
> > typedef std::map<const char *, auto_vec<unsigned>,
> string_compare>
> > @@ -294,7 +298,7 @@ public:
> > too. STACK is used to track the recursive creation process.
> */
> > static function_instance *
> > read_function_instance (function_instance_stack *stack,
> > - gcov_type head_count);
> > + gcov_type head_count, bool toplevel = true);
> >
> > /* Recursively deallocate all callsites (nested
> function_instances). */
> > ~function_instance ();
> > @@ -329,6 +333,12 @@ public:
> > return head_count_;
> > }
> >
> > + gcov_type
> > + timestamp () const
> > + {
> > + return timestamp_;
> > + }
> > +
> > /* Traverse callsites of the current function_instance to find
> one at the
> > location of LINENO and callee name represented in DECL.
> > LOCATION should match LINENO and is used to output
> diagnostics.
> > */ @@ -489,11 +499,13 @@ private:
> > /* Map from callsite to callee function_instance. */
> > typedef std::map<callsite, function_instance *> callsite_map;
> >
> > - function_instance (unsigned name, unsigned file_name, gcov_type
> > head_count)
> > + function_instance (unsigned name, unsigned file_name, gcov_type
> head_count,
> > + gcov_type timestamp)
> > : name_ (name), file_name_ (file_name), total_count_ (0),
> > - head_count_ (head_count), removed_icall_target_ (false),
> > - realized_ (false), in_worklist_ (false), inlined_to_ (NULL),
> > - location_ (UNKNOWN_LOCATION), call_location_
> (UNKNOWN_LOCATION)
> > + head_count_ (head_count), timestamp_ (timestamp),
> > + removed_icall_target_ (false), realized_ (false),
> in_worklist_ (false),
> > + inlined_to_ (NULL), location_ (UNKNOWN_LOCATION),
> > + call_location_ (UNKNOWN_LOCATION)
> > {
> > }
> >
> > @@ -512,6 +524,9 @@ private:
> > /* Entry BB's sample count. */
> > gcov_type head_count_;
> >
> > + /* perf timestamp associated with first execution of function.
> */
> > + gcov_type timestamp_;
> > +
> > /* Map from callsite location to callee function_instance. */
> > callsite_map callsites;
> >
> > @@ -559,7 +574,7 @@ public:
> > ~autofdo_source_profile ();
> >
> > /* For a given DECL, returns the top-level function_instance. */
> > - function_instance *get_function_instance_by_decl (tree decl)
> const;
> > + function_instance *get_function_instance_by_decl (tree decl,
> const
> > + char * = NULL) const;
> >
> > /* For a given name index (and original name), returns the top-
> level
> > * function_instance. */
> > @@ -632,6 +647,13 @@ static autofdo_source_profile
> > *afdo_source_profile;
> > /* gcov_summary structure to store the profile_info. */ static
> > gcov_summary *afdo_profile_info;
> >
> > +/* Map from timestamp -> <name, tp_first_run>.
> > +
> > +The purpose of this map is to map 64-bit timestamp values to
> (1..N),
> > +sorted by ascending order of timestamps and assign that to
> > +node->tp_first_run, since we don't need the full 64-bit range. */
> > +static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
> > +
> > /* Scaling factor for afdo data. Compared to normal profile
> > AFDO profile counts are much lower, depending on sampling
> > frequency. We scale data up to reudce effects of roundoff @@
> > -2477,7 +2499,7 @@
> autofdo_source_profile::offline_unrealized_inlines ()
> > delete f;
> > }
> > /* If function is not in symbol table, remove it. */
> > - else if (!f->realized_p ())
> > + else if (in_map && !f->realized_p ())
> > {
> > if (dump_file)
> > fprintf (dump_file, "Removing optimized out function
> %s\n",
> > @@ -2502,6 +2524,7 @@
> > autofdo_source_profile::offline_unrealized_inlines ()
> > /* function instance profile format:
> >
> > ENTRY_COUNT: 8 bytes
> > + TIMESTAMP: 8 bytes (only for toplevel symbols)
> > NAME_INDEX: 4 bytes
> > NUM_POS_COUNTS: 4 bytes
> > NUM_CALLSITES: 4 byte
> > @@ -2528,14 +2551,17 @@
> > autofdo_source_profile::offline_unrealized_inlines ()
> >
> > function_instance *
> > function_instance::read_function_instance (function_instance_stack
> *stack,
> > - gcov_type head_count)
> > + gcov_type head_count, bool
> > + toplevel)
> > {
> > + gcov_type_unsigned timestamp = 0;
> > + if (toplevel)
> > + timestamp = (gcov_type_unsigned) gcov_read_counter ();
> > unsigned name = gcov_read_unsigned ();
> > unsigned num_pos_counts = gcov_read_unsigned ();
> > unsigned num_callsites = gcov_read_unsigned ();
> > function_instance *s
> > = new function_instance (name, afdo_string_table-
> >get_filename_idx (name),
> > - head_count);
> > + head_count, timestamp);
> > if (!stack->is_empty ())
> > s->set_inlined_to (stack->last ());
> > stack->safe_push (s);
> > @@ -2563,7 +2589,7 @@ function_instance::read_function_instance
> (function_instance_stack *stack,
> > {
> > unsigned offset = gcov_read_unsigned ();
> > function_instance *callee_function_instance
> > - = read_function_instance (stack, -1);
> > + = read_function_instance (stack, -1, false);
> > s->callsites[std::make_pair (offset,
> callee_function_instance->name ())]
> > = callee_function_instance;
> > }
> > @@ -2585,9 +2611,11 @@
> autofdo_source_profile::~autofdo_source_profile
> > ()
> > /* For a given DECL, returns the top-level function_instance. */
> >
> > function_instance *
> > -autofdo_source_profile::get_function_instance_by_decl (tree decl)
> > const
> > +autofdo_source_profile::get_function_instance_by_decl (tree decl,
> > +const char *filename) const
> > {
> > - const char *filename = lbasename (DECL_SOURCE_FILE (decl));
> > + if (!filename)
> > + filename = lbasename (DECL_SOURCE_FILE (decl));
> > +
> > int index = afdo_string_table->get_index_by_decl (decl);
> > if (index == -1)
> > return NULL;
> > @@ -2829,7 +2857,21 @@ autofdo_source_profile::read ()
> > fatal_error (UNKNOWN_LOCATION,
> > "auto-profile contains duplicated function
> instance %s",
> > afdo_string_table->get_name (s->name ()));
> > +
> > + timestamp_info_map.insert({s->timestamp (),
> > + std::make_pair (s->name (), 0)});
> > }
> > +
> > + /* timestamp_info_map is std::map with timestamp as key,
> > + so it's already sorted in ascending order wrt timestamps.
> > + This loop maps function with lowest timestamp to 1, and so on.
> > + In afdo_annotate_cfg, node->tp_first_run is then set to
> corresponding
> > + tp_first_run value. */
> > +
> > + int tp_first_run = 1;
> > + for (auto &p : timestamp_info_map)
> > + p.second.second = tp_first_run++;
> > +
> > int hot_frac = param_hot_bb_count_fraction;
> > /* Scale up the profile, but leave some bits in case some counts
> gets
> > bigger than sum_max eventually. */ @@ -4079,6 +4121,17 @@
> > afdo_annotate_cfg (void)
> > = afdo_source_profile->get_function_instance_by_decl (
> > current_function_decl);
> >
> > + /* FIXME: This is a workaround for sourcefile tracking, if
> afdo_string_table
> > + ends up with empty filename or incorrect filename for the
> function and
> > + should be removed once issues with sourcefile tracking get
> > + fixed. */ if (s == NULL)
> > + for (unsigned i = 0; i < afdo_string_table->filenames ().length
> (); i++)
> > + {
> > + s = afdo_source_profile->get_function_instance_by_decl
> (current_function_decl, afdo_string_table->filenames()[i]);
> > + if (s)
> > + break;
> > + }
> > +
> > if (s == NULL)
> > {
> > if (dump_file)
> > @@ -4087,7 +4140,8 @@ afdo_annotate_cfg (void)
> > /* create_gcov only dumps symbols with some samples in them.
> > This means that we get nonempty zero_bbs only if some
> > nonzero counts in profile were not matched with statements.
> */
> > - if (!flag_profile_partial_training)
> > + if (!flag_profile_partial_training
> > + && !flag_auto_profile_reorder_only)
> > {
> > FOR_ALL_BB_FN (bb, cfun)
> > if (bb->count.quality () == GUESSED_LOCAL) @@ -4097,6
> > +4151,20 @@ afdo_annotate_cfg (void)
> > return;
> > }
> >
> > + if (auto it = timestamp_info_map.find (s->timestamp ());
> > + it != timestamp_info_map.end ())
> > + {
> > + cgraph_node *node = cgraph_node::get (current_function_decl);
> > + node->tp_first_run = it->second.second;
> > +
> > + if (dump_file)
> > + fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
> > + node->asm_name (), node->tp_first_run);
> > + }
> > +
> > + if (flag_auto_profile_reorder_only)
> > + return;
> > +
> > calculate_dominance_info (CDI_POST_DOMINATORS);
> > calculate_dominance_info (CDI_DOMINATORS);
> > loop_optimizer_init (0);
> > @@ -4496,3 +4564,182 @@ make_pass_ipa_auto_profile_offline
> > (gcc::context *ctxt) {
> > return new pass_ipa_auto_profile_offline (ctxt); }
> > +
> > +/* Map from name -> tp_first_run for clones. We need this for
> initiailizing tp_first_run
> > + of clones created after auto-profile pass. */
> > +
> > +static std::map<const char *, int, autofdo::string_compare>
> > +clone_timestamp_map;
>
> This probably should be symbol_summary, so it stays up to date with
> symbol table changes?
> > +
> > +/* Check if name is a clone function name. LTO privatized clones
> should already
> > + be handled with sourcefile tracking. */
> > +
> > +static bool
> > +clone_function_p (const char *name)
> > +{
> > + return strchr (name, '.') && !strstr (name, "lto_priv"); }
> > +
> > +/* Gather <name, tp_first_run> value from
> > +autofdo::timestamp_info_map. */
> > +
> > +static void
> > +adjust_tp_first_run_generate_summary (void) {
> > + for (const auto &p : autofdo::timestamp_info_map)
> > + {
> > + gcov_type timestamp = p.first;
> > + int name_id = p.second.first;
> > + int tp_first_run = p.second.second;
> > +
> > + if (const char *name = autofdo::afdo_string_table->get_name
> (name_id);
> > + name && clone_function_p (name))
> > + clone_timestamp_map.insert({xstrdup (name), tp_first_run});
> > + }
> > +}
> > +
> > +/* Stream out <name, tp_first_run>. */
> > +
> > +static void
> > +adjust_tp_first_run_write_summary (void) {
> > + struct output_block *ob
> > + = create_output_block (LTO_section_ipa_adjust_tp_first_run);
> > +
> > + streamer_write_uhwi (ob, clone_timestamp_map.size ()); for
> (const
> > + auto &p : clone_timestamp_map)
> > + {
> > + streamer_write_string (ob, ob->main_stream, p.first, true);
> > + streamer_write_hwi (ob, p.second);
> > + }
> > +
> > + produce_asm (ob);
> > + destroy_output_block (ob);
> > +}
> > +
> > +/* Stream in <name, tp_first_run> and put it in
> clone_timestamp_map.
> > +*/
> > +
> > +static void
> > +adjust_tp_first_run_read_summary (void) {
> > + struct lto_file_decl_data **file_data_vec =
> lto_get_file_decl_data
> > +();
> > + struct lto_file_decl_data *file_data;
> > + unsigned j = 0;
> > +
> > + while ((file_data = file_data_vec[j++]))
> > + {
> > + size_t len;
> > + const char *data
> > + = lto_get_summary_section_data (file_data,
> > +
> LTO_section_ipa_adjust_tp_first_run,
> > + &len);
> > + if (!data)
> > + continue;
> > +
> > + const struct lto_function_header *header
> > + = (const struct lto_function_header *) data;
> > + const int cfg_offset = sizeof (struct lto_function_header);
> > + const int main_offset = cfg_offset + header->cfg_size;
> > + const int string_offset = main_offset + header->main_size;
> > + struct data_in *data_in;
> > +
> > + lto_input_block ib ((const char *) data + main_offset,
> > + header->main_size, file_data);
> > +
> > + data_in
> > + = lto_data_in_create (file_data, (const char *) data +
> string_offset,
> > + header->string_size, vNULL);
> > +
> > + unsigned num_clones = streamer_read_uhwi (&ib);
> > + for (unsigned i = 0; i < num_clones; i++)
> > + {
> > + const char *fn_name = streamer_read_string (data_in, &ib);
> > + int tp_first_run = streamer_read_hwi (&ib);
> > + clone_timestamp_map.insert ({xstrdup (fn_name),
> tp_first_run});
> > + }
> > +
> > + lto_free_section_data (file_data,
> LTO_section_ipa_adjust_tp_first_run,
> > + NULL, data, len);
> > + lto_data_in_delete (data_in);
> > + }
> > +}
> > +
> > +/* A late IPA pass for handling clones.
> > + During auto-profile pass, for clones that aren't registered in
> call graph,
> > + stream <name, tp_first_run> from timestamp_info_map during LGEN.
> > + And during WPA, stream back the info, and assign the field
> value.
> > + The pass is placed sufficiently late (just before
> locality_cloning), that
> > + it should have clones registered in the call graph. */
> > +
> > +static unsigned
> > +adjust_tp_first_run ()
> > +{
> > + for (const auto &p : clone_timestamp_map)
> > + if (cgraph_node *node
> > + = cgraph_node::get_for_asmname (get_identifier (p.first)))
> > + {
> > + node->tp_first_run = p.second;
> > + if (dump_file)
> > + fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
> > + node->asm_name (), node->tp_first_run);
> > + }
> > +
> > + for (const auto &p : clone_timestamp_map)
> > + free ((void *) p.first);
> > + clone_timestamp_map.clear ();
> > +
> > + return 0;
> > +}
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_ipa_adjust_tp_first_run = {
> > + IPA_PASS, /* type */
> > + "adjust-tp_first_run", /* name */
> > + OPTGROUP_NONE, /* optinfo_flags */
> > + TV_CGRAPHOPT, /* tv_id */
> > + 0, /* properties_required */
> > + 0, /* properties_provided */
> > + 0, /* properties_destroyed */
> > + 0, /* todo_flags_start */
> > + 0, /* todo_flags_finish */
> > +};
> > +
> > +class pass_ipa_adjust_tp_first_run : public ipa_opt_pass_d {
> > +public:
> > + pass_ipa_adjust_tp_first_run (gcc::context *ctxt)
> > + : ipa_opt_pass_d (pass_data_ipa_adjust_tp_first_run, ctxt,
> > + adjust_tp_first_run_generate_summary, /*
> generate_summary */
> > + adjust_tp_first_run_write_summary, /*
> write_summary */
> > + adjust_tp_first_run_read_summary, /*
> read_summary */
> > + NULL, /* write_optimization_summary */
> > + NULL, /* read_optimization_summary */
> > + NULL, /* stmt_fixup */
> > + 0, /* function_transform_todo_flags_start */
> > + NULL, /* function_transform */
> > + NULL) /* variable_transform */
> > + {}
> > +
> > + /* opt_pass methods: */
> > + bool
> > + gate (function *) final override
> > + {
> > + return flag_auto_profile;
> > + }
> > +
> > + unsigned
> > + execute (function *) final override {
> > + return adjust_tp_first_run ();
> > + }
> > +
> > +}; // class pass_ipa_adjust_tp_first_run
> > +
> > +} // anon namespace
> > +
> > +ipa_opt_pass_d *
> > +make_pass_ipa_adjust_tp_first_run (gcc::context *ctxt) {
> > + return new pass_ipa_adjust_tp_first_run (ctxt); }
> > +
> > diff --git a/gcc/common.opt b/gcc/common.opt index
> > f6d93dc05fb..d7c028f87fe 100644
> > --- a/gcc/common.opt
> > +++ b/gcc/common.opt
> > @@ -1203,6 +1203,10 @@ fauto-profile-inlining Common
> > Var(flag_auto_profile_inlining) Init(1) Optimization Perform
> inlining
> > using auto-profile.
> >
> > +fauto-profile-reorder-only
> > +Common Var(flag_auto_profile_reorder_only) Init(0) Optimization
> > +Perform only function based reordering with fauto-profile.
> > +
> > ; -fcheck-bounds causes gcc to generate array bounds checks.
> > ; For C, C++ and ObjC: defaults off.
> > ; For Java: defaults to on.
> > diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-
> cloning.cc
> > index 2684046bd2d..1653ea7b961 100644
> > --- a/gcc/ipa-locality-cloning.cc
> > +++ b/gcc/ipa-locality-cloning.cc
> > @@ -994,6 +994,50 @@ locality_determine_static_order
> (auto_vec<locality_order *> *order)
> > return true;
> > }
> >
> > +/* Similar to lto-partition.cc:partition_by_timestamp.s */
> > +
> > +static locality_partition
> > +partition_by_timestamps (int64_t max_partition_size, int&
> > +npartitions) {
> > + vec<cgraph_node *> cnodes = vNULL;
> > + cgraph_node *node;
> > + FOR_EACH_DEFINED_FUNCTION (node)
> > + if (node->tp_first_run > 0
> > + && node->get_partitioning_class () == SYMBOL_PARTITION
> > + && ipa_size_summaries->get (node))
> > + cnodes.safe_push (node);
> > + cnodes.qsort (tp_first_run_node_cmp);
> > +
> > + unsigned n_partitioned_symbols = 0;
> > +
> > + locality_partition partition = NULL; unsigned i;
> FOR_EACH_VEC_ELT
> > + (cnodes, i, node)
> > + {
> > + if (node_partitioned_p (node))
> > + continue;
> > + ipa_size_summary *size_summary = ipa_size_summaries->get
> > + (node);
> > +
> > + if (partition == NULL
> > + || partition->insns + size_summary->size >
> max_partition_size)
> > + partition = create_partition (npartitions);
> > +
> > + add_node_to_partition (partition, node);
> > + n_partitioned_symbols++;
> > + }
> > +
> > + if (dump_file)
> > + {
> > + fprintf (dump_file,
> > + "partition_by_timestamps: Total partitions: %d\n",
> npartitions);
> > + fprintf (dump_file,
> > + "partition_by_timestamps: Number of partitioned
> symbols: %d\n",
> > + n_partitioned_symbols);
> > + }
> > +
> > + return partition;
> > +}
> > +
> > /* Partitioning for code locality.
> > 1. Create and sort callchains. If PGO is available, use real
> profile
> > counts. Otherwise, use a set of heuristics to sort the
> callchains.
> > @@ -1007,7 +1051,7 @@ locality_partition_and_clone (int
> max_locality_partition_size,
> > lto_locality_cloning_model
> cloning_model,
> > int freq_denominator, int size) {
> > - locality_partition partition;
> > + locality_partition partition = NULL;
> > int npartitions = 0;
> >
> > auto_vec<locality_order *> order;
> > @@ -1040,7 +1084,11 @@ locality_partition_and_clone (int
> max_locality_partition_size,
> > int64_t partition_size
> > = max_locality_partition_size
> > ? max_locality_partition_size : param_max_partition_size;
> > - partition = create_partition (npartitions);
> > +
> > + if (flag_auto_profile && flag_profile_reorder_functions)
> > + partition = partition_by_timestamps (partition_size,
> > + npartitions); if (!partition)
> > + partition = create_partition (npartitions);
> >
> > for (unsigned i = 0; i < order.length (); i++)
> > {
> > diff --git a/gcc/lto-section-in.cc b/gcc/lto-section-in.cc index
> > 1dd9520137a..2e2e30ea400 100644
> > --- a/gcc/lto-section-in.cc
> > +++ b/gcc/lto-section-in.cc
> > @@ -57,6 +57,7 @@ const char *lto_section_name[LTO_N_SECTION_TYPES]
> =
> > "ipa_sra",
> > "odr_types",
> > "ipa_modref",
> > + "ipa_adjust_tp_first_run",
> > };
> >
> > /* Hooks so that the ipa passes can call into the lto front end to
> > get diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h index
> > 4b7209e3d82..242579918b0 100644
> > --- a/gcc/lto-streamer.h
> > +++ b/gcc/lto-streamer.h
> > @@ -225,6 +225,7 @@ enum lto_section_type
> > LTO_section_ipa_sra,
> > LTO_section_odr_types,
> > LTO_section_ipa_modref,
> > + LTO_section_ipa_adjust_tp_first_run,
> > LTO_N_SECTION_TYPES /* Must be last. */
> > };
> >
> > diff --git a/gcc/lto/lto-partition.cc b/gcc/lto/lto-partition.cc
> index
> > c7e69ee3161..c7937107ad7 100644
> > --- a/gcc/lto/lto-partition.cc
> > +++ b/gcc/lto/lto-partition.cc
> > @@ -87,6 +87,15 @@ new_partition (const char *name)
> > return part;
> > }
> >
> > +/* Create and return the created partition of name NAME. */
> > +
> > +static ltrans_partition
> > +create_partition (int &npartitions, const char *name) {
> > + npartitions++;
> > + return new_partition (name);
> > +}
> > +
> > /* Free memory used by ltrans partition.
> > Encoder can be kept to be freed after streaming. */ static
> void
> > @@ -1017,6 +1026,50 @@ lto_cache_map (int n_lto_partitions, int
> max_partition_size)
> > partitioner.apply (partitions, n_lto_partitions); }
> >
> > +/* Place symbols that are profiled in as few partitions as
> possible.
> > +*/
> > +
> > +static ltrans_partition
> > +partition_by_timestamps (int max_partition_size, int &npartitions)
> {
> > + vec<cgraph_node *> cnodes = vNULL;
> > + cgraph_node *node;
> > + FOR_EACH_DEFINED_FUNCTION (node)
> > + if (node->tp_first_run > 0
> > + && node->get_partitioning_class () == SYMBOL_PARTITION
> > + && ipa_size_summaries->get (node))
> > + cnodes.safe_push (node);
> > + cnodes.qsort (tp_first_run_node_cmp);
> > +
> > + unsigned n_partitioned_symbols = 0;
> > +
> > + ltrans_partition partition = NULL;
> > + unsigned i;
> > + FOR_EACH_VEC_ELT (cnodes, i, node)
> > + {
> > + if (symbol_partitioned_p (node))
> > + continue;
> > +
> > + ipa_size_summary *size_summary = ipa_size_summaries->get
> (node);
> > + if (partition == NULL
> > + || partition->insns + size_summary->size >
> max_partition_size)
> > + partition = create_partition (npartitions, "");
> > +
> > + add_symbol_to_partition (partition, node);
> > + n_partitioned_symbols++;
> > + }
> > +
> > + if (dump_file)
> > + {
> > + fprintf (dump_file,
> > + "partition_by_timestamps: Total partitions: %d\n",
> npartitions);
> > + fprintf (dump_file,
> > + "partition_by_timestamps: Number of partitioned
> symbols: %d\n",
> > + n_partitioned_symbols);
> > + }
> > +
> > + return partition;
> > +}
> > +
> > /* Group cgraph nodes into equally-sized partitions.
> >
> > The partitioning algorithm is simple: nodes are taken in
> predefined order.
> > @@ -1072,7 +1125,7 @@ lto_balanced_map (int n_lto_partitions, int
> max_partition_size)
> > int64_t cost = 0, internal = 0;
> > unsigned int best_n_nodes = 0, best_i = 0;
> > int64_t best_cost = -1, best_internal = 0, best_size = 0;
> > - int npartitions;
> > + int npartitions = 0;
> > int current_order = -1;
> > int noreorder_pos = 0;
> >
> > @@ -1092,6 +1145,19 @@ lto_balanced_map (int n_lto_partitions, int
> > max_partition_size)
> >
> > original_total_size = total_size;
> >
> > + /* With -fauto-profile -fprofile-reorder-functions, place all
> symbols which
> > + are profiled together into as few partitions as possible. The
> rationale
> > + for doing this with AutoFDO is that the number of sampled
> functions is a
> > + fraction of total number of executed functions (unlike PGO
> where each
> > + executed function gets instrumented with time profile
> counter), and
> > + placing them together helps with code locality. */
> > +
> > + partition = NULL;
> > + if (flag_auto_profile
> > + && flag_profile_reorder_functions
> > + && noreorder.length () == 0)
> > + partition = partition_by_timestamps (max_partition_size,
> > + npartitions);
> > +
> > /* Streaming works best when the source units do not cross
> partition
> > boundaries much. This is because importing function from a
> source
> > unit tends to import a lot of global trees defined there. We
> > should @@ -1126,8 +1192,12 @@ lto_balanced_map (int
> n_lto_partitions, int max_partition_size)
> > partition_size = total_size / n_lto_partitions;
> > if (partition_size < param_min_partition_size)
> > partition_size = param_min_partition_size;
> > - npartitions = 1;
> > - partition = new_partition ("");
> > +
> > + if (npartitions == 0)
> > + {
> > + npartitions = 1;
> > + partition = new_partition ("");
> > + }
> > if (dump_file)
> > fprintf (dump_file, "Total unit size: %" PRId64 ", partition
> size: %" PRId64 "\n",
> > total_size, partition_size); @@ -1474,15 +1544,6 @@
> > add_node_references_to_partition (ltrans_partition partition,
> > symtab_node *node)
> >
> > }
> >
> > -/* Create and return the created partition of name NAME. */
> > -
> > -static ltrans_partition
> > -create_partition (int &npartitions, const char *name) -{
> > - npartitions++;
> > - return new_partition (name);
> > -}
> > -
> > /* Partitioning for code locality.
> > The partitioning plan (and prerequisite cloning) will have been
> done by the
> > IPA locality cloning pass. This function just implements that
> > plan by diff --git a/gcc/opts.cc b/gcc/opts.cc index
> > 10ce2c3de33..a5a9700dc08 100644
> > --- a/gcc/opts.cc
> > +++ b/gcc/opts.cc
> > @@ -3179,7 +3179,13 @@ common_handle_option (struct gcc_options
> *opts,
> > /* No break here - do -fauto-profile processing. */
> > /* FALLTHRU */
> > case OPT_fauto_profile:
> > - enable_fdo_optimizations (opts, opts_set, value, true);
> > + if (opts->x_flag_auto_profile_reorder_only)
> > + {
> > + opts->x_flag_profile_reorder_functions = true;
> > + opts->x_flag_auto_profile_inlining = false;
> > + }
> > + else
> > + enable_fdo_optimizations (opts, opts_set, value, true);
> > SET_OPTION_IF_UNSET (opts, opts_set, flag_profile_correction,
> value);
> > break;
> >
> > diff --git a/gcc/passes.def b/gcc/passes.def index
> > 3f828477b68..bd0e80dfa5f 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -171,6 +171,7 @@ along with GCC; see the file COPYING3. If not
> see
> > NEXT_PASS (pass_ipa_sra);
> > NEXT_PASS (pass_ipa_fn_summary);
> > NEXT_PASS (pass_ipa_inline);
> > + NEXT_PASS (pass_ipa_adjust_tp_first_run);
> > NEXT_PASS (pass_ipa_locality_cloning);
> > NEXT_PASS (pass_ipa_pure_const);
> > NEXT_PASS (pass_ipa_modref);
> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index
> > 61cec52c624..8544a6bc552 100644
> > --- a/gcc/tree-pass.h
> > +++ b/gcc/tree-pass.h
> > @@ -553,6 +553,7 @@ extern ipa_opt_pass_d *make_pass_ipa_cdtor_merge
> > (gcc::context *ctxt); extern ipa_opt_pass_d
> *make_pass_ipa_single_use
> > (gcc::context *ctxt); extern ipa_opt_pass_d *make_pass_ipa_comdats
> > (gcc::context *ctxt); extern ipa_opt_pass_d *make_pass_ipa_modref
> > (gcc::context *ctxt);
> > +extern ipa_opt_pass_d *make_pass_ipa_adjust_tp_first_run
> > +(gcc::context *ctxt);
> > extern ipa_opt_pass_d *make_pass_ipa_locality_cloning (gcc::context
> > *ctxt);
> >
> > extern gimple_opt_pass *make_pass_cleanup_cfg_post_optimizing
> > (gcc::context diff --git a/gcc/varasm.cc b/gcc/varasm.cc index
> > 0d78f5b384f..34673a970ea 100644
> > --- a/gcc/varasm.cc
> > +++ b/gcc/varasm.cc
> > @@ -610,6 +610,15 @@ default_function_section (tree decl, enum
> node_frequency freq,
> > if (!flag_reorder_functions
> > || !targetm_common.have_named_sections)
> > return NULL;
> > +
> > + /* With -fauto-profile -fprofile-reorder-functions, give higher
> priority
> > + to time profile based reordering, and ensure the reordering
> isn't split
> > + by hot/cold partitioning. */
> > + if (flag_auto_profile
> > + && flag_profile_reorder_functions
> > + && cgraph_node::get (decl)->tp_first_run > 0)
> > + return NULL;
> > +
> > /* Startup code should go to startup subsection unless it is
> > unlikely executed (this happens especially with function
> splitting
> > where we can split away unnecessary parts of static
> > constructors. */
Enable time profile function reordering with AutoFDO.
The patch enables time profile based reordering with AutoFDO with
-fauto-profile -fprofile-reorder-functions, by mapping timestamps obtained from
perf
into node->tp_first_run.
The rationale for doing this is:
(1) GCC already implements time-profile function reordering with PGO, the patch
enables
it with AutoFDO.
(2) While time profile ordering is primarly meant for optimizing startup time,
we've also observed good effects on code-locality for large internal workloads.
(3) Possibly useful for function reordering when accurate profile annotation is
hard with AutoFDO -- For eg, if branch samples are missing (due to absence of
LBR like structure).
On AutoFDO tools side, a corresponding patch extends gcov to emit 64-bit perf
timestamp that
records first execution of function, which loosely corresponds to PGO's
time_profile counter.
The timestamp is stored adjacent to head field in toplevel function info.
On GCC side, this patch makes the following changes:
(1) Changes to auto-profile pass:
The patch adds a new field timestamp to function_instance,
and populates it in read_function_instance.
It maintains a new timestamp_info_map from timestamp -> <name, tp_first_run>,
which maps timestamps sorted in ascending order to (1..N), so lowest ordered
timestamp is mapped to 1 and so on. The rationale for this is that
timestamps are 64-bit integers, and we don't need the full 64-bit range
for ordering by tp_first_run.
During annotation, the timestamp associated with function_instance is looked up
in timestamp_info_map, and corresponding mapped value is assigned
to node->tp_first_run.
Dhruv's sourcefile tracking patch already handles LTO privatized symbols.
The patch adds a workaround for mismatched/empty filenames, which should go away
when the issues with AutoFDO tools dwarf parsing are resolved.
(2) Grouping profiled functions together in as few partitions as possible:
The patch places profiled functions in time profile order together in as
few paritions as possible to get better advantage of code locality.
Unlike PGO, where every instrumented function gets a time profile,
with AutoFDO, the sampled functions are a fraction of the total executed ones.
Similarly, in default_function_section, it overrides hot/cold partitioning so
that grouping of profiled functions isn't disrupted.
(3) Param to disable profile driven opts.
The patch adds param auto-profile-reorder-only which only enables time-profile
reordering with
AutoFDO:
(a) Useful as a debugging aid to isolate regression to either function
reordering or profile driven opts.
(b) As a stopgap measure to avoid regressions with AutoFDO profile driven opts.
(c) Possibly useful for architectures which do not support branch sampling.
gcc/ChangeLog:
* auto-profile.cc: (string_table::filenames): New method.
(function_instance::timestamp_): New member.
(function_instance::timestamp): New accessor for timestamp_ member.
(autofdo::timestamp_info_map): New std::map.
(function_instance::function_instance): Add argument timestamp and set
it to member timestamp_.
(function_instance::read_function_instance): Adjust prototype and read
timestamp.
(autofdo_source_profile::get_function_instance_by_decl): New argument
filename with default value NULL.
(autofdo_source_profile::read): Populate timestamp_info_map.
(afdo_annotate_cfg): Assign node->tp_first_run based on
timestamp_info_map and bail out of annotation if
param auto-profile-reorder-only is enabled.
* params.opt: New param auto-profile-reorder-only.
* ipa-locality-cloning.cc (partition_by_timestamps): New function.
(locality_partition_and_clone): Call partition_by_timestamps.
* varasm.cc (default_function_section): Bail out if -fauto-profile
-fprofile-reorder-functions is enabled and node->tp_first_run > 0.
gcc/lto/ChangeLog:
* lto-partition.cc (create_partition): Move higher up in the file.
(partition_by_timestamps): New function.
(lto_balanced_map): Call partition_by_timestamps if -fauto-profile
-fprofile-reorder-functions is passed and noreroder is empty.
Signed-off-by: Prathamesh Kulkarni <[email protected]>
diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
index 1e50d8e6e71..cdb6dad06e5 100644
--- a/gcc/auto-profile.cc
+++ b/gcc/auto-profile.cc
@@ -281,6 +281,8 @@ public:
/* Return cgraph node corresponding to given name index. */
cgraph_node *get_cgraph_node (int);
+
+ const string_vector& filenames () { return filenames_; }
private:
typedef std::map<const char *, unsigned, string_compare> string_index_map;
typedef std::map<const char *, auto_vec<unsigned>, string_compare>
@@ -312,7 +314,7 @@ public:
too. STACK is used to track the recursive creation process. */
static function_instance *
read_function_instance (function_instance_stack *stack,
- gcov_type head_count);
+ gcov_type head_count, bool toplevel = true);
/* Recursively deallocate all callsites (nested function_instances). */
~function_instance ();
@@ -347,6 +349,12 @@ public:
return head_count_;
}
+ gcov_type
+ timestamp () const
+ {
+ return timestamp_;
+ }
+
/* Traverse callsites of the current function_instance to find one at the
location of LINENO and callee name represented in DECL.
LOCATION should match LINENO and is used to output diagnostics. */
@@ -507,11 +515,13 @@ private:
/* Map from callsite to callee function_instance. */
typedef std::map<callsite, function_instance *> callsite_map;
- function_instance (unsigned name, unsigned file_name, gcov_type head_count)
+ function_instance (unsigned name, unsigned file_name, gcov_type head_count,
+ gcov_type timestamp)
: name_ (name), file_name_ (file_name), total_count_ (0),
- head_count_ (head_count), removed_icall_target_ (false),
- realized_ (false), in_worklist_ (false), inlined_to_ (NULL),
- location_ (UNKNOWN_LOCATION), call_location_ (UNKNOWN_LOCATION)
+ head_count_ (head_count), timestamp_ (timestamp),
+ removed_icall_target_ (false), realized_ (false), in_worklist_ (false),
+ inlined_to_ (NULL), location_ (UNKNOWN_LOCATION),
+ call_location_ (UNKNOWN_LOCATION)
{
}
@@ -530,6 +540,9 @@ private:
/* Entry BB's sample count. */
gcov_type head_count_;
+ /* perf timestamp associated with first execution of function. */
+ gcov_type timestamp_;
+
/* Map from callsite location to callee function_instance. */
callsite_map callsites;
@@ -577,7 +590,7 @@ public:
~autofdo_source_profile ();
/* For a given DECL, returns the top-level function_instance. */
- function_instance *get_function_instance_by_decl (tree decl) const;
+ function_instance *get_function_instance_by_decl (tree decl, const char * =
NULL) const;
/* For a given name index (and original name), returns the top-level
* function_instance. */
@@ -650,6 +663,13 @@ static autofdo_source_profile *afdo_source_profile;
/* gcov_summary structure to store the profile_info. */
static gcov_summary *afdo_profile_info;
+/* Map from timestamp -> <name, tp_first_run>.
+
+The purpose of this map is to map 64-bit timestamp values to (1..N),
+sorted by ascending order of timestamps and assign that to node->tp_first_run,
+since we don't need the full 64-bit range. */
+static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
+
/* Scaling factor for afdo data. Compared to normal profile
AFDO profile counts are much lower, depending on sampling
frequency. We scale data up to reudce effects of roundoff
@@ -2523,6 +2543,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
/* function instance profile format:
ENTRY_COUNT: 8 bytes
+ TIMESTAMP: 8 bytes (only for toplevel symbols)
NAME_INDEX: 4 bytes
NUM_POS_COUNTS: 4 bytes
NUM_CALLSITES: 4 byte
@@ -2549,14 +2570,17 @@ autofdo_source_profile::offline_unrealized_inlines ()
function_instance *
function_instance::read_function_instance (function_instance_stack *stack,
- gcov_type head_count)
+ gcov_type head_count, bool toplevel)
{
+ gcov_type_unsigned timestamp = 0;
+ if (toplevel)
+ timestamp = (gcov_type_unsigned) gcov_read_counter ();
unsigned name = gcov_read_unsigned ();
unsigned num_pos_counts = gcov_read_unsigned ();
unsigned num_callsites = gcov_read_unsigned ();
function_instance *s
= new function_instance (name, afdo_string_table->get_filename_idx (name),
- head_count);
+ head_count, timestamp);
if (!stack->is_empty ())
s->set_inlined_to (stack->last ());
stack->safe_push (s);
@@ -2584,7 +2608,7 @@ function_instance::read_function_instance
(function_instance_stack *stack,
{
unsigned offset = gcov_read_unsigned ();
function_instance *callee_function_instance
- = read_function_instance (stack, -1);
+ = read_function_instance (stack, -1, false);
s->callsites[std::make_pair (offset, callee_function_instance->name ())]
= callee_function_instance;
}
@@ -2606,9 +2630,11 @@ autofdo_source_profile::~autofdo_source_profile ()
/* For a given DECL, returns the top-level function_instance. */
function_instance *
-autofdo_source_profile::get_function_instance_by_decl (tree decl) const
+autofdo_source_profile::get_function_instance_by_decl (tree decl, const char
*filename) const
{
- const char *filename = lbasename (DECL_SOURCE_FILE (decl));
+ if (!filename)
+ filename = lbasename (DECL_SOURCE_FILE (decl));
+
int index = afdo_string_table->get_index_by_decl (decl);
if (index == -1)
return NULL;
@@ -2851,7 +2877,21 @@ autofdo_source_profile::read ()
fatal_error (UNKNOWN_LOCATION,
"auto-profile contains duplicated function instance %s",
afdo_string_table->get_name (s->name ()));
+
+ timestamp_info_map.insert({s->timestamp (),
+ std::make_pair (s->name (), 0)});
}
+
+ /* timestamp_info_map is std::map with timestamp as key,
+ so it's already sorted in ascending order wrt timestamps.
+ This loop maps function with lowest timestamp to 1, and so on.
+ In afdo_annotate_cfg, node->tp_first_run is then set to corresponding
+ tp_first_run value. */
+
+ int tp_first_run = 1;
+ for (auto &p : timestamp_info_map)
+ p.second.second = tp_first_run++;
+
int hot_frac = param_hot_bb_count_fraction;
/* Scale up the profile, but leave some bits in case some counts gets
bigger than sum_max eventually. */
@@ -4214,6 +4254,17 @@ afdo_annotate_cfg (void)
= afdo_source_profile->get_function_instance_by_decl (
current_function_decl);
+ /* FIXME: This is a workaround for sourcefile tracking, if afdo_string_table
+ ends up with empty filename or incorrect filename for the function and
+ should be removed once issues with sourcefile tracking get fixed. */
+ if (s == NULL)
+ for (unsigned i = 0; i < afdo_string_table->filenames ().length (); i++)
+ {
+ s = afdo_source_profile->get_function_instance_by_decl
(current_function_decl, afdo_string_table->filenames()[i]);
+ if (s)
+ break;
+ }
+
if (s == NULL)
{
if (dump_file)
@@ -4222,7 +4273,8 @@ afdo_annotate_cfg (void)
/* create_gcov only dumps symbols with some samples in them.
This means that we get nonempty zero_bbs only if some
nonzero counts in profile were not matched with statements. */
- if (!flag_profile_partial_training)
+ if (!flag_profile_partial_training
+ && !param_auto_profile_reorder_only)
{
FOR_ALL_BB_FN (bb, cfun)
if (bb->count.quality () == GUESSED_LOCAL)
@@ -4232,6 +4284,20 @@ afdo_annotate_cfg (void)
return;
}
+ if (auto it = timestamp_info_map.find (s->timestamp ());
+ it != timestamp_info_map.end ())
+ {
+ cgraph_node *node = cgraph_node::get (current_function_decl);
+ node->tp_first_run = it->second.second;
+
+ if (dump_file)
+ fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
+ node->asm_name (), node->tp_first_run);
+ }
+
+ if (param_auto_profile_reorder_only)
+ return;
+
calculate_dominance_info (CDI_POST_DOMINATORS);
calculate_dominance_info (CDI_DOMINATORS);
loop_optimizer_init (0);
diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-cloning.cc
index 2684046bd2d..1653ea7b961 100644
--- a/gcc/ipa-locality-cloning.cc
+++ b/gcc/ipa-locality-cloning.cc
@@ -994,6 +994,50 @@ locality_determine_static_order (auto_vec<locality_order
*> *order)
return true;
}
+/* Similar to lto-partition.cc:partition_by_timestamp.s */
+
+static locality_partition
+partition_by_timestamps (int64_t max_partition_size, int& npartitions)
+{
+ vec<cgraph_node *> cnodes = vNULL;
+ cgraph_node *node;
+ FOR_EACH_DEFINED_FUNCTION (node)
+ if (node->tp_first_run > 0
+ && node->get_partitioning_class () == SYMBOL_PARTITION
+ && ipa_size_summaries->get (node))
+ cnodes.safe_push (node);
+ cnodes.qsort (tp_first_run_node_cmp);
+
+ unsigned n_partitioned_symbols = 0;
+
+ locality_partition partition = NULL;
+ unsigned i;
+ FOR_EACH_VEC_ELT (cnodes, i, node)
+ {
+ if (node_partitioned_p (node))
+ continue;
+ ipa_size_summary *size_summary = ipa_size_summaries->get (node);
+
+ if (partition == NULL
+ || partition->insns + size_summary->size > max_partition_size)
+ partition = create_partition (npartitions);
+
+ add_node_to_partition (partition, node);
+ n_partitioned_symbols++;
+ }
+
+ if (dump_file)
+ {
+ fprintf (dump_file,
+ "partition_by_timestamps: Total partitions: %d\n", npartitions);
+ fprintf (dump_file,
+ "partition_by_timestamps: Number of partitioned symbols: %d\n",
+ n_partitioned_symbols);
+ }
+
+ return partition;
+}
+
/* Partitioning for code locality.
1. Create and sort callchains. If PGO is available, use real profile
counts. Otherwise, use a set of heuristics to sort the callchains.
@@ -1007,7 +1051,7 @@ locality_partition_and_clone (int
max_locality_partition_size,
lto_locality_cloning_model cloning_model,
int freq_denominator, int size)
{
- locality_partition partition;
+ locality_partition partition = NULL;
int npartitions = 0;
auto_vec<locality_order *> order;
@@ -1040,7 +1084,11 @@ locality_partition_and_clone (int
max_locality_partition_size,
int64_t partition_size
= max_locality_partition_size
? max_locality_partition_size : param_max_partition_size;
- partition = create_partition (npartitions);
+
+ if (flag_auto_profile && flag_profile_reorder_functions)
+ partition = partition_by_timestamps (partition_size, npartitions);
+ if (!partition)
+ partition = create_partition (npartitions);
for (unsigned i = 0; i < order.length (); i++)
{
diff --git a/gcc/lto/lto-partition.cc b/gcc/lto/lto-partition.cc
index 8c6ec6f8338..eb57c357577 100644
--- a/gcc/lto/lto-partition.cc
+++ b/gcc/lto/lto-partition.cc
@@ -97,6 +97,15 @@ create_partition_if_empty ()
new_partition ("empty");
}
+/* Create and return the created partition of name NAME. */
+
+static ltrans_partition
+create_partition (int &npartitions, const char *name)
+{
+ npartitions++;
+ return new_partition (name);
+}
+
/* Free memory used by ltrans partition.
Encoder can be kept to be freed after streaming. */
static void
@@ -1059,6 +1068,50 @@ lto_cache_map (int n_lto_partitions, int
max_partition_size)
partitioner.apply (partitions, n_lto_partitions);
}
+/* Place symbols that are profiled in as few partitions as possible. */
+
+static ltrans_partition
+partition_by_timestamps (int max_partition_size, int &npartitions)
+{
+ vec<cgraph_node *> cnodes = vNULL;
+ cgraph_node *node;
+ FOR_EACH_DEFINED_FUNCTION (node)
+ if (node->tp_first_run > 0
+ && node->get_partitioning_class () == SYMBOL_PARTITION
+ && ipa_size_summaries->get (node))
+ cnodes.safe_push (node);
+ cnodes.qsort (tp_first_run_node_cmp);
+
+ unsigned n_partitioned_symbols = 0;
+
+ ltrans_partition partition = NULL;
+ unsigned i;
+ FOR_EACH_VEC_ELT (cnodes, i, node)
+ {
+ if (symbol_partitioned_p (node))
+ continue;
+
+ ipa_size_summary *size_summary = ipa_size_summaries->get (node);
+ if (partition == NULL
+ || partition->insns + size_summary->size > max_partition_size)
+ partition = create_partition (npartitions, "");
+
+ add_symbol_to_partition (partition, node);
+ n_partitioned_symbols++;
+ }
+
+ if (dump_file)
+ {
+ fprintf (dump_file,
+ "partition_by_timestamps: Total partitions: %d\n", npartitions);
+ fprintf (dump_file,
+ "partition_by_timestamps: Number of partitioned symbols: %d\n",
+ n_partitioned_symbols);
+ }
+
+ return partition;
+}
+
/* Group cgraph nodes into equally-sized partitions.
The partitioning algorithm is simple: nodes are taken in predefined order.
@@ -1114,7 +1167,7 @@ lto_balanced_map (int n_lto_partitions, int
max_partition_size)
int64_t cost = 0, internal = 0;
unsigned int best_n_nodes = 0, best_i = 0;
int64_t best_cost = -1, best_internal = 0, best_size = 0;
- int npartitions;
+ int npartitions = 0;
int current_order = -1;
int noreorder_pos = 0;
@@ -1134,6 +1187,19 @@ lto_balanced_map (int n_lto_partitions, int
max_partition_size)
original_total_size = total_size;
+ /* With -fauto-profile -fprofile-reorder-functions, place all symbols which
+ are profiled together into as few partitions as possible. The rationale
+ for doing this with AutoFDO is that the number of sampled functions is a
+ fraction of total number of executed functions (unlike PGO where each
+ executed function gets instrumented with time profile counter), and
+ placing them together helps with code locality. */
+
+ partition = NULL;
+ if (flag_auto_profile
+ && flag_profile_reorder_functions
+ && noreorder.length () == 0)
+ partition = partition_by_timestamps (max_partition_size, npartitions);
+
/* Streaming works best when the source units do not cross partition
boundaries much. This is because importing function from a source
unit tends to import a lot of global trees defined there. We should
@@ -1168,8 +1234,12 @@ lto_balanced_map (int n_lto_partitions, int
max_partition_size)
partition_size = total_size / n_lto_partitions;
if (partition_size < param_min_partition_size)
partition_size = param_min_partition_size;
- npartitions = 1;
- partition = new_partition ("");
+
+ if (npartitions == 0)
+ {
+ npartitions = 1;
+ partition = new_partition ("");
+ }
if (dump_file)
fprintf (dump_file, "Total unit size: %" PRId64 ", partition size: %"
PRId64 "\n",
total_size, partition_size);
@@ -1512,15 +1582,6 @@ add_node_references_to_partition (ltrans_partition
partition, symtab_node *node)
}
-/* Create and return the created partition of name NAME. */
-
-static ltrans_partition
-create_partition (int &npartitions, const char *name)
-{
- npartitions++;
- return new_partition (name);
-}
-
/* Partitioning for code locality.
The partitioning plan (and prerequisite cloning) will have been done by the
IPA locality cloning pass. This function just implements that plan by
diff --git a/gcc/params.opt b/gcc/params.opt
index ad5ee887367..936c5a2a116 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -70,6 +70,10 @@ Enable asan detection of use-after-return bugs.
Common Joined UInteger Var(param_auto_profile_bbs) Init(1) IntegerRange(0, 1)
Param Optimization
Build basic block profile using auto profile.
+-param=auto-profile-reorder-only=
+Common Joined UInteger Var(param_auto_profile_reorder_only) Init(0)
IntegerRange(0, 1) Param Optimization
+Eanble only function reordering with auto-profile.
+
-param=cycle-accurate-model=
Common Joined UInteger Var(param_cycle_accurate_model) Init(1) IntegerRange(0,
1) Param Optimization
Whether the scheduling description is mostly a cycle-accurate model of the
target processor and is likely to be spill aggressively to fill any pipeline
bubbles.
diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index 0d78f5b384f..34673a970ea 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -610,6 +610,15 @@ default_function_section (tree decl, enum node_frequency
freq,
if (!flag_reorder_functions
|| !targetm_common.have_named_sections)
return NULL;
+
+ /* With -fauto-profile -fprofile-reorder-functions, give higher priority
+ to time profile based reordering, and ensure the reordering isn't split
+ by hot/cold partitioning. */
+ if (flag_auto_profile
+ && flag_profile_reorder_functions
+ && cgraph_node::get (decl)->tp_first_run > 0)
+ return NULL;
+
/* Startup code should go to startup subsection unless it is
unlikely executed (this happens especially with function splitting
where we can split away unnecessary parts of static constructors. */