> On 26 Mar 2025, at 08:42, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>
> Ping.

Ping.
https://gcc.gnu.org/pipermail/gcc-patches/2025-March/676958.html

I’ve run a profiled LTO bootstrap of GCC with the new bootstrap-lto-locality
bootstrap config and compared it against a GCC produced by the existing
lto-bootstrap.
On an AArch64 Grace system I see about 0.5% faster compilation of
insn-recog-4.ii taken from the bootstrap artifacts.

Given we’re proposing this as an off-by-default optimization (as it is
incompatible with explicit -flto-partition and currently requires PGO data to
do a decent job), is it okay to get it into GCC 15?
We have further work planned to make it work better with AutoFDO
(non-instrumented perf) data in the GCC 16 timeframe, but what we have now is
useful for our users, and it would make things much easier to have this in
GCC 15 as a reference.

Thanks,
Kyrill

>
> Thanks,
> Kyrill
>
>> On 6 Mar 2025, at 09:25, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>
>> Hi all,
>>
>> Implement partitioning and cloning in the callgraph to help locality.
>> A new -fipa-reorder-for-locality flag is used to enable this.
>> The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc.
>> The optimization has two components:
>> * Partitioning the callgraph so as to group callers and callees that
>>   frequently call each other in the same partition
>> * Cloning functions that straddle multiple callchains and allowing each
>>   clone to be local to the partition of its callchain.
>>
>> The pass creates a partitioning plan and does the prerequisite cloning.
>> The partitioning is then implemented during the existing LTO partitioning
>> pass.
>>
>> To guide these locality heuristics we use PGO data.
>> In the absence of PGO data we use a static heuristic that uses the
>> accumulated estimated edge frequencies of the callees for each function to
>> guide the reordering.
>> We are investigating some more elaborate static heuristics, in particular
>> using the demangled C++ names to group template instantiations together.
>> This is promising but we are working out some kinks in the implementation
>> currently and want to send that out as a follow-up once we're more confident
>> in it.
>>
>> A new bootstrap-lto-locality bootstrap config is added that allows us to test
>> this on GCC itself with either static or PGO heuristics.
>> GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).
>>
>> As this new pass enables a new partitioning scheme it is incompatible with
>> explicit -flto-partition= options, so an error is introduced when the user
>> uses both flags explicitly.
>>
>> With this optimization we are seeing good performance gains on some large
>> internal workloads that stress the parts of the processor that are sensitive
>> to code locality, but we'd appreciate wider performance evaluation.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for mainline?
>> Thanks,
>> Kyrill
>>
>> Signed-off-by: Prachi Godbole <pgodb...@nvidia.com>
>> Co-authored-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> config/ChangeLog:
>>
>> * bootstrap-lto-locality.mk: New file.
>>
>> gcc/ChangeLog:
>>
>> * Makefile.in (OBJS): Add ipa-locality-cloning.o.
>> * cgraph.h (set_new_clone_decl_and_node_flags): Declare prototype.
>> * cgraphclones.cc (set_new_clone_decl_and_node_flags): Remove static
>>   qualifier.
>> * common.opt (fipa-reorder-for-locality): New flag.
>>   (LTO_PARTITION_DEFAULT): Declare.
>>   (flto-partition): Change default to LTO_PARTITION_DEFAULT.
>> * doc/invoke.texi: Document -fipa-reorder-for-locality.
>> * flag-types.h (enum lto_locality_cloning_model): Declare.
>>   (lto_partitioning_model): Add LTO_PARTITION_DEFAULT.
>> * lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping of
>>   node and index.
>> * opts.cc (validate_ipa_reorder_locality_lto_partition): Define.
>>   (finish_options): Handle LTO_PARTITION_DEFAULT.
>> * params.opt (lto_locality_cloning_model): New enum.
>>   (lto-partition-locality-cloning): New param.
>>   (lto-partition-locality-frequency-cutoff): Likewise.
>>   (lto-partition-locality-size-cutoff): Likewise.
>>   (lto-max-locality-partition): Likewise.
>> * passes.def: Register pass_ipa_locality_cloning.
>> * timevar.def (TV_IPA_LC): New timevar.
>> * tree-pass.h (make_pass_ipa_locality_cloning): Declare.
>> * ipa-locality-cloning.cc: New file.
>> * ipa-locality-cloning.h: New file.
>>
>> gcc/lto/ChangeLog:
>>
>> * lto-partition.cc (add_node_references_to_partition): Define.
>>   (create_partition): Likewise.
>>   (lto_locality_map): Likewise.
>>   (lto_promote_cross_file_statics): Add extra dumping.
>> * lto-partition.h (lto_locality_map): Declare prototype.
>> * lto.cc (do_whole_program_analysis): Handle
>>   flag_ipa_reorder_for_locality.
>>
>> <0001-Locality-cloning-pass-was-Introduce-flto-partition-l.patch>
>
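
For readers new to the thread, the partitioning component described in the
quoted cover letter (grouping callers and callees that frequently call each
other) can be illustrated with a small standalone sketch. This is not the code
from ipa-locality-cloning.cc and it does not model the cloning step; it is a
generic greedy clustering over call-edge counts, and CallEdge, Clusters,
partition_for_locality and max_partition_size are hypothetical names chosen
only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical call edge: caller and callee are function indices, count is
// a profiled (or statically estimated) call frequency.
struct CallEdge {
  int caller;
  int callee;
  std::uint64_t count;
};

// Union-find over function indices; each set is one candidate partition.
class Clusters {
public:
  explicit Clusters(const std::vector<int> &sizes)
      : parent_(sizes.size()), size_(sizes) {
    std::iota(parent_.begin(), parent_.end(), 0);
  }

  int find(int f) {
    while (parent_[f] != f) {
      parent_[f] = parent_[parent_[f]];  // path halving
      f = parent_[f];
    }
    return f;
  }

  // Merge the clusters of a and b unless the result would exceed max_size.
  bool merge(int a, int b, int max_size) {
    a = find(a);
    b = find(b);
    if (a == b || size_[a] + size_[b] > max_size)
      return false;
    parent_[b] = a;
    size_[a] += size_[b];
    return true;
  }

private:
  std::vector<int> parent_;
  std::vector<int> size_;
};

// Greedily place the hottest caller/callee pairs in the same partition.
// Returns, for each function, the representative index of its partition.
std::vector<int> partition_for_locality(const std::vector<int> &func_sizes,
                                        std::vector<CallEdge> edges,
                                        int max_partition_size) {
  assert(max_partition_size > 0);
  // Visit edges from hottest to coldest.
  std::stable_sort(edges.begin(), edges.end(),
                   [](const CallEdge &x, const CallEdge &y) {
                     return x.count > y.count;
                   });
  Clusters clusters(func_sizes);
  for (const CallEdge &e : edges)
    clusters.merge(e.caller, e.callee, max_partition_size);

  std::vector<int> result(func_sizes.size());
  for (int f = 0; f < static_cast<int>(func_sizes.size()); ++f)
    result[f] = clusters.find(f);
  return result;
}
```

Given four functions of equal size, hot edges 0->1 and 2->3, a cold edge
1->2, and a size cap that fits two functions, the sketch keeps {0, 1} and
{2, 3} in separate partitions; raising the cap lets the cold edge merge them.
The actual pass additionally clones functions shared between callchains,
which this sketch leaves out.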