Ping.
Thanks,
Kyrill
> On 28 Nov 2024, at 11:22, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>
> Ping.
>
>> On 15 Nov 2024, at 17:04, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>
>> Hi all,
>>
>> This is a patch submission following up on the RFC at:
>> https://gcc.gnu.org/pipermail/gcc/2024-November/245076.html
>> The patch has been rebased and retested against current trunk, with some
>> debugging code removed, comments improved, and some fixes added as we've
>> done more testing.
>>
>> ------------------------>8-----------------------------------------------------
>> Implement partitioning and cloning in the callgraph to help locality.
>> A new -flto-partition=locality flag is used to enable this.
>> The optimization has two components:
>> * Partitioning the callgraph so as to group callers and callees that
>>   frequently call each other in the same partition.
>> * Cloning functions that straddle multiple callchains and allowing each clone
>>   to be local to the partition of its callchain.
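
To make the partitioning idea concrete, here is a minimal sketch of one way to group callers and callees that call each other frequently: merge clusters greedily, hottest call edge first, under a size cap. This is not the algorithm in ipa-locality-cloning.cc; the function name, the input format and the size cap are all assumptions made for illustration, and the cloning step is not modelled.

```python
def partition_by_call_frequency(sizes, edges, max_size):
    """Greedily merge caller/callee clusters, hottest call edge first.

    sizes:    dict mapping function name -> code size (arbitrary units)
    edges:    dict mapping (caller, callee) -> call count
    max_size: upper bound on the total size of one partition (assumed knob)
    """
    cluster_of = {f: f for f in sizes}      # function -> cluster id
    members = {f: [f] for f in sizes}       # cluster id -> member functions
    total = dict(sizes)                     # cluster id -> accumulated size

    # Visit edges from hottest to coldest; ties are broken by name so the
    # result is deterministic.
    hottest_first = sorted(edges.items(), key=lambda kv: (-kv[1], kv[0]))
    for (caller, callee), _count in hottest_first:
        a, b = cluster_of[caller], cluster_of[callee]
        if a == b or total[a] + total[b] > max_size:
            continue                        # already together, or too large
        for f in members[b]:                # merge cluster b into cluster a
            cluster_of[f] = a
        members[a].extend(members.pop(b))
        total[a] += total.pop(b)

    return sorted(sorted(m) for m in members.values())


# Hypothetical call graph: main -> parse -> eval is hot, calls to log are cold.
sizes = {"main": 1, "parse": 1, "eval": 1, "log": 1}
edges = {
    ("main", "parse"): 100,
    ("parse", "eval"): 90,
    ("eval", "log"): 2,
    ("main", "log"): 1,
}
partitions = partition_by_call_frequency(sizes, edges, max_size=3)
```

With these made-up numbers the hot chain ends up in one partition and the cold logging function in another. A function that is hot in several such chains is the case where the cloning described above would come in.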
>>
>> The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc.
>> It creates a partitioning plan and does the prerequisite cloning.
>> The partitioning is then implemented during the existing LTO partitioning
>> pass.
>>
>> To guide these locality heuristics we use PGO data.
>> In the absence of PGO data we use a static heuristic that uses the
>> accumulated estimated edge frequencies of the callees for each function to
>> guide the reordering.
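
One possible reading of the static heuristic, sketched under assumptions: for each function, sum the estimated frequencies of its outgoing call edges and order functions by that sum. The function name and the input format are hypothetical, and the pass itself may accumulate and use these frequencies differently.

```python
from collections import defaultdict


def order_by_callee_frequency(edge_freq):
    """Order functions by the accumulated estimated frequency of their callees.

    edge_freq: dict mapping (caller, callee) -> estimated edge frequency
    Returns function names, highest accumulated frequency first.
    """
    acc = defaultdict(float)
    for (caller, callee), freq in edge_freq.items():
        acc[caller] += freq          # accumulate over outgoing call edges
        acc.setdefault(callee, 0.0)  # leaf functions still get an entry
    # Break ties by name to keep the order deterministic.
    return sorted(acc, key=lambda f: (-acc[f], f))


# Hypothetical estimated frequencies; the parse -> eval call sits in a loop.
edge_freq = {
    ("main", "parse"): 1.0,
    ("main", "eval"): 0.5,
    ("parse", "eval"): 8.0,
    ("eval", "log"): 0.1,
}
order = order_by_callee_frequency(edge_freq)
```

Here parse comes first because its single call edge is estimated to run far more often than anything main calls directly.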
>> We are investigating some more elaborate static heuristics, in particular
>> using the demangled C++ names to group template instantiations together.
>> This is promising, but we are currently working out some kinks in the
>> implementation and want to send that out as a follow-up once we're more
>> confident in it.
>>
>> A new bootstrap-lto-locality bootstrap config is added that allows us to test
>> this on GCC itself with either static or PGO heuristics.
>> GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).
>>
>> With this optimization we are seeing good performance gains on some large
>> internal workloads that stress the parts of the processor that are sensitive
>> to code locality, but we'd appreciate wider performance evaluation.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for mainline?
>> Thanks,
>> Kyrill
>>
>> Signed-off-by: Prachi Godbole <pgodb...@nvidia.com>
>> Co-authored-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> config/ChangeLog:
>> * bootstrap-lto-locality.mk: New file.
>>
>> gcc/ChangeLog:
>> * Makefile.in (OBJS): Add ipa-locality-cloning.o.
>> (GTFILES): Add ipa-locality-cloning.cc dependency.
>> * common.opt (lto_partition_model): Add locality value.
>> * flag-types.h (lto_partition_model): Add LTO_PARTITION_LOCALITY
>> value.
>> (enum lto_locality_cloning_model): Define.
>> * lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping
>> of node and index.
>> * params.opt (lto_locality_cloning_model): New enum.
>> (lto-partition-locality-cloning): New param.
>> (lto-partition-locality-frequency-cutoff): Likewise.
>> (lto-partition-locality-size-cutoff): Likewise.
>> (lto-max-locality-partition): Likewise.
>> * passes.def: Add pass_ipa_locality_cloning.
>> * timevar.def (TV_IPA_LC): New timevar.
>> * tree-pass.h (make_pass_ipa_locality_cloning): Declare.
>> * ipa-locality-cloning.cc: New file.
>> * ipa-locality-cloning.h: New file.
>>
>> gcc/lto/ChangeLog:
>> * lto-partition.cc: Include ipa-locality-cloning.h.
>> (add_node_references_to_partition): Define.
>> (create_partition): Likewise.
>> (lto_locality_map): Likewise.
>> (lto_promote_cross_file_statics): Add extra dumping.
>> * lto-partition.h (lto_locality_map): Declare.
>> * lto.cc (do_whole_program_analysis): Handle
>> LTO_PARTITION_LOCALITY.
>>
>> <0001-Introduce-flto-partition-locality.patch>
>