Ping.
Thanks,
Kyrill

> On 13 Dec 2024, at 16:47, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> 
> Ping.
> Thanks,
> Kyrill
> 
>> On 28 Nov 2024, at 11:22, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>> 
>> Ping.
>> 
>>> On 15 Nov 2024, at 17:04, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> This is a patch submission following up on the RFC at:
>>> https://gcc.gnu.org/pipermail/gcc/2024-November/245076.html
>>> The patch is rebased and retested against current trunk, with some debugging
>>> code removed, comments improved and some fixes added as we've done more
>>> testing.
>>> 
>>> ------------------------>8-----------------------------------------------------
>>> Implement partitioning and cloning in the callgraph to help locality.
>>> A new -flto-partition=locality flag is used to enable this.
>>> The optimization has two components:
>>> * Partitioning the callgraph so as to group callers and callees that
>>>   frequently call each other in the same partition.
>>> * Cloning functions that straddle multiple callchains and allowing each
>>>   clone to be local to the partition of its callchain.
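As an illustration only (this is not code from the patch), the first component can be sketched as a greedy merge over call edges, hottest edge first. The function name, the input dictionaries and the `max_size` bound are all hypothetical, and the real pass works on GCC's callgraph rather than on plain dictionaries.

```python
from collections import defaultdict


def partition_by_call_frequency(edges, sizes, max_size):
    """Greedily merge caller and callee partitions, hottest edge first.

    edges: {(caller, callee): call_count}   (hypothetical profile data)
    sizes: {function: size_estimate}        (hypothetical size units)
    max_size: assumed upper bound on the total size of one partition
    """
    parent = {f: f for f in sizes}   # union-find forest, one set per partition
    part_size = dict(sizes)          # total size, valid at each set's root

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    # Visit edges by descending call count; ties are broken by name.
    for (caller, callee), _count in sorted(
            edges.items(), key=lambda kv: (-kv[1], kv[0])):
        a, b = find(caller), find(callee)
        if a != b and part_size[a] + part_size[b] <= max_size:
            parent[b] = a
            part_size[a] += part_size[b]

    groups = defaultdict(list)
    for f in sizes:
        groups[find(f)].append(f)
    return sorted(sorted(g) for g in groups.values())


edges = {("main", "parse"): 900, ("parse", "lex"): 800,
         ("main", "report"): 5, ("report", "fmt"): 700}
sizes = {"main": 10, "parse": 20, "lex": 15, "report": 10, "fmt": 5}
partitions = partition_by_call_frequency(edges, sizes, max_size=50)
print(partitions)
```

In this toy input the rarely taken `main -> report` edge is the only one that crosses partitions, because merging across it would exceed the assumed size bound.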
>>> 
>>> The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc.
>>> It creates a partitioning plan and does the prerequisite cloning.
>>> The partitioning is then implemented during the existing LTO partitioning
>>> pass.
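A rough sketch of what a cloning plan could look like, assuming a partition assignment already exists. Everything here is hypothetical: the function name, the clone naming scheme, and the `min_count` and `max_clone_size` thresholds are illustrative assumptions and are not taken from the patch or its params.

```python
def plan_clones(edges, partition_of, sizes, min_count, max_clone_size):
    """Plan one local clone per (callee, calling partition) for hot calls.

    edges: {(caller, callee): call_count}    (hypothetical profile data)
    partition_of: {function: partition_id}   (assumed to be precomputed)
    sizes: {function: size_estimate}
    min_count, max_clone_size: assumed thresholds, for illustration only
    """
    clones = {}
    for (caller, callee), count in sorted(edges.items()):
        src, dst = partition_of[caller], partition_of[callee]
        if src == dst:
            continue  # the callee is already local to its caller
        if count < min_count or sizes[callee] > max_clone_size:
            continue  # too cold or too large to be worth duplicating
        clones.setdefault((callee, src), f"{callee}.clone.p{src}")
    return clones


edges = {("a", "util"): 500, ("b", "util"): 400,
         ("c", "util"): 2, ("a", "big"): 900}
partition_of = {"a": 0, "b": 1, "c": 2, "util": 0, "big": 1}
sizes = {"a": 10, "b": 10, "c": 10, "util": 4, "big": 200}
clone_plan = plan_clones(edges, partition_of, sizes,
                         min_count=100, max_clone_size=50)
print(clone_plan)
```

Only `util` gets a clone, placed in the partition of `b`: the call from `a` is already local, the call from `c` is too cold, and `big` exceeds the assumed size limit.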
>>> 
>>> To guide these locality heuristics we use PGO data.
>>> In the absence of PGO data we use a static heuristic that uses the
>>> accumulated estimated edge frequencies of the callees for each function to
>>> guide the reordering.
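To make the static heuristic concrete, here is a minimal sketch that sums the estimated frequencies of each function's outgoing call edges and orders functions by that sum. The input format and the descending order are assumptions for illustration; the message does not say how the patch turns the accumulated values into an order.

```python
def order_by_callee_frequency(call_edges):
    """Order functions by the summed estimated frequency of their callees.

    call_edges: {function: [(callee, estimated_frequency), ...]}
    The frequencies stand in for compiler estimates; the values used below
    are made up for the example.
    """
    totals = {f: sum(freq for _callee, freq in callees)
              for f, callees in call_edges.items()}
    # Highest accumulated frequency first; ties are broken by name.
    order = sorted(totals, key=lambda f: (-totals[f], f))
    return order, totals


call_edges = {
    "main": [("init", 1.0), ("loop", 1.0)],
    "loop": [("step", 100.0), ("log", 0.5)],
    "init": [],
    "step": [("log", 0.1)],
    "log": [],
}
order, totals = order_by_callee_frequency(call_edges)
print(order)
```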
>>> We are investigating some more elaborate static heuristics, in particular
>>> using the demangled C++ names to group template instantiations together.
>>> This is promising, but we are currently working out some kinks in the
>>> implementation and want to send that out as a follow-up once we're more
>>> confident in it.
>>> 
>>> A new bootstrap-lto-locality bootstrap config is added that allows us to
>>> test this on GCC itself with either static or PGO heuristics.
>>> GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).
>>> 
>>> With this optimization we are seeing good performance gains on some large
>>> internal workloads that stress the parts of the processor that are sensitive
>>> to code locality, but we'd appreciate wider performance evaluation.
>>> 
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>> Ok for mainline?
>>> Thanks,
>>> Kyrill
>>> 
>>> Signed-off-by: Prachi Godbole <pgodb...@nvidia.com>
>>> Co-authored-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> 
>>> config/ChangeLog:
>>>         * bootstrap-lto-locality.mk: New file.
>>> 
>>> gcc/ChangeLog:
>>>         * Makefile.in (OBJS): Add ipa-locality-cloning.o.
>>>         (GTFILES): Add ipa-locality-cloning.cc dependency.
>>>         * common.opt (lto_partition_model): Add locality value.
>>>         * flag-types.h (lto_partition_model): Add LTO_PARTITION_LOCALITY
>>>         value.
>>>         (enum lto_locality_cloning_model): Define.
>>>         * lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping
>>>         of node and index.
>>>         * params.opt (lto_locality_cloning_model): New enum.
>>>         (lto-partition-locality-cloning): New param.
>>>         (lto-partition-locality-frequency-cutoff): Likewise.
>>>         (lto-partition-locality-size-cutoff): Likewise.
>>>         (lto-max-locality-partition): Likewise.
>>>         * passes.def: Add pass_ipa_locality_cloning.
>>>         * timevar.def (TV_IPA_LC): New timevar.
>>>         * tree-pass.h (make_pass_ipa_locality_cloning): Declare.
>>>         * ipa-locality-cloning.cc: New file.
>>>         * ipa-locality-cloning.h: New file.
>>> 
>>> gcc/lto/ChangeLog:
>>>         * lto-partition.cc: Include ipa-locality-cloning.h.
>>>         (add_node_references_to_partition): Define.
>>>         (create_partition): Likewise.
>>>         (lto_locality_map): Likewise.
>>>         (lto_promote_cross_file_statics): Add extra dumping.
>>>         * lto-partition.h (lto_locality_map): Declare.
>>>         * lto.cc (do_whole_program_analysis): Handle
>>>         LTO_PARTITION_LOCALITY.
>>> 
>>> <0001-Introduce-flto-partition-locality.patch>
>> 
> 
