Hi all, I'd like to continue the discussion on teaching GCC to optimise code layout for locality between callees and callers. This is work that we've been doing at NVIDIA, primarily Prachi Godbole (CC'ed) and myself. This is a follow-up to the discussion we had at GNU Cauldron at the IPA/LTO BoF [1]. We're pretty far along in some implementation aspects, but some implementation and evaluation areas have questions that we'd like advice on.
Goals and motivation: For some CPUs it is beneficial to minimise the branch distance between frequently called functions. This effect is more pronounced for large applications, sometimes composed of multiple APIs/modules where each module has deep call chains that form a sort of callgraph cluster. The effect is more pronounced for multi-DSO applications but solving this cross-DSO is out of scope for this work. We see value in having GCC minimising the branching distances within even single applications. Design: To perform this the compiler needs to see as much of the callgraph as possible so naturally this should be performed during LTO. Profile data from PGO is needed to determine the hot caller/calle relationships so we require that as well. However, it is possible that static non-PGO heuristics could do a good-enough job in many cases, but we haven't experimented with them extensively thus far. The optimisation performs two things: 1) It partitions the callgraph into clusters based on the caller/callee hotness and groups the functions within those clusters together. 2) For functions in the callgraph that crass cluster boundaries we perform cloning so that each clone can be grouped close to their cluster for locality. Implementation: The partitioning 1) is done at the LTO partitioning stage through a new option to -flto-partition. We add -flto-partition=locality. At the Cauldron Richi suggested that maybe we could have a separated dedicated clustering/partitioning pass for this, I'd like to check whether that's indeed the direction we want to take. The cloning 2) is done separately in an IPA pass we're calling "IPA locality cloning". This is currently run after pass_ipa_inline and before pass_ipa_pure_const. We found that trying to do both partitioning and cloning in the same pass hit all kinds of asserts about function summaries not being valid. Remaining TODOs, issues: * For testing we added a bootstrap-lto-locality configuration that enables this optimisation for GCC bootstrap. Currently the code bootstraps successfully with LTO bootstrap and profiledbootstrap with the locality partitioning and cloning enabled. This gives us confidence that nothing is catastrophically wrong with the code. * The bulk of the work was developed against GCC 12 because the motivating use case is a large internal workload that only supported GCC 12. We've rebased the work to GCC trunk and updated the code to bootstrap and test there, but we'd appreciate the usual code review that it uses the appropriate GCC 15 APIs. We're open to ideas about integrating these optimisations with existing passes to avoid duplication where possible. * Thanks to Honza for pointing out previous work in this area by Martin Liska [2] that proposes a -freorder-functions-algorithm=call-chain-clustering option. This looks like good work that we'd be interesting in seeing go in. We haven't evaluated it yet ourselves but one thing it's missing is the cloning from our approach. Also the patch seems to rely on a .text.sorted. section in the linker. Is that because we're worried about the linker doing further reordering of functions that invalidates this optimisation? Could we do this optimisation without the linker section? We're currently viewing it as an orthogonal optimisation that should be pursued on its own, but are interested in other ideas. * The size of the clusters depends on the microarchitecture and I think we'd want to control its size through something like a param that target code can set. We currently have a number of params that we added that control various aggressiveness settings around cluster size and cloning. We would want to have sensible defaults or deduce them from code analysis if possible. * In the absence of PGO data we're interested in developing some static heuristics to guide this. One area where we'd like advice is how to detect functions that have been instantiated from the same template, as we find that they are usually the kind of functions that we want to keep together. We are exploring a few options here and if we find something that works we’ll propose them. * Our prototype gives measurable differences in the large internal app that motivated this work. We will be performing more benchmarking on workloads that we can share with the community, but generally the idea of laying out code to maximise locality is now an established Good Thing (TM) in toolchains given the research from Facebook that Martin quotes [2] and the invention of tools like BOLT. So I'm hoping the motivation for this work isn't particularly controversial. We will be doing more benchmarking and evaluation in the coming weeks. We're grateful for feedback on how we can get this into mainline. Thanks! Kyrill [1] https://youtu.be/bWH4_in2s0o?list=PL_GiHdX17Wtye5q4zPVnJc6ec7PdnUNil&t=244 [2] https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01142.html
0001-RFC-Locality-cloning-and-partitioning.patch
Description: 0001-RFC-Locality-cloning-and-partitioning.patch