Hi all,

I'd like to continue the discussion on teaching GCC to optimise code layout
for locality between callees and callers. This is work that we've been doing
at NVIDIA, primarily Prachi Godbole (CC'ed) and myself.
This is a follow-up to the discussion we had at GNU Cauldron at the IPA/LTO
BoF [1]. We're pretty far along in some implementation aspects, but some
implementation and evaluation areas have questions that we'd like advice on.

Goals and motivation:
For some CPUs it is beneficial to minimise the branch distance between
frequently called functions. This effect is more pronounced for large
applications, sometimes composed of multiple APIs/modules where each module
has deep call chains that form a sort of callgraph cluster. The effect is more
pronounced for multi-DSO applications but solving this cross-DSO is out of
scope for this work. We see value in having GCC minimising the branching
distances within even single applications.

Design:
To perform this the compiler needs to see as much of the callgraph as possible
so naturally this should be performed during LTO. Profile data from PGO is
needed to determine the hot caller/calle relationships so we require that as
well. However, it is possible that static non-PGO heuristics could do a
good-enough job in many cases, but we haven't experimented with them
extensively thus far. The optimisation performs two things:
1) It partitions the callgraph into clusters based on the caller/callee hotness
and groups the functions within those clusters together.
2) For functions in the callgraph that crass cluster boundaries we perform
cloning so that each clone can be grouped close to their cluster for locality.

Implementation:
The partitioning 1) is done at the LTO partitioning stage through a new option
to -flto-partition. We add -flto-partition=locality. At the Cauldron Richi
suggested that maybe we could have a separated dedicated
clustering/partitioning pass for this, I'd like to check whether that's indeed
the direction we want to take.
The cloning 2) is done separately in an IPA pass we're calling "IPA locality
cloning". This is currently run after pass_ipa_inline and before
pass_ipa_pure_const. We found that trying to do both partitioning and cloning
in the same pass hit all kinds of asserts about function summaries not being
valid.

Remaining TODOs, issues:
* For testing we added a bootstrap-lto-locality configuration that enables this
optimisation for GCC bootstrap. Currently the code bootstraps successfully
with LTO bootstrap and profiledbootstrap with the locality partitioning and
cloning enabled. This gives us confidence that nothing is catastrophically
wrong with the code.

* The bulk of the work was developed against GCC 12 because the motivating use
case is a large internal workload that only supported GCC 12. We've rebased
the work to GCC trunk and updated the code to bootstrap and test there, but
we'd appreciate the usual code review that it uses the appropriate GCC 15 APIs.
We're open to ideas about integrating these optimisations with existing passes
to avoid duplication where possible.

* Thanks to Honza for pointing out previous work in this area by Martin Liska 
[2]
that proposes a -freorder-functions-algorithm=call-chain-clustering option.
This looks like good work that we'd be interesting in seeing go in. We haven't
evaluated it yet ourselves but one thing it's missing is the cloning from our
approach. Also the patch seems to rely on a .text.sorted. section in the
linker. Is that because we're worried about the linker doing further
reordering of functions that invalidates this optimisation? Could we do this
optimisation without the linker section? We're currently viewing it as an
orthogonal optimisation that should be pursued on its own, but are interested
in other ideas.

* The size of the clusters depends on the microarchitecture and I think we'd
want to control its size through something like a param that target code can
set. We currently have a number of params that we added that control various
aggressiveness settings around cluster size and cloning. We would want to
have sensible defaults or deduce them from code analysis if possible.

* In the absence of PGO data we're interested in developing some static
heuristics to guide this. One area where we'd like advice is how to detect
functions that have been instantiated from the same template, as we find that
they are usually the kind of functions that we want to keep together.
We are exploring a few options here and if we find something that works we’ll
propose them.

* Our prototype gives measurable differences in the large internal app that
motivated this work. We will be performing more benchmarking on workloads that
we can share with the community, but generally the idea of laying out code to
maximise locality is now an established Good Thing (TM) in toolchains given the
research from Facebook that Martin quotes [2] and the invention of tools like
BOLT. So I'm hoping the motivation for this work isn't particularly
controversial. We will be doing more benchmarking and evaluation in the coming
weeks.

We're grateful for feedback on how we can get this into mainline.
Thanks!
Kyrill

[1] https://youtu.be/bWH4_in2s0o?list=PL_GiHdX17Wtye5q4zPVnJc6ec7PdnUNil&t=244
[2] https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01142.html

Attachment: 0001-RFC-Locality-cloning-and-partitioning.patch
Description: 0001-RFC-Locality-cloning-and-partitioning.patch

Reply via email to