When benchmarking, there is often noise from semi-random cache
layout effects.  If the program in question has a code working set
that fits in the instruction cache, this noise can be mostly
attributed to thrashing caused by unfortunate code layout.

If we could lay out the code working set within a contiguous memory
region no larger than the instruction cache, this thrashing should
stop.

Obviously, this would best be done with profile-based feedback, but
that can be a time-consuming process, not only because of the actual
program runs, but also because it requires changing the way the
program is built.

So I was wondering if we could get a good first-order approximation
by placing library code that is called frequently together with the
code that calls it.  In particular, I think this would be beneficial
for the libgcc functions for integer division and floating-point
arithmetic.  These functions are also in the implementation
namespace, so the risk of breaking a program by doing unconventional
things with link order is lowered.
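
For example (a sketch only; which helpers are actually used depends
on the target), on a 32-bit target without a hardware divider, GCC
lowers plain C division to libgcc calls:

    /* Both divisions below may compile to libgcc calls on such a
       target, e.g. __divsi3 for int and __divdi3 for long long.  */
    int scale (int num, int den)
    {
      return num / den;
    }

    long long scale64 (long long num, long long den)
    {
      return num / den;
    }
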
I think we could use the existing heuristics for classifying blocks
as hot / maybe hot / cold to decide whether a function call is
relevant for code layout.  Compiler options could be used to control
whether to do this for all blocks, for hot blocks only, or not at
all, as sketched below.
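
For illustration, the option could look something like this (the
option name and its values are purely hypothetical, chosen only to
show the intended granularity):

    gcc -O2 -fhot-libcall-layout=all   # consider calls in all blocks
    gcc -O2 -fhot-libcall-layout=hot   # hot blocks only
    gcc -O2 -fhot-libcall-layout=none  # disable the layout pass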

We can place the selected library functions at the start of the link
by using -u options on them and -lgcc after that (but before the
objects).  If the main program is reasonably small, but the total of
library code eventually included is large, this arrangement can get
us from a working set spread over an area larger than the cache size
to one that fits within it.
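
For illustration, assuming the integer division helpers are the hot
ones on the target (the object file names are just placeholders),
the driver could construct a link line along these lines:

    gcc -o prog -u __divsi3 -u __divdi3 -lgcc main.o util.o

Since the linker places input sections in command-line order by
default, the forced libgcc members end up laid out together, ahead
of the program's own objects.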

The remaining question is how best to get the information from the
compiler proper (cc1 / cc1plus etc.) to the linker.
Should the compiler write a temporary file, which is then read by the
compiler driver to construct the link line?
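
As a sketch of one possible shape for that file (the name and format
here are purely hypothetical): cc1 could emit the selected symbols,
one per line, into e.g. foo.hotcalls next to foo.o:

    __divsi3
    __divdi3
    __muldf3

The driver would then collect these files across all translation
units, deduplicate the symbols, and prepend the corresponding -u
options and -lgcc to the link line.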

