When benchmarking, there is often noise from semi-random cache layout effects. If the program in question has a code working set that fits in the instruction cache, I think this noise can mostly be attributed to thrashing caused by unfortunate code layout.
If we could lay out the code working set within a contiguous memory region no larger than the instruction cache, this thrashing should stop. Obviously, this would best be done with profile-based feedback, but that can be a time-consuming process, not only because of the actual program runs, but also because it requires changing the way the program is built.

So I was wondering if we could get a good first-order approximation by placing library code that is called frequently together with the code that calls it. In particular, I think this would be beneficial for the libgcc functions for integer division and floating-point arithmetic (see the example below). These functions are also in the implementation namespace, so the risk of breaking a program by doing unconventional things with link order would be lower.

I think we could use the existing heuristics for classifying hot / maybe-hot / cold blocks to decide whether a function call is relevant for code layout. Compiler options could control whether to do this for normal blocks, for hot blocks only, or not at all.

We can place the selected library functions at the start of the link by using -u options for them, followed by -lgcc (but before the objects); see the link line sketch below. If the main program is reasonably small, but the total of library code eventually included is large, this arrangement can take us from a working set spread over an area larger than the cache size to one that fits within the cache size.

The remaining question is how best to get the information from the compiler proper (cc1 / cc1plus etc.) to the linker. Should the compiler write a temporary file, which is then read by the compiler driver to construct the link line?
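For illustration, this is the kind of source code that pulls in libgcc helpers without any explicit call: on a 32-bit target without a 64-bit divide instruction, GCC emits calls to helpers usually named __udivdi3 / __umoddi3 (the exact names depend on the target, so take them only as an example).

  /* 64-bit division on a 32-bit target is not a single instruction,
     so GCC calls a libgcc helper to do it.  */
  unsigned long long
  average (unsigned long long sum, unsigned long long count)
  {
    return sum / count;   /* compiles to a call to __udivdi3 */
  }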
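And a sketch of the link line arrangement described above, assuming __udivdi3 and __umoddi3 have been identified as hot; the symbol names are just for illustration, and the list would eventually come from whatever file the compiler writes for the driver rather than being typed by hand:

  # Force the listed libgcc members to be resolved from the early -lgcc,
  # so they are placed ahead of the program's own objects.
  gcc -O2 -u __udivdi3 -u __umoddi3 -lgcc main.o compute.o -o prog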