Hi,
When dynamically linking, a fast enough machine hides the latency, but when
statically linking or on slower devices this change caused a 5x increase in
instruction count and a 2x increase in cycle count before getting to main.
I have looked at ways to fix that. The problem is that with static
linking, unwinding tables are registered dynamically, and with my patch
that registration triggers an eager sort of the FDE lists, whereas
previously the lists were sorted only when the first exception was
thrown. If an application throws at least one exception there is no
downside to eager sorting, but if the application never throws there is
overhead.
The obvious way to improve the situation is to make sorting faster.
Replacing the split+sort+merge logic with a radix sort (which can be
done without additional memory) gives the following timings for your

#include <cstdio>
int main() {}

example (measured with perf stat -r 200):
# pre-patch version, fast
0,06 msec task-clock
272.286 cycles
464.754 instructions
# post-patch version, slow
0,21 msec task-clock
972.876 cycles
3.079.515 instructions
# +radix sort, in the middle
0,13 msec task-clock
604.697 cycles
1.702.930 instructions
The radix sort clearly improves things, but it does not fully eliminate
the overhead.
The question is how much we care about that situation (i.e., static
linking, exceptions registered but never thrown). I could change the
code to recognize three states instead of two: no exceptions registered,
exceptions registered but never thrown, and full exception mode. But
that would increase code complexity, and it would pessimize applications
that do throw, as we then need more checks to guard against concurrent
changes. It makes the code more complex and a bit slower, which is why I
am hesitant. But I can implement that if you think we need it. Or we
just replace the sort, which is probably a good idea anyway.
Best
Thomas