Hi all, (please CC me in replies, not a list member)
I have a large C++ app that throws exceptions to unwind anywhere from 5-20 stack frames when an error prevents the request from being served (which happens rather frequently). Works fine single-threaded, but performance is terrible for 24 threads on a 48-thread Ubuntu 10 machine. Profiling points to a global mutex acquire in __GI___dl_iterate_phdr as the culprit, with _Unwind_Find_FDE as its caller.
Tracing the attached test case with the attached gdb script shows that the -DDTOR case executes ~20k instructions during unwind and calls iterate_phdr 12 times. The -DTRY case executes ~33k instructions and calls iterate_phdr 18 times. The exception in this test case only affects three stack frames, with minimal cleanup required, and the trace is taken on the second call to the function that swallows the error, to warm up libgcc's internal caches [1].
The instruction counts aren't terribly surprising (I know unwinding is complex), but might it be possible to throw and catch a previously-seen exception through a previously-seen stack trace with fewer than 4-6 global mutex acquires per frame unwound? (That's 12 iterate_phdr calls over three frames in the -DDTOR case, 18 over three frames in the -DTRY case.) As it stands, the deeper the stack trace (= the more desirable it is to throw rather than return an error), the worse a scalability bottleneck unwinding becomes. My actual app would apparently suffer anywhere from 25 to 80 global mutex acquires for each exception thrown, which probably explains why the bottleneck arises.
I'm bringing the issue up here, rather than filing a bug, because I'm not sure whether this is an oversight, a known problem that's hard to fix, or a feature (e.g. somehow required for reliable unwinding). I suspect the first, because _Unwind_Find_FDE tries a call to _Unwind_Find_registered_FDE before falling back to dl_iterate_phdr, but that call never succeeds in my trace (iterate_phdr is always called).
FWIW, I've tested both gcc-4.6 and 4.8 but see no meaningful difference between them.
[1] The caches can be seen in libgcc/unwind-dw2-fde-dip.c, though they do little to prevent mutex bottlenecks because they're accessed from the iterate_phdr callback, behind the mutex acquire.
Thoughts? Ryan
// Test case: compile with -DTRY or -DDTOR.
#include <cstdio>

void ding() { fputs("Ding!\n", stderr); }

struct ding_unless {
    bool commit;
    ding_unless() : commit(false) { }
    ~ding_unless() { if (not commit) ding(); }
};

void __attribute__((noinline)) sentinel() { printf("Done\n"); }

int __attribute__((noinline)) foo() { throw 42; }

int __attribute__((noinline)) bar() {
#ifdef TRY
    try { return 1+foo(); } catch (...) { ding(); throw; }
#elif defined(DTOR)
    ding_unless x;
    int ans = 1+foo();
    x.commit = true;
    return ans;
#endif
}

int __attribute__((noinline)) baz() {
    int ans;
    try { ans = 1+bar(); } catch (...) { ans = -1; }
    sentinel();
    return ans;
}

int main() { baz(); baz(); return 0; }
# gdb script: single-step the second call to baz until sentinel is reached.
b baz
r
c
display/i $pc
si
while ($pc != sentinel)
  si
end
k
q