On Mon, May 27, 2013 at 3:20 PM, Ryan Johnson <ryan.john...@cs.utoronto.ca> wrote:
>
> I have a large C++ app that throws exceptions to unwind anywhere from 5-20
> stack frames when an error prevents the request from being served (which
> happens rather frequently). Works fine single-threaded, but performance is
> terrible for 24 threads on a 48-thread Ubuntu 10 machine. Profiling points
> to a global mutex acquire in __GI___dl_iterate_phdr as the culprit, with
> _Unwind_Find_FDE as its caller.
>
> Tracing the attached test case with the attached gdb script shows that the
> -DDTOR case executes ~20k instructions during unwind and calls iterate_phdr
> 12 times. The -DTRY case executes ~33k instructions and calls iterate_phdr
> 18 times. The exception in this test case only affects three stack frames,
> with minimal cleanup required, and the trace is taken on the second call to
> the function that swallows the error, to warm up libgcc's internal caches
> [1].
>
> The instruction counts aren't terribly surprising---I know unwinding is
> complex---but might it be possible to throw and catch a previously-seen
> exception through a previously-seen stack trace, with something fewer than
> 4-6 global mutex acquires for each frame unwound? As it stands, the deeper
> the stack trace (= the more desirable to throw rather than return an error),
> the more of a scalability bottleneck unwinding becomes. My actual app would
> apparently suffer anywhere from 25 to 80 global mutex acquires for each
> exception thrown, which probably explains why the bottleneck arises...
>
> I'm bringing the issue up here, rather than filing a bug, because I'm not
> sure whether this is an oversight, a known problem that's hard to fix, or a
> feature (e.g. somehow required for reliable unwinding). I suspect the
> former, because _Unwind_Find_FDE tries a call to _Unwind_Find_registered_FDE
> before falling back to dl_iterate_phdr, but the former never succeeds in my
> trace (iterate_phdr is always called).

The issue is dlclose followed by dlopen. If we had a cache ahead of
dl_iterate_phdr, we would need some way to clear out any information
cached from a dlclose'd library. Otherwise we might pick up the old
information when looking up an address from a new dlopen.

So 1) locking will always be required; 2) any caching system to reduce
the number of locks will require support for dlclose, somehow.

It's worth working on but there isn't going to be a simple solution.

Ian
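
To make point 2) concrete, here is a minimal sketch of the kind of staleness check a cache in front of dl_iterate_phdr would need. It assumes glibc 2.4 or later, where struct dl_phdr_info carries the dlpi_adds and dlpi_subs counters (objects loaded and unloaded so far). The names LoaderGeneration, current_generation and CacheGuard are hypothetical and used only for illustration. The check itself still goes through dl_iterate_phdr and so still takes the loader lock, which is point 1); it only shows how a dlclose could be detected, not how to avoid the lock.

```cpp
// Sketch only: detect dlopen/dlclose activity through glibc's load counters
// so that a (hypothetical) FDE lookup cache knows when to flush itself.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // dl_iterate_phdr is a GNU extension in <link.h>
#endif
#include <link.h>     // dl_iterate_phdr, struct dl_phdr_info
#include <cstddef>    // size_t, offsetof
#include <mutex>

// Hypothetical snapshot of the loader state.
struct LoaderGeneration {
    unsigned long long adds = 0;  // objects loaded so far (dlpi_adds)
    unsigned long long subs = 0;  // objects unloaded so far (dlpi_subs)
    bool operator==(const LoaderGeneration& o) const {
        return adds == o.adds && subs == o.subs;
    }
};

// Callback: copy the counters from the first object reported, then stop.
static int read_generation(struct dl_phdr_info* info, size_t size, void* data) {
    auto* gen = static_cast<LoaderGeneration*>(data);
    // The counters are present only if the loader's struct is large enough.
    if (size >= offsetof(dl_phdr_info, dlpi_subs) + sizeof(info->dlpi_subs)) {
        gen->adds = info->dlpi_adds;
        gen->subs = info->dlpi_subs;
    }
    return 1;  // a non-zero return ends the iteration
}

// Note: this call acquires the dynamic loader's lock.
inline LoaderGeneration current_generation() {
    LoaderGeneration gen;
    dl_iterate_phdr(read_generation, &gen);
    return gen;
}

// Hypothetical guard a cache would consult before trusting its entries.
class CacheGuard {
public:
    // Returns true when something was loaded or unloaded since the last
    // call, i.e. when cached lookups may describe a dlclose'd library.
    bool needs_flush() {
        LoaderGeneration now = current_generation();
        std::lock_guard<std::mutex> lock(mu_);
        if (now == seen_) return false;
        seen_ = now;
        return true;
    }
private:
    std::mutex mu_;
    LoaderGeneration seen_;
};
```

A cache would call needs_flush() before each lookup and drop its entries whenever it returns true. Comparing both counters rather than their difference matters: a dlclose followed by a dlopen leaves the number of loaded objects unchanged but bumps both values.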