On 05/28/2013 12:20 AM, Ryan Johnson wrote:
> Hi all,
>
> (please CC me in replies, not a list member)
>
> I have a large C++ app that throws exceptions to unwind anywhere from
> 5-20 stack frames when an error prevents the request from being served
> (which happens rather frequently). Works fine single-threaded, but
> performance is terrible for 24 threads on a 48-thread Ubuntu 10
> machine. Profiling points to a global mutex acquire in
> __GI___dl_iterate_phdr as the culprit, with _Unwind_Find_FDE as its
> caller.
>
> Tracing the attached test case with the attached gdb script shows that
> the -DDTOR case executes ~20k instructions during unwind and calls
> iterate_phdr 12 times. The -DTRY case executes ~33k instructions and
> calls iterate_phdr 18 times. The exception in this test case only
> affects three stack frames, with minimal cleanup required, and the
> trace is taken on the second call to the function that swallows the
> error, to warm up libgcc's internal caches [1].
>
> The instruction counts aren't terribly surprising---I know unwinding
> is complex---but might it be possible to throw and catch a
> previously-seen exception through a previously-seen stack trace, with
> something fewer than 4-6 global mutex acquires for each frame unwound?
> As it stands, the deeper the stack trace (= the more desirable to
> throw rather than return an error), the more of a scalability
> bottleneck unwinding becomes. My actual app would apparently suffer
> anywhere from 25 to 80 global mutex acquires for each exception
> thrown, which probably explains why the bottleneck arises...
>
> I'm bringing the issue up here, rather than filing a bug, because I'm
> not sure whether this is an oversight, a known problem that's hard to
> fix, or a feature (e.g. somehow required for reliable unwinding). I
> suspect the first, because _Unwind_Find_FDE tries a call to
> _Unwind_Find_registered_FDE before falling back to dl_iterate_phdr,
> but the former never succeeds in my trace (iterate_phdr is always
> called).
>
> FWIW, I've tested both gcc-4.6 and 4.8 but see no meaningful
> difference between them.
>
> [1] The caches can be seen in libgcc/unwind-dw2-fde-dip.c, though they
> will do little to prevent mutex bottlenecks because they're accessed
> from the iterate_phdr callback, behind the mutex acquire.

If the bottleneck is really in glibc, then you should probably ask them
to fix it. Could the mutex be changed to an rwlock instead?
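For reference, a rough, hypothetical reproducer (not the attached test
case, which isn't included here) of the kind of multithreaded
throw/catch loop that hits this path might look like the following; the
function names and iteration counts are illustrative:

  // Hypothetical sketch, not the original attachment: each thread
  // repeatedly throws and catches through a few noinline frames, so
  // every unwind goes through _Unwind_Find_FDE and, when no registered
  // FDE is found, through dl_iterate_phdr and its global lock.
  #include <stdexcept>
  #include <thread>
  #include <vector>

  static void __attribute__((noinline)) inner() {
      throw std::runtime_error("request failed");
  }
  static void __attribute__((noinline)) middle() { inner(); }

  static void worker(int iters) {
      for (int i = 0; i < iters; ++i) {
          try { middle(); } catch (const std::runtime_error &) { /* swallow */ }
      }
  }

  int main() {
      std::vector<std::thread> threads;
      for (int t = 0; t < 24; ++t)        // the 24-thread run described above
          threads.emplace_back(worker, 100000);
      for (auto &th : threads) th.join();
  }

Building something like that with -O2 -pthread and profiling it as the
thread count grows should show whether the time really goes to the lock
under __GI___dl_iterate_phdr, as in your report.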
-- VZ