On Fri, Aug 30, 2013 at 10:58:25AM +0200, Julian Seward wrote:
> What we have works, but is less than ideal in a number of ways:
>
> (1) It's slow.  CFI/EXIDX unwinding with Breakpad costs about 6600
>     instructions/frame, or around 120+K/instructions per stack trace.
>     That makes it infeasible for high-frequency, low overhead
>     profiling.  By comparison, Valgrind's CFI based unwinder costs
>     around 220 instructions per frame -- that's 30 times fewer.

As another data point, the self-contained EXIDX-specific unwinder I'm
working on in bug 810526 is on the same order as that, with a
relatively small amount of code.  There's clearly room for
optimization, but it's fast enough for 1 kHz on B2G in its current
form (and, if so optimized, would take less time than the
non-unwinding parts of the profiler signal handler).

> https://blog.mozilla.org/jseward/2013/08/29/how-fast-can-cfiexidx-based-stack-unwinding-be/

Re unwinding jitcode, is that on x86/x86_64 and using frame pointers,
by any chance?  I seem to recall that the Baseline compiler's stack
frames include an x86-style frame pointer (which is unhelpful on ARM
platforms that aren't iOS), which Breakpad might be able to
heuristically detect.  IonMonkey might do that too, but I don't know
it as well.

My understanding is that we don't have real unwind info for jitcode
yet, but it's a subject that's been looked into at one point or
another, and which I think ought to happen at some point.

> (2) It has a prohibitively large footprint for Fennec and in
>     particular for FirefoxOS.  My rough measurements show it taking
>     about 100MB of memory to load the CFI unwind rules for libxul.so
>     on ARM.

I don't know if this is the case for CFI, but the ARM Exception
Handling ABI (what we're calling EXIDX) was designed to be interpreted
in-place, with effectively no memory overhead (other than what the OS
uses to cache the faulted-in parts of the tables).

(In practice this doesn't quite work on ICS B2G, because the linker is
buggy and the exception index may not be entirely sorted, but a
meta-index to work around it is only 350 KiB for libxul (and libxul
may not be affected; I first saw this in libc, which is much smaller)
(and I think newer versions fixed the bug).)
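
To illustrate what "interpreted in-place" means for the lookup step,
here is a minimal sketch in C, assuming a correctly sorted index.  The
names exidx_entry, prel31_to_addr and exidx_lookup are hypothetical and
are not taken from the unwinder in bug 810526; only the two-word entry
layout and the prel31 encoding come from the ARM EHABI.  Decoding the
unwind opcodes found via the second word is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One .ARM.exidx entry as laid out by the ARM EHABI: two 32-bit words.
 * (Struct and field names are illustrative assumptions.) */
typedef struct {
    uint32_t fn_offset; /* prel31 offset to the start of the function */
    uint32_t content;   /* 1 = EXIDX_CANTUNWIND, bit 31 set = inline
                           unwind opcodes, else prel31 offset into
                           .ARM.extab */
} exidx_entry;

/* Decode a prel31 value: a 31-bit signed offset relative to the
 * address of the word that holds it. */
static uintptr_t prel31_to_addr(const uint32_t *word)
{
    uint32_t v = *word & 0x7fffffffu;
    /* Sign-extend from bit 30 without relying on signed shifts. */
    int32_t offset = (int32_t)(v & 0x3fffffffu) - (int32_t)(v & 0x40000000u);
    return (uintptr_t)word + (uintptr_t)(intptr_t)offset;
}

/* Binary-search the mapped table for the last entry whose function
 * start is <= pc.  Nothing is copied or allocated; the table is read
 * where it already lives.  Returns NULL if pc precedes every entry. */
static const exidx_entry *exidx_lookup(const exidx_entry *table,
                                       size_t count, uintptr_t pc)
{
    size_t lo = 0, hi = count;
    /* Invariant: entries before lo start at or below pc,
     * entries from hi onward start above pc. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (prel31_to_addr(&table[mid].fn_offset) <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &table[lo - 1];
}
```

The search touches only O(log n) entries per frame, so the only memory
cost is the pages of the table that get faulted in.  If the index is
not fully sorted, as with the linker bug mentioned above, this search
can return the wrong entry, which is the case the meta-index is meant
to cover.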

More generally: given that the B2G libxul has 6754480 bytes of
.debug_frame, but only 726472 bytes of .ARM.exidx and 47140 of
.ARM.extab, even if we need an in-memory representation of the entire
unwind info it should be possible to keep it relatively small.

> (4) (not directly related to Breakpad, but..) Current SPS samples
>     threads at some constant rate regardless of whether they are
>     running or blocked in syscalls.  The latter constitutes a big
>     waste of compute resources, since most threads are blocked in
>     syscalls most of the time.  It would be nice to devise an adaptive
>     strategy that samples blocked threads less often.

For what it's worth, using the Linux perf_event subsystem instead of
our signal handler could help here... but then we'd need an unwinder
that can get the information we want, asynchronously, with just a
snapshot of the thread's registers and the top N bytes of its stack
(i.e., not the pseudostack or anything like that, unless we make
Gecko-specific changes to the kernel(!)).

Also, non-x86 platforms don't have kernel support for user
stack/register capture yet; there are some bugs about this hanging off
of 831611.

--Jed

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform