On Fri, Aug 30, 2013 at 10:58:25AM +0200, Julian Seward wrote:
> What we have works, but is less than ideal in a number of ways:
> 
> (1) It's slow.  CFI/EXIDX unwinding with Breakpad costs about 6600
>     instructions/frame, or around 120+K/instructions per stack trace.
>     That makes it infeasible for high-frequency, low overhead
>     profiling.  By comparison, Valgrind's CFI based unwinder costs
>     around 220 instructions per frame -- that's 30 times fewer.

As another data point, the self-contained EXIDX-specific unwinder I'm
working on in bug 810526 is on the same order as Valgrind's figure,
with a relatively small amount of code.  There's clearly room for
optimization, but it's
fast enough for 1 kHz on B2G in its current form (and, if so optimized,
would take less time than the non-unwinding parts of the profiler signal
handler).

> https://blog.mozilla.org/jseward/2013/08/29/how-fast-can-cfiexidx-based-stack-unwinding-be/

Re unwinding jitcode, is that on x86/x86_64 and using frame pointers, by
any chance?  I seem to recall that the Baseline compiler's stack frames
include an x86-style frame pointer (which is unhelpful on ARM platforms
that aren't iOS), which Breakpad might be able to heuristically detect.
IonMonkey might do that too, but I don't know it as well.

My understanding is that we don't have real unwind info for jitcode yet,
but it's been looked into at one point or another, and I think it ought
to happen eventually.

> (2) It has a prohibitively large footprint for Fennec and in
>     particular for FirefoxOS.  My rough measurements show it taking
>     about 100MB of memory to load the CFI unwind rules for libxul.so
>     on ARM.

I don't know if this is the case for CFI, but the ARM Exception Handling
ABI (what we're calling EXIDX) was designed to be interpreted in-place,
with effectively no memory overhead (other than what the OS uses to cache
the faulted-in parts of the tables).
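
To illustrate what "interpreted in-place" can look like, here is a
minimal Python sketch of a lookup over a raw .ARM.exidx image.  It
assumes the layout from the ARM EHABI: 8-byte entries, sorted by
function address, whose first word is a 31-bit place-relative offset to
the function start.  The function names and the `table`/`base`
parameters are made up for this example and are not taken from
bug 810526.

```python
import struct

EXIDX_CANTUNWIND = 0x1  # second word: this function cannot be unwound


def prel31_to_addr(word, place):
    """Decode a 31-bit place-relative offset stored at address `place`."""
    offset = word & 0x7FFFFFFF
    if offset & 0x40000000:  # bit 30 is the sign bit of the 31-bit field
        offset -= 0x80000000
    return (place + offset) & 0xFFFFFFFF


def exidx_entry(table, base, i):
    """Read entry i straight out of the mapped bytes; nothing is copied
    into a separate data structure.  Returns (function start, data word)."""
    fn_word, data_word = struct.unpack_from("<II", table, 8 * i)
    return prel31_to_addr(fn_word, base + 8 * i), data_word


def exidx_lookup(table, base, pc):
    """Binary-search a sorted index for the last entry starting at or
    before pc.  Returns the entry number, or None if pc is below the
    first function."""
    lo, hi = 0, len(table) // 8
    while lo < hi:
        mid = (lo + hi) // 2
        start, _ = exidx_entry(table, base, mid)
        if start <= pc:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1 if lo > 0 else None
```

The data word of the entry found is then either EXIDX_CANTUNWIND, an
inline unwind description (bit 31 set), or another place-relative offset
into .ARM.extab; decoding those is left out of the sketch.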

(In practice this doesn't quite work on ICS B2G, because the linker
is buggy and the exception index may not be entirely sorted.  A
meta-index to work around that is only 350 KiB for libxul, and libxul
may not even be affected: I first saw this in libc, which is much
smaller.  I also think newer versions fixed the bug.)
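
One plausible shape for such a meta-index, sketched in Python: leave the
on-disk index untouched and keep a side table of entry numbers ordered
by function start, then binary-search through that.  This is an
assumption made for illustration, not a description of the actual
workaround; `starts[i]` stands in for "decode the start address of index
entry i", and both function names are hypothetical.

```python
def build_meta_index(starts):
    """Return entry numbers ordered by function start address."""
    return sorted(range(len(starts)), key=starts.__getitem__)


def lookup_via_meta(starts, meta, pc):
    """Find the entry covering pc by searching the sorted side table
    instead of the possibly unsorted index itself."""
    lo, hi = 0, len(meta)
    while lo < hi:
        mid = (lo + hi) // 2
        if starts[meta[mid]] <= pc:
            lo = mid + 1
        else:
            hi = mid
    return meta[lo - 1] if lo > 0 else None
```

The side table only holds small integers, so it stays much smaller than
a parsed copy of the unwind rules would be.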

More generally: given that the B2G libxul has 6754480 bytes of
.debug_frame, but only 726472 bytes of .ARM.exidx and 47140 of
.ARM.extab, even if we need an in-memory representation of the entire
unwind info it should be possible to keep it relatively small.
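
For scale, the arithmetic behind those section sizes, as a small Python
sketch.  The 8-bytes-per-entry figure comes from the EHABI index format;
the 4-byte side table at the end is a hypothetical layout used only to
show the order of magnitude.

```python
DEBUG_FRAME = 6754480  # bytes of .debug_frame in the B2G libxul
ARM_EXIDX = 726472     # bytes of .ARM.exidx
ARM_EXTAB = 47140      # bytes of .ARM.extab

ehabi_total = ARM_EXIDX + ARM_EXTAB  # all of the EHABI unwind data
ratio = DEBUG_FRAME / ehabi_total    # how much larger .debug_frame is
entries = ARM_EXIDX // 8             # each index entry is two 32-bit words
side_table = entries * 4             # hypothetical: one 32-bit value per entry

print(f"EHABI data: {ehabi_total} bytes, {ratio:.1f}x smaller than .debug_frame")
print(f"{entries} index entries; a 4-byte-per-entry side table is {side_table} bytes")
```

So the EHABI data is under a megabyte and roughly an order of magnitude
smaller than .debug_frame.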

> (4) (not directly related to Breakpad, but..) Current SPS samples
>     threads at some constant rate regardless of whether they are
>     running or blocked in syscalls.  The latter constitutes a big
>     waste of compute resources, since most threads are blocked in
>     syscalls most of the time.  It would be nice to devise an adaptive
>     strategy that samples blocked threads less often.

For what it's worth, using the Linux perf_event subsystem instead of our
signal handler could help here... but then we'd need an unwinder that
can get the information we want, asynchronously, with just a snapshot of
the thread's registers and the top N bytes of its stack (i.e., not the
pseudostack or anything like that, unless we make Gecko-specific changes
to the kernel(!)).  Also, non-x86 platforms don't have kernel support for
user stack/register capture yet; there are some bugs about this hanging
off of 831611.

--Jed

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
