Hi, I recently experimented with changes to libgcc's DWARF unwinder, aimed at improving performance on Android, and I'm wondering if upstream would be interested in a patch.
The most interesting change modified the cache in unwind-dw2-fde-dip.c to help even when the dlpi_adds and dlpi_subs fields are missing. Normally, the unwinder uses these fields to invalidate the cache (e.g. because a module was unloaded), so if the fields are missing, the cache is disabled. Typically, the fields are present. (e.g. They've been in glibc since 2.4. FreeBSD, NetBSD, and illumos also have them.) Unfortunately, while I've added the two fields for the next Android release, existing releases lack the fields and will be around for a few years. While I can't update the dynamic linker in old versions of Android, Android apps bundle their own unwinders, which can be upgraded relatively quickly. With my modified libgcc, when the two fields are missing, I use the cache to provide a "load base hint". I then do the ordinary module scan, but I exit early if dl_phdr_info::dlpi_addr doesn't match the hint. Unless a module has been unloaded, this first pass should find a matching module. Otherwise, I do a second pass that doesn't use the hint. This limited cache use appears to give most of the benefit of the cache. An NDK user provided an "extest" benchmark at[1], which loads 100 DSOs and throws 10,000 exceptions. I see run-time numbers like: old libgcc, old rtld: ~6600ms (no caching) new libgcc, old rtld: ~1500ms (uses load_base hint) new libgcc, new rtld: ~900ms (uses dlpi_adds / dlpi_subs) The patch refactors much of unwind-dw2-fde-dip.c, and the changes are mostly only useful for targets which lack the dlpi_adds and dlpi_subs fields, so I'm not sure how interested upstream would be in the patch. Still, I'd appreciate feedback if anyone wants to share any. I can send a patch to gcc-patches, but for now, the changes are on Android's Gerrit tracker[3][4]. Various details: On a cache miss, I think _Unwind_IteratePhdrCallback is rescanning the cache for every DSO. I think it needs to set prev_cache_entry and next_cache_entry, but that could still be done once per dl_iterate_phdr call, if we assume that dl_iterate_phdr callbacks aren't interleaved (or overlapped). I think that's the case with glibc, FreeBSD, and Bionic. AFAICT, musl's dl_iterate_phdr allows concurrent callbacks, so maybe the musl+libgcc combination is racy. _Unwind_IteratePhdrCallback scans a phdr table in one pass looking for all the segments it cares about (usually PT_LOAD, PT_GNU_EH_FRAME, and PT_DYNAMIC). I think it makes sense to have an initial pass that only looks for PT_LOAD, because most modules won't match the PC. If the PC is less than the load base (dlpi_addr), then I think the unwinder can ignore the phdr entries. This optimization is present in libunwind_llvm. It helps with the extest benchmark but introduces a lot of variance[2]. ASLR then affects exception handling performance. This work was motivated by an NDK user's complaint that arm64 exceptions (libgcc) were slower than arm32 exceptions[1]. For arm32, Android uses LLVM's libunwind instead, which uses dl_unwind_find_exidx/__gnu_Unwind_Find_exidx. Bionic's linker has optimized versions of these functions. I'm wondering why libgcc checks both dlpi_adds and dlpi_subs. Isn't dlpi_subs sufficient? I'm wondering if there are any tests or benchmarks that would be useful here. I haven't tried running the gcc testsuite yet. Thanks, -Ryan [1] https://github.com/android/ndk/issues/1062 [2] https://github.com/android/ndk/issues/1062#issuecomment-536866310 [3] https://android-review.googlesource.com/c/toolchain/gcc/+/1130238 [4] https://android-review.googlesource.com/c/toolchain/gcc/+/1133815