First message failed to send, hopefully this one works...
On Wed, Aug 21, 2019 at 6:16 AM, Mark Wielaard <m...@klomp.org> wrote:
Hi Jonathon and Srđan,
On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
For parallel applications that need the information in the DIEs, the
Dwarf_Abbrev hash table et al. become a massive data race. This
fixes
that by:
1. Adding atomics & locks to the hash table to manage concurrency
(lib/dynamicsizehash_concurrent.{c,h})
2. Adding a lock & array structure to the memory manager
(pseudo-TLS)
(libdwP.h, libdw_alloc.c)
3. Adding extra configure options for Helgrind/DRD annotations
(configure.ac)
4. Including "stdatomic.h" from FreeBSD, to support C11-style
atomics.
(lib/stdatomic.h)
This looks like really nice work. Thanks!
I am splitting review in some smaller parts if you don't mind.
Simply because it is large and I cannot keep everything in my head at
once :) But here some initial overall comments.
Thank you for looking over it! Comments on comments below.
Notes:
- GCC >= 4.9 provides <stdatomic.h> natively; for those versions
lib/stdatomic.h could be removed or disabled. We can also
rewrite the
file if the copyright becomes an issue.
If the compiler provides stdatomic.h then I think it would be good to
use that instead of our own implementation. The copyright isn't a
problem. But do you have a reference/URL to the upstream version? I
like to add that somewhere, so we can sync with it in the future. I
see
various commented out parts. Was that already upstream? Should we just
remove those parts?
It would definitely be preferable to use the compiler's implementation
if possible, we used this in case GCC 4.7 and 4.8 (RHEL7) compatibility
was needed. If those versions are old enough the file can be removed
entirely.
The upstream is at
https://github.com/freebsd/freebsd/blob/master/sys/sys/stdatomic.h,
although the version here appears to be slightly modified (we used the
version that HPCToolkit ships). The components we use don't seem
affected, so a resync shouldn't make a difference.
- Currently the concurrent hash table is always enabled,
performance-wise there is no known difference between it
and the non-concurrent version.
This can be changed to toggle with --enable-thread-safety
if preferred.
I would prefer it always enabled, unless there is a massive slowdown
of
the single-threaded case. The problem with --enable-thread-safety is
that it is a) known broken (sigh) and b) it basically introduces two
separate libraries that behave subtly differently. I would very much
like to get rid of --enable-thread-safety by fixing the broken locking
and simply making it the default.
I haven't noticed any slowdown in the single-threaded case, although I
haven't stressed it hard enough to find out for certain. From a
theoretical standpoint it shouldn't, atomics (with the proper memory
orders) are usually (on x86 at least) as cheap as normal accesses when
used by a single thread, and the algorithm is otherwise effectively the
same as the original hash table.
How difficult would it be to fix the locking (or rather, what's
"broken")? We would definitely benefit from having thread-safety at
least for getters, which would only need locks around the internal
management.
- Another implementation of #2 above might use dynamic TLS
(pthread_key_*),
we chose this implementation to reduce the overall complexity.
Are there any other trade-offs?
If the application spawns N threads that all use a Dwarf object (same
or different) enough to cause an allocation, and then joins them all,
any Dwarf objects allocated after will allocate N unusable slots in the
mem_tails array. Thus long-running applications (that don't use thread
pools) would start experiencing effects similar to a memory leak, of 1
pointer's worth (8 bytes on 64-bit) per dead thread.
The alternative is to use dynamic TLS so that only threads that are
currently active use the extra memory, assuming libpthread is
sufficiently proactive about reclaiming unused key values. I think if
we assume `dwarf_end` happens-after any memory management (which would
make sense for a destructor), there should be a simple atomic pattern
to handle the freeing, but I would need to sit down for a while to work
out a sufficient proof.
I was also under the impression that dynamic TLS was particularly
expensive performance-wise, although I haven't experimented with it
myself enough to know. The compiler can be a little smarter about
static TLS, and the result is easier to reason about, which is why we
chose this implementation for initial patch. If the alternative would
be preferable we can change it.
Thanks,
Mark