Hello all,

My name is Bogdan Codres from Wind River.
Recently, we received a crash report from one of our customers. It happened only once and we do not have a clear way to reproduce it. The crash occurred on ARMv7 with lttng-tools 2.12. This is the backtrace of the crash:

(gdb) bt
#0  __libc_do_syscall () at libc-do-syscall.S:49
#1  0xb6e13ad4 in __libc_signal_restore_set (set=0xb39f94e0) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48
#3  0xb6e061a6 in __GI_abort () at abort.c:79
#4  0xb6e0ed90 in __assert_fail_base (fmt=0xb6ebfed0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x517e10 "!stream->trace_chunk", assertion@entry=0xb39fe300 "\001", file=0x51d844 "../../../../git/src/common/ust-consumer/ust-consumer.c", file@entry=0x0, line=1124, line@entry=5363780, function=function@entry=0x51d234 <__PRETTY_FUNCTION__.15949> "snapshot_channel") at assert.c:92
#5  0xb6e0ee0e in __GI___assert_fail (assertion=0xb39fe300 "\001", file=0x0, line=5363780, line@entry=1124, function=0x51d234 <__PRETTY_FUNCTION__.15949> "snapshot_channel") at assert.c:101
#6  0x004f5840 in snapshot_channel (channel=0xb42008d0, key=1, path=path@entry=0xb39f9964 "ust/uid/0/32-bit", relayd_id=relayd_id@entry=18446744073709551615, nb_packets_per_stream=0, ctx=ctx@entry=0x544048) at ../../../../git/src/common/ust-consumer/ust-consumer.c:1124
#7  0x004f9a08 in lttng_ustconsumer_recv_cmd (ctx=0x544048, sock=30, consumer_sockpoll=<optimized out>) at ../../../../git/src/common/ust-consumer/ust-consumer.c:1790
#8  0x004dfac0 in consumer_thread_sessiond_poll (data=0x544048) at ../../../../git/src/common/consumer/consumer.c:3361
#9  0xb6ee7b00 in start_thread (arg=0x98396ec3) at pthread_create.c:486
#10 0xb6e853bc in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /sysroots/armv7at2-neon-wrs-linux-gnueabi/lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The failing assertion is assert(!stream->trace_chunk), i.e. the stream's trace_chunk already exists. There is a comment on the function saying "the caller must take RCU read side lock and channel lock". The RCU read side lock is taken by snapshot_channel itself, but from what I can see, nothing takes the channel lock in the functions that call snapshot_channel. This looks like a race condition: if the comment is correct and the channel lock really is missing, that could explain the crash, so I am wondering whether anyone else has seen it.

While investigating the source code, I noticed that lttng_kconsumer_recv_cmd, which has a structure similar to lttng_ustconsumer_recv_cmd, does take pthread_mutex_lock(&channel->lock) in its LTTNG_CONSUMER_SNAPSHOT_CHANNEL handler:

    } else {
        pthread_mutex_lock(&channel->lock);
        if (msg.u.snapshot_channel.metadata == 1) {
            ret = lttng_kconsumer_snapshot_metadata(channel, key,
                    msg.u.snapshot_channel.pathname,
                    msg.u.snapshot_channel.relayd_id, ctx);
            if (ret < 0) {
                ERR("Snapshot metadata failed");
                ret_code = LTTCOMM_CONSUMERD_SNAPSHOT_FAILED;
            }
        } else {
            ret = lttng_kconsumer_snapshot_channel(channel, key,
                    msg.u.snapshot_channel.pathname,
                    msg.u.snapshot_channel.relayd_id,
                    msg.u.snapshot_channel.nb_packets_per_stream, ctx);
            if (ret < 0) {
                ERR("Snapshot channel failed");
                ret_code = LTTCOMM_CONSUMERD_SNAPSHOT_FAILED;
            }
        }
        pthread_mutex_unlock(&channel->lock);

So my question is: shouldn't lttng_ustconsumer_recv_cmd also take the channel lock around the snapshot calls, like lttng_kconsumer_recv_cmd does?
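If the missing channel lock is indeed the problem, the fix I have in mind would look roughly like the sketch below for the LTTNG_CONSUMER_SNAPSHOT_CHANNEL case of lttng_ustconsumer_recv_cmd, mirroring the kernel consumer path. Please treat it only as a sketch against 2.12, not a tested patch: I am assuming the UST consumer's snapshot helpers are the static snapshot_metadata() and snapshot_channel() functions with the arguments visible in the backtrace, and the surrounding error handling may differ in the real code.

    /* Sketch only: LTTNG_CONSUMER_SNAPSHOT_CHANNEL case in lttng_ustconsumer_recv_cmd(). */
    } else {
        /* Take the channel lock, as lttng_kconsumer_recv_cmd already does. */
        pthread_mutex_lock(&channel->lock);
        if (msg.u.snapshot_channel.metadata) {
            ret = snapshot_metadata(channel, key,
                    msg.u.snapshot_channel.pathname,
                    msg.u.snapshot_channel.relayd_id, ctx);
            if (ret < 0) {
                ERR("Snapshot metadata failed");
                ret_code = LTTCOMM_CONSUMERD_SNAPSHOT_FAILED;
            }
        } else {
            ret = snapshot_channel(channel, key,
                    msg.u.snapshot_channel.pathname,
                    msg.u.snapshot_channel.relayd_id,
                    msg.u.snapshot_channel.nb_packets_per_stream, ctx);
            if (ret < 0) {
                ERR("Snapshot channel failed");
                ret_code = LTTCOMM_CONSUMERD_SNAPSHOT_FAILED;
            }
        }
        pthread_mutex_unlock(&channel->lock);
    }

That would make the UST path satisfy the "channel lock" requirement stated in the comment on snapshot_channel, the same way the kernel consumer path already does.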
What's your opinion on this issue?

Best Regards,
Ph.D. eng. Bogdan Codres
Senior Engineer at RDC-EMEA, Professional Services, Wind River