great suggestion. I found that just before the crash, ReaderArena gets constructed twice at the same memory location. That would explain why locks don't work as expected. Looking at the capnp code, we were wondering whether allocatedArena variable of MessageReader needs to be protected for multi-threading.
On Monday, October 7, 2019 at 12:16:42 PM UTC-7, Kenton Varda wrote: > > Could this be use-after-free? Maybe the MessageReader (and hence the > mutex) is already deleted? > > -Kenton > > On Mon, Oct 7, 2019 at 11:58 AM <[email protected] <javascript:>> > wrote: > >> Thanks for helping me debug this. >> >> 1. With KJ_USE_FUTEX=0 I'm hitting what initially seems to be a >> deadlock. I'll debug this further now. >> 2. I don't have any crash if I put a static std::mutex >> in c++/src/capnp/arena.c++ and lock it using std::unique_lock right >> after moreSegments.lockExclusive(); >> 3. I still observe the issue if (instead of #2) I make >> std::unique_ptr<std::mutex> a member variable of ReaderArena and lock it >> before or after moreSegments.lockExclusive(); >> - I have printed KJ_DBG(my_unique_lock.owns_lock(), my_thread_id, >> this); and interestingly, two threads claim they own the lock: >> - >> - capnp/arena.c++:150: debug: l.owns_lock() = true; this_id = >> 12814521112356468066; this = 7fd7f800a478 >> - capnp/arena.c++:150: debug: l.owns_lock() = true; this_id = >> 2054456097081703779; this = 7fd7f800a478 >> - This is really weird :) >> >> I'm on Ubuntu 18.04.3 LTS. >> >> On Saturday, October 5, 2019 at 4:16:32 PM UTC-7, Kenton Varda wrote: >>> >>> Hmm, that's strange! >>> >>> What operating system is this? >>> >>> If it happens to be Linux, could you try compiling everything with >>> -DKJ_USE_FUTEX=0 (or remove the `#define KJ_USE_FUTEX 1` from the top of >>> c++/src/kj/mutex.h), and see if that changes anything? This change will >>> make KJ use a completely different mutex implementation. (That said, the >>> futex-based implementation has seen very heavy use with no problems in the >>> past, so it would be surprising if it were broken somehow.) >>> >>> -Kenton >>> >>> On Sat, Oct 5, 2019 at 2:35 PM <[email protected]> wrote: >>> >>>> As an update, I've tried to place the following messages >>>> to c++/src/capnp/arena.c++: >>>> >>>> SegmentMap* segments = nullptr; >>>> KJ_IF_MAYBE(s, *lock) { >>>> KJ_IF_MAYBE(segment, s->find(id.value)) { >>>> return *segment; >>>> } >>>> segments = s; >>>> + } else { >>>> + size_t this_id = >>>> std::hash<std::thread::id>{}(std::this_thread::get_id()); >>>> + KJ_DBG("map doesn't exist", this_id, this); >>>> } >>>> >>>> It looks like (just before the crash) multiple threads print "map >>>> doesn't exist" for the same 'this' value. It's as if lock did not work for >>>> some reason. I could not reproduce the issue in a pure capnp test yet. >>>> >>>> For context, we have the same type of message with 2 segments printed >>>> in a high frequency. We have a stack of them being read by multiple >>>> readers. Apart from the mentioned exception being thrown, we often have >>>> segfaults in the insert() function. >>>> >>>> On Tuesday, October 1, 2019 at 10:39:16 AM UTC-7, Cenk Oguz Saglam >>>> wrote: >>>>> >>>>> Thanks for the quick response Kenton. >>>>> >>>>> I was also suspecting a race condition. Thanks for checking the mutex. >>>>> It is very likely that the issue is due to our usage. I'll share what I >>>>> find as I debug this further. >>>>> >>>>> On Tuesday, October 1, 2019 at 9:30:42 AM UTC-7, Kenton Varda wrote: >>>>>> >>>>>> Hi Oguz, >>>>>> >>>>>> You can get better stack traces by compiling in debug mode (both >>>>>> Cap'n Proto itself, and your project). You should then see a symbolic >>>>>> trace >>>>>> instead of a bunch of addresses. >>>>>> >>>>>> This is a strange error, though. Looking at the code for >>>>>> ReaderArena::tryGetSegment(), the insert() call only happens after a >>>>>> find() >>>>>> call looking for the same key has failed. How could the inserted row >>>>>> already exist, then? >>>>>> >>>>>> Moreover, the whole sequence is performed under a mutex lock, >>>>>> seemingly ruling out any race conditions. >>>>>> >>>>>> I'm not sure what to say here. If you can come up with a >>>>>> self-contained test case that reproduces the issue, I'd be happy to >>>>>> debug. >>>>>> >>>>>> -Kenton >>>>>> >>>>>> On Tue, Oct 1, 2019 at 9:10 AM <[email protected]> wrote: >>>>>> >>>>>>> Thanks for this amazing software. >>>>>>> >>>>>>> We are using v0.7.0. I would like to ask help debugging the >>>>>>> following exception which we rarely but consistently get: >>>>>>> >>>>>>> terminate called after throwing an instance of 'kj::ExceptionImpl' >>>>>>> what(): kj/table.c++:44: failed: inserted row already exists in >>>>>>> table >>>>>>> stack: 7f6f7f0697 7f6f7f1ee3 7f6f802623 7f6f5dc9ab 7f6f5b4823 >>>>>>> 7f77fdd5bf 557c6f2957 557c63df2b 7f78a30e13 7f78b0f087 >>>>>>> >>>>>>> Our backtrace shows that we were trying to read from a proto, then >>>>>>> the following two functions in capnp were called: >>>>>>> >>>>>>> - kj::Table<kj::HashMap<unsigned int, >>>>>>> kj::Own<capnp::_::SegmentReader> >::Entry, >>>>>>> kj::HashIndex<kj::HashMap<unsigned int, >>>>>>> kj::Own<capnp::_::SegmentReader> >>>>>>> >::Callbacks> >::insert(kj::HashMap<unsigned int, >>>>>>> kj::Own<capnp::_::SegmentReader> >::Entry&&) >>>>>>> - capnp::_::ReaderArena::tryGetSegment(kj::Id<unsigned int, >>>>>>> capnp::_::Segment>) >>>>>>> >>>>>>> Why would reading from proto trigger an insert call? >>>>>>> >>>>>>> How can I make use of the "stack: 7f6f7f0697 7f6f7f1ee3..." to >>>>>>> debug this further? >>>>>>> >>>>>>> If the way we read proto is OK, can this perhaps be caused by how we >>>>>>> populate the proto contents? Perhaps some missing or corrupt members? >>>>>>> >>>>>>> Thank you very much for help in advance, >>>>>>> Oguz >>>>>>> >>>>>>> -- >>>>>>> You received this message because you are subscribed to the Google >>>>>>> Groups "Cap'n Proto" group. >>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>> send an email to [email protected]. >>>>>>> To view this discussion on the web visit >>>>>>> https://groups.google.com/d/msgid/capnproto/d6512e2e-b990-47c6-9892-a252c0c629c3%40googlegroups.com >>>>>>> >>>>>>> <https://groups.google.com/d/msgid/capnproto/d6512e2e-b990-47c6-9892-a252c0c629c3%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>>>> . >>>>>>> >>>>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "Cap'n Proto" group. >>>> To unsubscribe from this group and stop receiving emails from it, send >>>> an email to [email protected]. >>>> To view this discussion on the web visit >>>> https://groups.google.com/d/msgid/capnproto/874ea12b-865c-4acb-9b11-74cd0154ee63%40googlegroups.com >>>> >>>> <https://groups.google.com/d/msgid/capnproto/874ea12b-865c-4acb-9b11-74cd0154ee63%40googlegroups.com?utm_medium=email&utm_source=footer> >>>> . >>>> >>> -- >> You received this message because you are subscribed to the Google Groups >> "Cap'n Proto" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected] <javascript:>. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/capnproto/bbe9f2bd-24c9-4508-a640-bb92824a093f%40googlegroups.com >> >> <https://groups.google.com/d/msgid/capnproto/bbe9f2bd-24c9-4508-a640-bb92824a093f%40googlegroups.com?utm_medium=email&utm_source=footer> >> . >> > -- You received this message because you are subscribed to the Google Groups "Cap'n Proto" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/capnproto/245fd14b-0061-4346-af50-3804baef15f5%40googlegroups.com.
