Ah, hmm. So you are accessing the MesasgeReader itself (not just the
objects in it) from multiple threads? It has been a long time since I wrote
this code, but I suspect that wasn't intended to be supported. Normally
thread-safe methods are marked `const`, and none of MessageReader's methods
appear to be `const`. In contrast, the getter/setter methods on individual
Reader objects are marked `const`, hence all the mutex-locking to make sure
that changes they might trigger are thread-safe. But to get a Reader object
initially, you have to call getRoot(), which initializes the arena and sets
`allocatedArena = true`.

I admit this is not very intuitive. MessageReader *feels* like a read-only
type and hence like it ought to be thread-safe. I think probably I should
refactor MessageReader so that it loads the whole segment table upfront and
doesn't do any silly mutex locking, though I think subclasses may need to
be changed to support that, bleh.

As a simple work-around to fix your problem, I think you can do:

    messageReader.getRoot<capnp::AnyPointer>();

immediately after allocating the MessageReader, before it has been made
visible to other threads.

A more-correct solution is to call getRoot() with the actual root type and
then only share the returned struct Reader with other threads; don't share
the MessageReader.

-Kenton

On Mon, Oct 7, 2019 at 5:38 PM <[email protected]> wrote:

> great suggestion. I found that just before the crash, ReaderArena gets
> constructed twice at the same memory location. That would explain why locks
> don't work as expected. Looking at the capnp code, we were wondering
> whether allocatedArena variable of MessageReader needs to be protected for
> multi-threading.
>
> On Monday, October 7, 2019 at 12:16:42 PM UTC-7, Kenton Varda wrote:
>>
>> Could this be use-after-free? Maybe the MessageReader (and hence the
>> mutex) is already deleted?
>>
>> -Kenton
>>
>> On Mon, Oct 7, 2019 at 11:58 AM <[email protected]> wrote:
>>
>>> Thanks for helping me debug this.
>>>
>>>    1. With KJ_USE_FUTEX=0 I'm hitting what initially seems to be a
>>>    deadlock. I'll debug this further now.
>>>    2. I don't have any crash if I put a static std::mutex
>>>    in c++/src/capnp/arena.c++ and lock it using std::unique_lock right
>>>    after moreSegments.lockExclusive();
>>>    3. I still observe the issue if (instead of #2) I make
>>>    std::unique_ptr<std::mutex> a member variable of ReaderArena and lock it
>>>    before or after moreSegments.lockExclusive();
>>>       - I have printed KJ_DBG(my_unique_lock.owns_lock(), my_thread_id,
>>>       this); and interestingly, two threads claim they own the lock:
>>>          -
>>>          - capnp/arena.c++:150: debug: l.owns_lock() = true; this_id =
>>>          12814521112356468066; this = 7fd7f800a478
>>>          - capnp/arena.c++:150: debug: l.owns_lock() = true; this_id =
>>>          2054456097081703779; this = 7fd7f800a478
>>>       - This is really weird :)
>>>
>>> I'm on Ubuntu 18.04.3 LTS.
>>>
>>> On Saturday, October 5, 2019 at 4:16:32 PM UTC-7, Kenton Varda wrote:
>>>>
>>>> Hmm, that's strange!
>>>>
>>>> What operating system is this?
>>>>
>>>> If it happens to be Linux, could you try compiling everything with
>>>> -DKJ_USE_FUTEX=0 (or remove the `#define KJ_USE_FUTEX 1` from the top of
>>>> c++/src/kj/mutex.h), and see if that changes anything? This change will
>>>> make KJ use a completely different mutex implementation. (That said, the
>>>> futex-based implementation has seen very heavy use with no problems in the
>>>> past, so it would be surprising if it were broken somehow.)
>>>>
>>>> -Kenton
>>>>
>>>> On Sat, Oct 5, 2019 at 2:35 PM <[email protected]> wrote:
>>>>
>>>>> As an update, I've tried to place the following messages
>>>>> to c++/src/capnp/arena.c++:
>>>>>
>>>>>    SegmentMap* segments = nullptr;
>>>>>    KJ_IF_MAYBE(s, *lock) {
>>>>>      KJ_IF_MAYBE(segment, s->find(id.value)) {
>>>>>        return *segment;
>>>>>      }
>>>>>      segments = s;
>>>>> +  } else {
>>>>> +    size_t this_id =
>>>>> std::hash<std::thread::id>{}(std::this_thread::get_id());
>>>>> +    KJ_DBG("map doesn't exist", this_id, this);
>>>>>    }
>>>>>
>>>>> It looks like (just before the crash) multiple threads print "map
>>>>> doesn't exist" for the same 'this' value. It's as if lock did not work for
>>>>> some reason. I could not reproduce the issue in a pure capnp test yet.
>>>>>
>>>>> For context, we have the same type of message with 2 segments printed
>>>>> in a high frequency. We have a stack of them being read by multiple
>>>>> readers. Apart from the mentioned exception being thrown, we often have
>>>>> segfaults in the insert() function.
>>>>>
>>>>> On Tuesday, October 1, 2019 at 10:39:16 AM UTC-7, Cenk Oguz Saglam
>>>>> wrote:
>>>>>>
>>>>>> Thanks for the quick response Kenton.
>>>>>>
>>>>>> I was also suspecting a race condition. Thanks for checking the
>>>>>> mutex. It is very likely that the issue is due to our usage. I'll share
>>>>>> what I find as I debug this further.
>>>>>>
>>>>>> On Tuesday, October 1, 2019 at 9:30:42 AM UTC-7, Kenton Varda wrote:
>>>>>>>
>>>>>>> Hi Oguz,
>>>>>>>
>>>>>>> You can get better stack traces by compiling in debug mode (both
>>>>>>> Cap'n Proto itself, and your project). You should then see a symbolic 
>>>>>>> trace
>>>>>>> instead of a bunch of addresses.
>>>>>>>
>>>>>>> This is a strange error, though. Looking at the code for
>>>>>>> ReaderArena::tryGetSegment(), the insert() call only happens after a 
>>>>>>> find()
>>>>>>> call looking for the same key has failed. How could the inserted row
>>>>>>> already exist, then?
>>>>>>>
>>>>>>> Moreover, the whole sequence is performed under a mutex lock,
>>>>>>> seemingly ruling out any race conditions.
>>>>>>>
>>>>>>> I'm not sure what to say here. If you can come up with a
>>>>>>> self-contained test case that reproduces the issue, I'd be happy to 
>>>>>>> debug.
>>>>>>>
>>>>>>> -Kenton
>>>>>>>
>>>>>>> On Tue, Oct 1, 2019 at 9:10 AM <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for this amazing software.
>>>>>>>>
>>>>>>>> We are using v0.7.0. I would like to ask help debugging the
>>>>>>>> following exception which we rarely but consistently get:
>>>>>>>>
>>>>>>>> terminate called after throwing an instance of 'kj::ExceptionImpl'
>>>>>>>>   what():  kj/table.c++:44: failed: inserted row already exists in
>>>>>>>> table
>>>>>>>> stack: 7f6f7f0697 7f6f7f1ee3 7f6f802623 7f6f5dc9ab 7f6f5b4823
>>>>>>>> 7f77fdd5bf 557c6f2957 557c63df2b 7f78a30e13 7f78b0f087
>>>>>>>>
>>>>>>>> Our backtrace shows that we were trying to read from a proto, then
>>>>>>>> the following two functions in capnp were called:
>>>>>>>>
>>>>>>>>    - kj::Table<kj::HashMap<unsigned int,
>>>>>>>>    kj::Own<capnp::_::SegmentReader> >::Entry,
>>>>>>>>    kj::HashIndex<kj::HashMap<unsigned int, 
>>>>>>>> kj::Own<capnp::_::SegmentReader>
>>>>>>>>    >::Callbacks> >::insert(kj::HashMap<unsigned int,
>>>>>>>>    kj::Own<capnp::_::SegmentReader> >::Entry&&)
>>>>>>>>    - capnp::_::ReaderArena::tryGetSegment(kj::Id<unsigned int,
>>>>>>>>    capnp::_::Segment>)
>>>>>>>>
>>>>>>>> Why would reading from proto trigger an insert call?
>>>>>>>>
>>>>>>>> How can I make use of the "stack: 7f6f7f0697 7f6f7f1ee3..." to
>>>>>>>> debug this further?
>>>>>>>>
>>>>>>>> If the way we read proto is OK, can this perhaps be caused by how
>>>>>>>> we populate the proto contents? Perhaps some missing or corrupt 
>>>>>>>> members?
>>>>>>>>
>>>>>>>> Thank you very much for help in advance,
>>>>>>>> Oguz
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "Cap'n Proto" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/capnproto/d6512e2e-b990-47c6-9892-a252c0c629c3%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/capnproto/d6512e2e-b990-47c6-9892-a252c0c629c3%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Cap'n Proto" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/capnproto/874ea12b-865c-4acb-9b11-74cd0154ee63%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/capnproto/874ea12b-865c-4acb-9b11-74cd0154ee63%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Cap'n Proto" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/capnproto/bbe9f2bd-24c9-4508-a640-bb92824a093f%40googlegroups.com
>>> <https://groups.google.com/d/msgid/capnproto/bbe9f2bd-24c9-4508-a640-bb92824a093f%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>> --
> You received this message because you are subscribed to the Google Groups
> "Cap'n Proto" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/capnproto/245fd14b-0061-4346-af50-3804baef15f5%40googlegroups.com
> <https://groups.google.com/d/msgid/capnproto/245fd14b-0061-4346-af50-3804baef15f5%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/capnproto/CAJouXQnFRUaFiZowRGCC0ycr0Dw5QroLzowTu987UY-4aR-N-Q%40mail.gmail.com.

Reply via email to