Re: problem with qpid heartbeats when sending msgs with size over 1KB

Tom M Wed, 08 Feb 2012 10:02:15 -0800

Hi Gordon (and team),
have you been able to try this yet?
Or should I go ahead and submit this as a bug/issue to be worked?


thanks,
Tom

On Tue, Jan 31, 2012 at 4:03 PM, Tom M <[email protected]> wrote:

> Hi Gordon (and team),
> Have you had a chance to look at this more? In particular, have you tried
> running the test while pulling the network cable (or abruptly shutting down
> the broker host)?
> thanks,
> Tom
>
>
>
> On Sat, Jan 14, 2012 at 6:07 PM, Tom M <[email protected]> wrote:
>
>> I don't have root privileges on this system to run the script from Alan.
>> But I did verify again that I get the same results just by disconnecting
>> the network cable for the client host (while running my test client).
>> Incidentally, I don't remember if I mentioned it, but we first detected this
>> problem when our broker host crashed, so neither the OS, the NIC, nor the
>> switch link had an opportunity to close gracefully. I assume that case is
>> intended to be covered. Just pulling the network cable appears to give
>> the same result, and is obviously a less radical test to work with.
>>
>> Also, a few more notes from the original trace logs (which you may have
>> already noticed).
>> * As seen in the failed case 01_03d (which actually had trace logging; I
>> mislabeled it in my email description), with the larger sent message size,
>> the underlying qpid client had stopped sending msgs before the heartbeat
>> timeout would have occurred.
>> The heartbeat rate was set to 8.
>> The cable was pulled at about "msg: 30", and the last "trace SENT" was
>> for msg:41 (one msg per sec).
>> Then, 4 more messages were "sent" by the application (via
>> MessageReplayTracker), but there are no traces from the qpid client code.
>> Then, almost 16 seconds after the net disconnect, there is an indication
>> of "Traffic timeout", but there is no action by the client code after
>> this.
>>
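The timing in the failed case can be laid out as simple arithmetic. This is only a sketch built from the numbers quoted above; the 2x factor is inferred from the observed ~16 s "Traffic timeout" with a heartbeat of 8 (assumed to be seconds), and the constant names are illustrative, not qpid settings.

```python
# Numbers from the 01_03d trace description; names are illustrative only.
HEARTBEAT_S = 8             # heartbeat setting, assumed to be in seconds
TIMEOUT_FACTOR = 2          # assumption inferred from the ~16 s observation
MSG_INTERVAL_S = 1          # one msg per sec
CABLE_PULLED_AT_MSG = 30    # approximate
LAST_TRACED_SENT_MSG = 41   # last "trace SENT" from the qpid client
UNTRACED_APP_SENDS = 4      # messages "sent" by the application with no trace


def seconds_after_pull(msg_number):
    """Approximate seconds between the cable pull and a given message."""
    return (msg_number - CABLE_PULLED_AT_MSG) * MSG_INTERVAL_S


expected_timeout_s = HEARTBEAT_S * TIMEOUT_FACTOR
last_trace_s = seconds_after_pull(LAST_TRACED_SENT_MSG)
last_app_send_s = seconds_after_pull(LAST_TRACED_SENT_MSG + UNTRACED_APP_SENDS)

# The client went quiet before the timeout was due.
print(expected_timeout_s, last_trace_s, last_app_send_s)
```

Under these assumptions the client's last traced send falls several seconds before the point where the timeout would be expected.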
>> In the good cases (i.e., case 01_03b with the smaller msg, again with the net
>> cable pulled at about msg:30), we continue to see the qpid client performing
>> "SENT"s for all messages up to the "Traffic timeout". Then the timeout occurs,
>> followed by the "Exception constructed" for the close (which does
>> not happen in the failed case).
>>
>> I'm wondering if an outgoing send buffer filling up is somehow
>> blocking the logic that acts on the timeout.
>>
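The send-buffer hypothesis can be illustrated outside qpid. The sketch below is not qpid code: it uses a local socket pair whose peer never reads, as a stand-in for a connection where data stops draining, and shows that the kernel accepts only a bounded number of bytes before a send can no longer complete. The function name is hypothetical.

```python
import socket


def fill_send_buffer(chunk_size=1024):
    """Write to a socket whose peer never reads, until the kernel refuses more.

    Returns the number of bytes accepted before the send buffer was full.
    """
    sender, receiver = socket.socketpair()
    sender.setblocking(False)  # a blocking socket would simply hang here
    total = 0
    try:
        while True:
            try:
                total += sender.send(b"x" * chunk_size)
            except BlockingIOError:
                # Buffer is full: a blocking send would stall at this point.
                return total
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print("bytes buffered before the send would block:", fill_send_buffer())
```

Larger messages reach that limit in fewer sends, which would fit the observation that only the larger message size shows the problem. Whether a stalled send actually blocks the client's timeout handling is still only a guess.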
>> I don't know if this is somehow affected by the actual link level getting
>> disconnected (which, if related, might help to explain the
>> differing results when doing a STOP on the broker as opposed to the broker
>> host going down or the net being pulled).
>> Also, I'm wondering if just dropping the network packets will have the same
>> or a different effect (particularly if the condition requires the send path
>> backing up).
>>
>> thanks,
>> Tom
>>
>>
>> On Fri, Jan 13, 2012 at 8:34 AM, Alan Conway <[email protected]> wrote:
>>
>>>  On 01/13/2012 09:07 AM, Tom M wrote:
>>>
>>>> Hi Gordon,
>>>> sorry I didn't get back to you yesterday, but I needed to get an
>>>> opportunity to start a separate broker that I could stop on one of our
>>>> systems.
>>>>
>>>> We had seen these different results with our deployed system, and I
>>>> wanted to make sure that this test client acted the same way.
>>>> (On our deployed system, most of the failover testing had been done with
>>>> a kill on the broker, and we would see the client detect the lost
>>>> connection. But later, when a host died, we found this other condition,
>>>> where the client did not detect it.)
>>>>
>>>> So, I ran the test:
>>>> * started a separate broker (see below)
>>>> * started the same test client, as before, with the larger msg size
>>>> * did a kill on the broker:    kill -STOP <pid>
>>>> * saw the same results as you did: the client detected the lost
>>>> connection in about 2x the heartbeat rate
>>>>
>>>> Then, to verify my earlier results, I ran the exact same test, except
>>>> this time I pulled the network cable:
>>>> * started a separate broker (same as previous run above)
>>>> * started the same test client, as before, with the larger msg size
>>>> * pulled network cable
>>>> * saw the same results as in my previous tests: the client continued to
>>>> "send" well past when the heartbeat timeout should have occurred (with the
>>>> same trace messages) until, about 80 seconds later, it locked up.
>>>>
>>>> Note, I've seen that this result (the client sending the larger msg
>>>> missing the detection of the lost connection) also happens if the broker
>>>> host abruptly dies (which is how we first detected the problem).
>>>>
>>>>
>>>> Note: I used the following to start the broker:
>>>> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
>>>> /export/hps/dda/qpidd_x/log/qpidd_x.log --worker-threads 3 --data-dir
>>>> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-dir
>>>> --auth no --config /dev/null
>>>>
>>>> So, please let me know if you can run the test again, but pulling the
>>>> network cable. (I'm pulling the net between broker and switch, but I'm
>>>> pretty sure I've seen the same when pulling the net between switch and
>>>> client.)
>>>> thanks,
>>>> Tom
>>>>
>>>
>>> You can simulate a network cable pull by telling iptables to drop
>>> packets. Attached is an old script, no warranty.
>>> WARNING: if you've got remote access only to the machine in question be
>>> careful you don't shut yourself out! The attached script only drops
>>> corosync/openais packets, so you can still ssh etc.
>>>
>>
>>
>
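
A minimal sketch of the iptables approach Alan describes, adapted to the broker port from the thread (18102). This is not his attached script, which dropped corosync/openais packets. The helper names are hypothetical, the rules are meant for the broker host, and actually applying them requires root. By default the sketch only prints the commands.

```python
import subprocess


def drop_rules(port, action="-A"):
    """Build iptables commands that drop TCP traffic for one port.

    action is "-A" to append the rules or "-D" to delete them again.
    On the broker host, inbound packets target the port (--dport) and
    outbound packets originate from it (--sport).
    """
    return [
        ["iptables", action, "INPUT", "-p", "tcp",
         "--dport", str(port), "-j", "DROP"],
        ["iptables", action, "OUTPUT", "-p", "tcp",
         "--sport", str(port), "-j", "DROP"],
    ]


def apply_rules(rules, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for rule in rules:
        if dry_run:
            print(" ".join(rule))
        else:
            subprocess.run(rule, check=True)


if __name__ == "__main__":
    apply_rules(drop_rules(18102))          # simulate the cable pull
    apply_rules(drop_rules(18102, "-D"))    # restore connectivity
```

Because only the broker port is matched, ssh access to the host is unaffected, in line with the warning above about not shutting yourself out.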
