I don't have root privileges on this system to run the script from Alan.
But I did verify again that I get the same results just by disconnecting
the network cable from the client host (while running my test client).
(Incidentally, I don't remember if I mentioned it, but we first detected
this problem when our broker host crashed (so neither the OS, NIC, nor
switch link had an opportunity to close gracefully), which, I assume, is a
case that is intended to be covered; just pulling the network cable appears
to give the same result (and is, obviously, a less radical test to work
with)).
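For anyone without physical access to the hardware, the cable-pull case can be approximated in software. This is only a sketch, assuming root access and that the client's interface is eth0 (the interface name is my assumption, not from this thread):

```shell
# Approximate a cable pull by taking the client's NIC down.
# ASSUMPTION: the interface is eth0; adjust for your system.
ip link set eth0 down

# ... run the test client and watch for the heartbeat timeout ...

# Restore connectivity afterwards.
ip link set eth0 up
```

Note that, unlike dropping packets in a firewall, `ip link ... down` also signals loss of carrier to the local stack, so it is closer to the physical cable pull than to a silent packet drop.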

Also, a few more notes from the original trace logs (which you may have
already noticed):
* As seen in the failed case 01_03d (which actually did have trace logging;
I mislabeled it in my email description), with the larger sent message
size, the underlying qpid client had stopped sending messages before the
heartbeat timeout would have occurred.
The heartbeat rate was set to 8.
The cable was pulled at about "msg: 30", and the last "trace SENT" was for
msg:41 (one msg per sec).
Then 4 more messages were "sent" by the application (via
MessageReplayTracker), but there are no traces from the qpid client code.
Then, almost 16 seconds after the net disconnect, there is an indication
of "Traffic timeout", but the client code takes no action after this.

In the good cases (i.e. case 01_03b with the smaller msg, again with the
net cable pulled at about msg:30), we continue to see the qpid client
performing "SENT"s for all messages up to the "Traffic timeout". Then the
timeout occurs, followed by the "Exception constructed" for the close
(which does not happen in the failed case).

I'm wondering if the outgoing send buffer filling up is somehow blocking
the logic that should act on the timeout.

I don't know if this is somehow affected by the actual link level going
down (which, if related, might help to explain the differing results when
doing a STOP on the broker as opposed to the broker host going down or the
net being pulled).
Also, I'm wondering whether just dropping the network packets would have
the same or a different effect (particularly if the condition requires the
send path to back up).
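To test the packet-drop variant, Alan's iptables approach could be adapted to the broker's port. A sketch, assuming root access and the broker port 18102 from the qpidd command line quoted below; heed Alan's warning about locking yourself out of a remote machine:

```shell
# Silently drop all broker traffic, simulating a black-hole network
# without touching the link state.
# ASSUMPTION: broker port is 18102 (from the qpidd command line below).
iptables -I INPUT  -p tcp --sport 18102 -j DROP
iptables -I OUTPUT -p tcp --dport 18102 -j DROP

# ... observe whether the client detects the heartbeat timeout ...

# Delete the rules to restore traffic.
iptables -D INPUT  -p tcp --sport 18102 -j DROP
iptables -D OUTPUT -p tcp --dport 18102 -j DROP
```

Because DROP discards packets without sending any TCP RST or ICMP error, the client's outgoing data should back up in the kernel send buffer much as with a pulled cable, which would exercise the suspected send-path backlog while the link itself stays up.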

thanks,
Tom


On Fri, Jan 13, 2012 at 8:34 AM, Alan Conway <[email protected]> wrote:

>  On 01/13/2012 09:07 AM, Tom M wrote:
>
>> Hi Gordon,
>> sorry I didn't get back to you yesterday, but I needed to get an
>> opportunity to start a separate broker that I could stop on one of our
>> systems.
>>
>> We had seen these different results with our deployed system, and I
>> wanted to make sure that this test client acted the same way.
>> (on our deployed system, most of the failover testing had been done with
>> a kill on the broker, and we would see the client detect the lost
>> connection.  But later, when a host died, we found this other condition,
>> where the client did not detect it)
>>
>> So, I ran the test:
>> * started a separate broker (see below)
>> * started the same test client, as before, with the larger msg size
>> * did a kill on the broker:    kill -STOP <pid>
>> * saw the same results as you did: the client detected the lost
>> connection in about 2x the heartbeat rate
>>
>> Then, to verify my earlier results, I ran the same exact test, except this
>> time pulled the network cable:
>> * started a separate broker (same as previous run above)
>> * started the same test client, as before, with the larger msg size
>> * pulled network cable
>> * saw the same results as in my previous tests: the client continued to
>> "send" well past where the heartbeat timeout should have been (seeing
>> the same trace messages), until it locked up about 80 seconds later.
>>
>> Note: this result (the client sending the larger msg and missing the
>> detection of the lost connection) also happens if the broker host
>> abruptly dies (which is how we first detected the problem).
>>
>>
>> Note: I used the following to start broker:
>> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
>> /export/hps/dda/qpidd_x/log/qpidd_x.log --worker-threads 3 --data-dir
>> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-dir
>> --auth no --config /dev/null
>>
>> So, please let me know if you can run the test again, but pulling the
>> network cable (I'm pulling the net between the broker and the switch,
>> but I'm pretty sure I've seen the same when pulling the net between the
>> switch and the client).
>> thanks,
>> Tom
>>
>
> You can simulate a network cable pull by telling iptables to drop packets.
> Attached is an old script, no warranty.
> WARNING: if you've got remote access only to the machine in question be
> careful you don't shut yourself out! The attached script only drops
> corosync/openais packets, so you can still ssh etc.
>
