Hi Gordon (and team),
Have you been able to try this yet? Or should I go ahead and submit this as a bug/issue to be worked?
thanks,
Tom

On Tue, Jan 31, 2012 at 4:03 PM, Tom M <[email protected]> wrote:

> Hi Gordon (and team),
> Have you had a chance to look at this more? (In particular, running the
> test while pulling the network cable, or abruptly shutting down the broker
> host.)
> thanks,
> Tom
>
> On Sat, Jan 14, 2012 at 6:07 PM, Tom M <[email protected]> wrote:
>
>> I don't have root privileges on this system to run the script from Alan.
>> But I did again verify that I get the same results just disconnecting
>> the network cable for the client host (while running my test client).
>> (Incidentally, I don't remember if I mentioned it, but we first detected
>> this problem when our broker host crashed, so neither the OS, the NIC,
>> nor the switch link had an opportunity to close gracefully. I assume
>> that is a case that is intended to be covered, but just pulling the
>> network cable appears to give the same result, and is obviously a less
>> radical test to work with.)
>>
>> Also, a few more notes from the orig trace logs (which you may have
>> already noticed).
>> * As seen in the failed case 01_03d (which actually had trace logging; I
>> mislabeled it in my email description),
>> with the larger sent message size, the underlying qpid client had
>> stopped sending msgs before the heartbeat timeout would have occurred.
>> The heartbeat rate was set to 8.
>> The cable was pulled at about "msg: 30", and the last "trace SENT" was
>> for msg:41 (one msg per sec).
>> Then, 4 more messages were "sent" by the application (via
>> MessageReplayTracker) but there are no traces from the qpid client code.
>> Then, almost 16 seconds after the net disconnect, there is an indication
>> of "Traffic timeout", but there is no action by the client code after
>> this.
>>
>> In the good cases (i.e. case 01_03b with the smaller msg, again net cable
>> pulled at about msg:30), we continue to see the qpid client performing
>> "SENT"s for all messages up to the "Traffic timeout". Then the timeout
>> occurs, which is followed by the "Exception constructed" for the close
>> (which does not happen with the failed case).
>>
>> I'm wondering if an outgoing send buffer filling up is somehow
>> blocking the logic to act on the timeout.
>>
>> I don't know if this is somehow affected by the actual link level getting
>> disconnected (which, if somehow related, might help to explain the
>> differing results when doing a STOP on the broker as opposed to the broker
>> host going down or the net being pulled).
>> Also, I'm wondering if just dropping the network packets will have the
>> same or a different effect (particularly if the condition requires the
>> send path backing up).
>>
>> thanks,
>> Tom
>>
>> On Fri, Jan 13, 2012 at 8:34 AM, Alan Conway <[email protected]> wrote:
>>
>>> On 01/13/2012 09:07 AM, Tom M wrote:
>>>
>>>> Hi Gordon,
>>>> Sorry I didn't get back to you yesterday, but I needed to get an
>>>> opportunity to start a separate broker that I could stop on one of our
>>>> systems.
>>>>
>>>> We had seen these different results with our deployed system, and I
>>>> wanted to make sure that this test client acted the same way.
>>>> (On our deployed system, most of the failover testing had been done
>>>> with a kill on the broker, and we would see the client detect the lost
>>>> connection. But later, when a host died, we found this other condition,
>>>> where the client did not detect it.)
>>>>
>>>> So, I ran the test:
>>>> * started a separate broker (see below)
>>>> * started the same test client, as before, with the larger msg size
>>>> * did a kill on the broker: kill -STOP <pid>
>>>> * saw the same results as you did: the client detected the lost
>>>> connection in about 2x the heartbeat rate
>>>>
>>>> Then, to verify my earlier results, I ran the same exact test, except
>>>> this time pulled the network cable:
>>>> * started a separate broker (same as previous run above)
>>>> * started the same test client, as before, with the larger msg size
>>>> * pulled network cable
>>>> * saw same results as my previous tests: client continued to "send" well
>>>> past when the heartbeat timeout should have occurred (seeing the same
>>>> trace messages) until, about 80 seconds later, it locked up.
>>>>
>>>> Note, this result (the client sending the larger msg missing the
>>>> detection of the lost connection) also happens if the broker host
>>>> abruptly dies (which is how we first detected the problem).
>>>>
>>>> Note: I used the following to start the broker:
>>>> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
>>>> /export/hps/dda/qpidd_x/log/qpidd_x.log --worker-threads 3 --data-dir
>>>> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-dir
>>>> --auth no --config /dev/null
>>>>
>>>> So, please let me know if you can run the test again, but pulling the
>>>> network cable (I'm pulling net between broker and switch, but I'm
>>>> pretty sure I've seen the same when pulling net between switch and
>>>> client).
>>>> thanks,
>>>> Tom
>>>>
>>>
>>> You can simulate a network cable pull by telling iptables to drop
>>> packets. Attached an old script, no warranty.
>>> WARNING: if you've got remote access only to the machine in question be
>>> careful you don't shut yourself out! The attached script only drops
>>> corosync/openais packets, so you can still ssh etc.
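
For anyone wanting to try the iptables approach Alan mentions against the test broker, here is a minimal sketch. It is not the attached script (which is not reproduced in this thread). It assumes the broker is listening on TCP port 18102, as in the qpidd command line quoted above, and that the rules are applied on the broker host by someone with root. The helper names drop_rules and run_rules are made up for illustration. By default it only prints the commands, so they can be reviewed before anything is changed.

```python
import shlex
import subprocess

BROKER_PORT = 18102  # assumption: the port from the qpidd command line quoted above


def drop_rules(port, action="-A"):
    """Build iptables argv lists that silently drop TCP traffic for `port`.

    action is "-A" to append the rules or "-D" to delete them again.
    The rules are written for the broker host (hypothetical helper).
    """
    return [
        # Packets arriving at the broker's listening port.
        ["iptables", action, "INPUT", "-p", "tcp",
         "--dport", str(port), "-j", "DROP"],
        # Packets the broker sends back from that port.
        ["iptables", action, "OUTPUT", "-p", "tcp",
         "--sport", str(port), "-j", "DROP"],
    ]


def run_rules(rules, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in rules:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_rules(drop_rules(BROKER_PORT, "-A"))  # start the simulated cable pull
    run_rules(drop_rules(BROKER_PORT, "-D"))  # restore traffic afterwards
```

The DROP target discards packets without sending any reply, so the client sees silence rather than a reset, which is closer to a pulled cable than a killed broker process. Because only the broker port is matched, ssh to the host keeps working, in line with Alan's warning about not shutting yourself out.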
