Attached is code for this test client. thanks, Tom On Fri, Jan 6, 2012 at 7:26 PM, Tom M <[email protected]> wrote:
> > I’ve created a simpler test client (based on our deployed application) to > test this problem. > > It opens a connection (with connSettings.heartbeat = 8) and session and > then starts sending messages (once a second), using > MessageReplayTracker::send( > ). The size of the message is selectable by the user (messages of 6 > bytes or 1600bytes). > > After about 30 seconds, I pulled the network cable. > > * If sending msgs of only a 6 bytes, a qpid::Exception is thrown > indicating the connection is closed (about 16 seconds after the cable was > pulled). > > * If sending msgs of 1600 bytes, the client continued sending without > detecting the connection closed (continued sending for about another > 88seconds, then apparently stopped due to an out queue full, but never > reporting an exception during this time). From previous testing, it > does eventually timeout (probably from the TCP timeout) and throws an > exception, but not until 6 to 20mins later. > > > > I’ve attached the results from 4 runs (for each one, the network > connection pulled at 30seconds into the xfers). > > Two runs sending msgs of only a 6bytes (good cases) and 2 runs sending > msgs of 1600bytes (problem cases). > > For each, I’m including the logged output and pstack dumps. > > Note in particular trace logging for the problem case: 01_03d. It may > be notable that the Frame SEND’s stopped before the heartbeat timeout > should have occurred (see details below). > > > > To help look for differences, I took pstack dumps at several points in > time during the runs (included sequentially in the pstack.*.log files for > each run, time is printed just above each block). Primarily: > > * pstack before connection pulled (20 seconds into xfers) > > * pstack 10 seconds after connection pulled (a few seconds prior to 2x > heartbeat rate) > > * for cases sending 1600byte msgs, several more pstack’s after the > point that heartbeat timeout should have occurred (after pulled connection, > by: 20, 30, 40 seconds) > > Then one more pstack after the sends stopped (due to output full). > > > > For each of the msg sizes, I did the first run with default logging (to > show app logging) and then the second with QPID_LOG set to “trace+” > > > > Details about the 4 runs: > > * 01_03a: sent msg’s size = 6 bytes (successful case, app only > logging) > > about 06:20:29 cable pulled at (about at sent msg: 30) > > 06:20:45 see qpid::Exception Connection closed > > Only app logging: qpidConnTst.msgSmall.01_03a.log > > pstack.msgSmall.01_03a.log > > > > * 01_03b: sent msg’s size = 6 bytes (successful case, trace logging) > > 06:55:55 last Heartbeat RECV & SENT > > about 06:56:01 cable pulled at (about at sent msg: 30) > > 06:56:17 see qpid::Exception Connection closed > > Trace logging: qpidConnTst.msgSmall.01_03b.log > > pstack.msgSmall.01_03b.log > > > > * 01_03c: sent msg’s size = 1600 bytes (problem case, app only > logging) > > about 07:06:15 cable pulled at (about at sent msg: 30) > > Note: client continued sending without receiving an exception > > Continued to send until msg: 118 (additional 88 seconds > after cable pulled), > > then blocked. > > I terminated more than 4 mins later, still had not received > exception > > > > * 01_03d: sent msg’s size = 1600 bytes (problem case, app only > logging) > > 07:18:27 last Heartbeat RECV & SENT > > about 07:18:28 cable pulled at (about at sent msg: 30) > > Note: client continued sending 11 more msgs (11 more seconds). > > Then the client app continued to make calls > MessageReplayTracker::send( > ) > > without receiving an exception, but the traces stopped > for Frame SENT’s > > NOTE: this is before the heartbeat timeout should have > occurred. > > 07:18:43 debug Traffic timeout, but still no exception caught by > app. > > The app continued to make calls MessageReplayTracker::send( ) > > without receiving an exception. > > Continued to send until msg: 118 (additional 88 seconds > after cable pulled), > > then app was blocked. > > I terminated more than 4 mins later, still had not received > exception > > > > > > Some additional notes about this client: > > * This client (as with some of the instances in our deployed application) > is a message sender only. > > * since not intending on receiving messages on this client, we > did NOT derive this client from qpid::client::MessageListener > > * sending using: qpid::client::MessageReplayTracker::send( ) > > * using async sending (for speed), not calling Session::sync( ). > > However, I tried including adding sync’s, and found that, when > sending the 1600 byte msgs, the client would only block on the sync and > still not get an exception for the heartbeat timeouts. > > * I added a call to Connection::isOpen( ) in the loop to see if this > would trigger the exception, but it did not. > > + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both > client and broker hosts) . In our deployed systems, we will not be able > to upgrade qpid versions for at least several months. > > I will post the test code in another message. > > Please let us know what you can find. > > Thanks, > > Tom > > > > On Thu, Dec 8, 2011 at 7:37 AM, Tom M <[email protected]> wrote: > >> Gordon, >> thank you for your response. >> >> Yes, the test with larger message has failed every time. >> I will look at sending the trace logs and a thread dump from the client >> from our system. >> >> thanks, >> Tom Maggio >> Raytheon Co. >> Dallas, TX >> 972/205-4377 >> >> On Wed, Dec 7, 2011 at 8:37 AM, Gordon Sim <[email protected]> wrote: >> >>> On 12/07/2011 01:55 AM, Tom M wrote: >>> >>>> Hello, >>>> >>>> we are having a problem with our MRG (qpid) system: >>>> >>>> * when sending messages with size of 1600bytes, a connection (used for >>>> sending from client) does not detect the host connection is lost via >>>> heartbeat timeout. >>>> >>>> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on >>>> both >>>> client and broker hosts) >>>> >>>> and Ethernet connection (TCP/IP) between hosts >>>> >>>> + for this connection we have: ConnectionSettings >>>> connectionSettings.heartbeat = 8 >>>> >>>> + simulating a system failure by pulling the ethernet cable to the >>>> broker host >>>> >>>> + the connection close Exception is caught by the client after many >>>> minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and >>>> not >>>> the missed heartbeats. >>>> >>>> + with the same exact application (for our client), if sending >>>> messages >>>> of 200bytes, we do get the qpid exception indicating the Connection >>>> closed >>>> (catch TransportFailure Exception: connection closed) within 16 seconds. >>>> For this testing, there were no other changes between the 2 cases, other >>>> than the size of the messages sent from the client (only expanded the >>>> size >>>> of the string in the body of the message) (1 message sent per second in >>>> both cases). >>>> >>>> * is this a known problem with qpid 0.7? >>>> >>> >>> No, i don't think this is a known issue. >>> >>> >>> * is there patch to fix this for qpid 0.7? >>>> >>>> * has this problem already been fixed in later releases? >>>> >>>> NOTE: we have already deployed qpid 0.7 in our system, and we will not >>>> be >>>> able to upgrade to a newer full release for many months. >>>> >>>> I'm wondering if the problem is that the connection gets blocked with >>>> the >>>> first TCP packet of a multiple packet message, such that the heartbeat >>>> detection is disabled until the full message is sent. But, if the >>>> multi-packet message can not complete (since socket is broken), the >>>> heartbeat logic is held disabled until the multi-packet message can >>>> complete (which in this case it can not). >>>> >>> >>> There is nothing that directly (intentionally) does anything like this. >>> However it may be possible that there is some deadlock or liveness issue >>> that prevents correct function in some cases. >>> >>> Is the test always failing with the larger message size? There is >>> actually no difference in the AMQP framing for a 200 byte v a 1600 byte >>> message. It may just be that the different timing of the larger write >>> somehow triggers the issue. >>> >>> Can you get trace level logs and a thread dump from the client for a >>> failed case? >>> >>> ------------------------------**------------------------------** >>> --------- >>> Apache Qpid - AMQP Messaging Implementation >>> Project: http://qpid.apache.org >>> Use/Interact: >>> mailto:users-subscribe@qpid.**apache.org<[email protected]> >>> >>> >> >
qpidTstConn_src.tar.gz
Description: GNU Zip compressed data
--------------------------------------------------------------------- Apache Qpid - AMQP Messaging Implementation Project: http://qpid.apache.org Use/Interact: mailto:[email protected]
