Re: problem with qpid heartbeats when sending msgs with size over 1KB

Tom M Fri, 06 Jan 2012 17:37:57 -0800

Attached is code for this test client.
thanks,
Tom

On Fri, Jan 6, 2012 at 7:26 PM, Tom M <[email protected]> wrote:


>
> I’ve created a simpler test client (based on our deployed application) to
> test this problem.
>
> It opens a connection (with connSettings.heartbeat = 8) and session and
> then starts sending messages (once a second), using  
> MessageReplayTracker::send(
> ).  The size of the message is selectable by the user (messages of  6
> bytes or 1600bytes).
>
> After about 30 seconds, I pulled the network cable.
>
> * If sending msgs of only a 6 bytes, a qpid::Exception is thrown
> indicating the connection is closed (about 16 seconds after the cable was
> pulled).
>
> * If sending msgs of 1600 bytes, the client continued sending without
> detecting the connection closed (continued sending for about another
> 88seconds, then apparently stopped due to an out queue full, but never
> reporting an exception during this time).   From previous testing, it
> does eventually timeout (probably from the TCP timeout) and throws an
> exception, but not until 6 to 20mins later.
>
>
>
> I’ve attached the results from 4 runs (for each one, the network
> connection pulled at 30seconds into the xfers).
>
> Two runs sending msgs of only a 6bytes (good cases) and 2 runs sending
> msgs of 1600bytes (problem cases).
>
> For each, I’m including the logged output and pstack dumps.
>
> Note in particular trace logging for the problem case: 01_03d.   It may
> be notable that the Frame SEND’s stopped before the heartbeat timeout
> should have occurred (see details below).
>
>
>
> To help look for differences, I took pstack dumps at several points in
> time during the runs (included sequentially in the pstack.*.log files for
> each run, time is printed just above each block).  Primarily:
>
>    * pstack before connection pulled (20 seconds into xfers)
>
>    * pstack 10 seconds after connection pulled (a few seconds prior to 2x
> heartbeat rate)
>
>    * for cases sending 1600byte msgs, several more pstack’s after the
> point that heartbeat timeout should have occurred (after pulled connection,
> by:  20, 30, 40 seconds)
>
>         Then one more pstack after the sends stopped (due to output full).
>
>
>
> For each of the msg sizes, I did the first run with default logging (to
> show app logging) and then the second with QPID_LOG set to “trace+”
>
>
>
> Details about the 4 runs:
>
> * 01_03a:   sent msg’s size =   6 bytes  (successful case,  app only
> logging)
>
>        about 06:20:29 cable pulled at (about at sent msg:  30)
>
>        06:20:45 see qpid::Exception  Connection closed
>
>        Only app logging:     qpidConnTst.msgSmall.01_03a.log
>
>        pstack.msgSmall.01_03a.log
>
>
>
> * 01_03b:   sent msg’s size =   6 bytes  (successful case,  trace logging)
>
>           06:55:55  last  Heartbeat RECV & SENT
>
>        about 06:56:01  cable pulled at (about at sent msg:  30)
>
>        06:56:17  see qpid::Exception  Connection closed
>
>        Trace  logging:     qpidConnTst.msgSmall.01_03b.log
>
>        pstack.msgSmall.01_03b.log
>
>
>
> * 01_03c:   sent msg’s size = 1600 bytes  (problem case,  app only
> logging)
>
>        about 07:06:15  cable pulled at (about at sent msg:  30)
>
>          Note:  client continued sending without receiving an exception
>
>             Continued to send until  msg:  118  (additional 88 seconds
> after cable pulled),
>
>                 then blocked.
>
>             I terminated more than 4 mins later, still had not received
> exception
>
>
>
> * 01_03d:   sent msg’s size = 1600 bytes  (problem case,  app only
> logging)
>
>           07:18:27 last  Heartbeat RECV & SENT
>
>        about 07:18:28  cable pulled at (about at sent msg:  30)
>
>          Note:  client continued sending 11 more msgs (11 more seconds).
>
>             Then the client app continued to make calls  
> MessageReplayTracker::send(
> )
>
>                  without receiving an exception, but the traces stopped
> for Frame SENT’s
>
>                  NOTE:  this is before the  heartbeat timeout should have
> occurred.
>
>          07:18:43 debug Traffic timeout, but still no exception caught by
> app.
>
>          The app continued to make calls  MessageReplayTracker::send( )
>
>                  without receiving an exception.
>
>             Continued to send until  msg:  118  (additional 88 seconds
> after cable pulled),
>
>                 then app was blocked.
>
>             I terminated more than 4 mins later, still had not received
> exception
>
>
>
>
>
> Some additional notes about this client:
>
> * This client (as with some of the instances in our deployed application)
> is a message sender only.
>
>        * since not intending on receiving messages on this client, we
> did NOT derive this client from qpid::client::MessageListener
>
>    * sending using:   qpid::client::MessageReplayTracker::send( )
>
>    * using async sending (for speed), not calling  Session::sync( ).
>
>         However, I tried including adding  sync’s, and found that, when
> sending the 1600 byte msgs, the client would only block on the sync and
> still not get an exception for the heartbeat timeouts.
>
>     * I added a call to  Connection::isOpen( ) in the loop to see if this
> would trigger the exception, but it did not.
>
> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both
> client and broker hosts) .   In our deployed systems, we will not be able
> to upgrade qpid versions for at least several months.
>
> I will post the test code in another message.
>
> Please let us know what you can find.
>
> Thanks,
>
> Tom
>
>
>
> On Thu, Dec 8, 2011 at 7:37 AM, Tom M <[email protected]> wrote:
>
>> Gordon,
>> thank you for your response.
>>
>> Yes, the test with larger message has failed every time.
>> I will look at sending the trace logs and a thread dump from the client
>> from our system.
>>
>> thanks,
>> Tom Maggio
>> Raytheon Co.
>> Dallas, TX
>> 972/205-4377
>>
>>   On Wed, Dec 7, 2011 at 8:37 AM, Gordon Sim <[email protected]> wrote:
>>
>>> On 12/07/2011 01:55 AM, Tom M wrote:
>>>
>>>> Hello,
>>>>
>>>> we are having a problem with our MRG (qpid) system:
>>>>
>>>> * when sending messages with size of 1600bytes, a connection (used for
>>>> sending from client) does not detect the host connection is lost via
>>>> heartbeat timeout.
>>>>
>>>> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on
>>>> both
>>>> client and broker hosts)
>>>>
>>>> and Ethernet connection (TCP/IP) between hosts
>>>>
>>>>     + for this connection we have: ConnectionSettings
>>>> connectionSettings.heartbeat = 8
>>>>
>>>>     + simulating a system failure by pulling the ethernet cable to the
>>>> broker host
>>>>
>>>>     + the connection close Exception is caught by the client after many
>>>> minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and
>>>> not
>>>> the missed heartbeats.
>>>>
>>>>     + with the same exact application (for our client), if sending
>>>> messages
>>>> of 200bytes, we do get the qpid exception indicating the Connection
>>>> closed
>>>> (catch TransportFailure Exception: connection closed) within 16 seconds.
>>>> For this testing, there were no other changes between the 2 cases, other
>>>> than the size of the messages sent from the client (only expanded the
>>>> size
>>>> of the string in the body of the message) (1 message sent per second in
>>>> both cases).
>>>>
>>>> * is this a known problem with qpid 0.7?
>>>>
>>>
>>> No, i don't think this is a known issue.
>>>
>>>
>>> * is there patch to fix this for qpid 0.7?
>>>>
>>>> * has this problem already been fixed in later releases?
>>>>
>>>> NOTE: we have already deployed qpid 0.7 in our system, and we will not
>>>> be
>>>> able to upgrade to a newer full release for many months.
>>>>
>>>> I'm wondering if the problem is that the connection gets blocked with
>>>> the
>>>> first TCP packet of a multiple packet message, such that the heartbeat
>>>> detection is disabled until the full message is sent. But, if the
>>>> multi-packet message can not complete (since socket is broken), the
>>>> heartbeat logic is held disabled until the multi-packet message can
>>>> complete (which in this case it can not).
>>>>
>>>
>>> There is nothing that directly (intentionally) does anything like this.
>>> However it may be possible that there is some deadlock or liveness issue
>>> that prevents correct function in some cases.
>>>
>>> Is the test always failing with the larger message size? There is
>>> actually no difference in the AMQP framing for a 200 byte v a 1600 byte
>>> message. It may just be that the different timing of the larger write
>>> somehow triggers the issue.
>>>
>>> Can you get trace level logs and a thread dump from the client for a
>>> failed case?
>>>
>>> ------------------------------**------------------------------**
>>> ---------
>>> Apache Qpid - AMQP Messaging Implementation
>>> Project:      http://qpid.apache.org
>>> Use/Interact: 
>>> mailto:users-subscribe@qpid.**apache.org<[email protected]>
>>>
>>>
>>
>

qpidTstConn_src.tar.gz
Description: GNU Zip compressed data

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Reply via email to