On 12/07/2011 01:55 AM, Tom M wrote:
Hello,

We are having a problem with our MRG (qpid) system:

* when sending messages of 1600 bytes, a connection (used for sending
from the client) does not detect via heartbeat timeout that the connection
to the host is lost.

+ we are using the C++ qpid client 0.7 and qpidd 0.7 (Linux 2.6 x86_64 on
both client and broker hosts) and an Ethernet connection (TCP/IP) between
the hosts

     + for this connection we have a ConnectionSettings object with
connectionSettings.heartbeat = 8

     + simulating a system failure by pulling the ethernet cable to the
broker host

     + the connection-closed exception is caught by the client only after many
minutes (6 to 20 min). I'm guessing this is due to the TCP timeout and not
the missed heartbeats.

     + with the exact same application (for our client), if sending messages
of 200 bytes, we do get the qpid exception indicating the connection closed
(catch TransportFailure Exception: connection closed) within 16 seconds.
For this testing, there were no other changes between the two cases other
than the size of the messages sent from the client (we only expanded the
string in the body of the message); one message was sent per second in
both cases.
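
As a rough cross-check of the two timings reported above, here is a minimal
sketch in plain C++ (no qpid calls). It rests on two assumptions that are not
confirmed in this thread: that the connection is declared dead after two
silent heartbeat intervals (which would match 16 seconds for heartbeat = 8),
and that the hosts use stock Linux TCP settings (tcp_retries2 = 15, a 200 ms
minimum and 120 s maximum retransmission timeout). The function names are
hypothetical, and the real retransmission timeout also depends on the
measured round-trip time, so the second figure is only a lower-bound estimate.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed rule: the peer is treated as dead after `missedIntervals` silent
// heartbeat intervals. Two intervals would match the 16 s seen with
// heartbeat = 8; this is an assumption, not taken from the qpid source.
inline std::uint32_t heartbeatTimeoutSeconds(std::uint32_t heartbeatSeconds,
                                             std::uint32_t missedIntervals = 2) {
    return heartbeatSeconds * missedIntervals;
}

// Rough lower bound on how long TCP keeps retransmitting unacknowledged data
// before the socket reports an error. Defaults are the assumed stock Linux
// values: 15 retries, 200 ms initial timeout, 120 s cap, doubling each time.
inline std::uint64_t tcpRetransmitTimeoutMs(unsigned retries = 15,
                                            std::uint64_t initialRtoMs = 200,
                                            std::uint64_t maxRtoMs = 120000) {
    std::uint64_t total = 0;
    std::uint64_t rto = initialRtoMs;
    // One wait after the original transmission plus one per retransmission.
    for (unsigned attempt = 0; attempt <= retries; ++attempt) {
        total += std::min(rto, maxRtoMs);
        rto = std::min(rto * 2, maxRtoMs);
    }
    return total;
}
```

Under these assumptions heartbeatTimeoutSeconds(8) gives 16 and
tcpRetransmitTimeoutMs() gives 924600 ms (about 15.4 minutes), which falls
inside the 6 to 20 minute range observed for the 1600-byte case.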

* is this a known problem with qpid 0.7?

No, I don't think this is a known issue.

* is there a patch to fix this for qpid 0.7?

* has this problem already been fixed in later releases?

NOTE: we have already deployed qpid 0.7 in our system, and we will not be
able to upgrade to a newer full release for many months.

I'm wondering if the problem is that the connection gets blocked with the
first TCP packet of a multiple packet message, such that the heartbeat
detection is disabled until the full message is sent. But, if the
multi-packet message can not complete (since socket is broken), the
heartbeat logic is held disabled until the multi-packet message can
complete (which in this case it can not).

There is nothing that directly (intentionally) does anything like this. However it may be possible that there is some deadlock or liveness issue that prevents correct function in some cases.

Is the test always failing with the larger message size? There is actually no difference in the AMQP framing for a 200-byte vs. a 1600-byte message. It may just be that the different timing of the larger write somehow triggers the issue.
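
One place where the two sizes could differ below the AMQP layer is the number of TCP segments per message. The sketch below is plain C++ with a hypothetical helper name, and it assumes a 1460-byte maximum segment size (the usual value for a 1500-byte Ethernet MTU with IPv4 and no TCP options). It ignores AMQP framing overhead and any coalescing of writes by the TCP stack, so it only illustrates the arithmetic.

```cpp
#include <cstddef>

// Number of TCP segments needed to carry `payloadBytes` when each segment
// holds at most `mssBytes`. The 1460-byte default is an assumption about the
// link (1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers).
inline std::size_t tcpSegmentsFor(std::size_t payloadBytes,
                                  std::size_t mssBytes = 1460) {
    if (payloadBytes == 0 || mssBytes == 0) {
        return 0;  // nothing to send, or no usable segment size
    }
    // Integer ceiling division.
    return (payloadBytes + mssBytes - 1) / mssBytes;
}
```

With these assumptions a 200-byte body fits in one segment while a 1600-byte body needs two, so the larger write may be split on the wire even though the AMQP framing is the same.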

Can you get trace level logs and a thread dump from the client for a failed case?

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]