Folks ... since I hate nothing more than coming across a post without a solution, I thought I'd post what I did. After discovering the same problem on Solaris as on Linux, I decided that TCP keepalive might be the answer.

ActiveMQ does appear to allow you to set this:

      http://activemq.apache.org/tcp-transport-reference.html

However, my attempts using STOMP did not work:

<transportConnector name="stomp" uri="stomp://mmq1:61613?keepAlive"/>

A strace of the JVM shows that the socket option never gets set. AMQ devs, should that have worked?

Anyway, I decided to use LD_PRELOAD to enable keepalive. I downloaded this project:

    http://libkeepalive.sourceforge.net/

and changed it to interpose accept() as well; it worked like a charm. The message gets re-dispatched according to whatever keepalive parameters I have set. Lovely. I've submitted my changes to the libkeepalive project owner.
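
In case anyone wants to roll their own, the change boils down to something like this sketch (not the actual libkeepalive source; the real library also lets you tune the probe timers through environment variables):

    /* keepalive_accept.c -- sketch of an LD_PRELOAD shim that turns on
     * SO_KEEPALIVE for every accepted connection.  Build and load with:
     *   gcc -shared -fPIC -o libkeepalive_accept.so keepalive_accept.c -ldl
     *   LD_PRELOAD=/path/to/libkeepalive_accept.so <broker start command>
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/socket.h>

    int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen)
    {
        static int (*real_accept)(int, struct sockaddr *, socklen_t *);
        if (!real_accept)
            real_accept = (int (*)(int, struct sockaddr *, socklen_t *))
                          dlsym(RTLD_NEXT, "accept");

        int fd = real_accept(sockfd, addr, addrlen);
        if (fd >= 0) {
            int on = 1;
            /* the kernel will now probe idle connections and eventually
             * reset them when the peer host has vanished */
            setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
        }
        return fd;
    }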

Cheers,

Josh

On 04/14/2010 11:58 AM, Josh Carlson wrote:
Hi Dejan,

I don't think it would be practical or correct for us to do that client-side. The thing that gets me, though, is that killing the client *process* causes the TCP connection to get closed on the other end, but killing the client *host* leaves the TCP connection established on the other end. Isn't that a kernel bug? Shouldn't it behave the same way in both circumstances?

Cheers

Josh

On 04/14/2010 11:48 AM, Dejan Bosanac wrote:
Hi Josh,

that's the job of the inactivity monitor when using OpenWire. Unfortunately, Stomp doesn't support that in version 1.0; it is something we want to add in the next version of the spec. Maybe implementing something like that at the application level would help in your case?
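
Something along these lines, purely as an illustration (the destination name and interval are made up):

    /* Hypothetical sketch of an application-level heartbeat for Stomp 1.0,
     * which has no built-in heart-beating: the consumer periodically SENDs
     * an empty message to a well-known destination, and the other side
     * treats prolonged silence as a dead client.  'fd' is assumed to be an
     * already CONNECTed Stomp socket. */
    #include <unistd.h>

    static const char heartbeat_frame[] =
        "SEND\n"
        "destination:/queue/client.heartbeat\n"
        "\n";   /* empty body; the string's trailing NUL terminates the frame */

    void heartbeat_loop(int fd, unsigned interval_secs)
    {
        /* sizeof includes the trailing NUL, which Stomp uses as the
         * frame terminator */
        while (write(fd, heartbeat_frame, sizeof(heartbeat_frame)) ==
               (ssize_t) sizeof(heartbeat_frame))
            sleep(interval_secs);
    }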

Cheers
--
Dejan Bosanac - http://twitter.com/dejanb

Open Source Integration - http://fusesource.com/
ActiveMQ in Action - http://www.manning.com/snyder/
Blog - http://www.nighttale.net


On Wed, Apr 14, 2010 at 5:41 PM, Josh Carlson <jcarl...@e-dialog.com> wrote:

    Hmm. If a timeout were the solution to this problem, how would
    you be able to tell the difference between something being wrong
    and the client just being slow?

    I did an strace on the server and discovered how the timeout is
    being used: as a parameter to poll.

    6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
    6805  10:31:15 <... poll resumed> )     = 1 ([{fd=94, revents=POLLIN}])
    6805  10:31:15 recvfrom(94, "CONNECT\npasscode:...."..., 8192, 0, NULL, NULL) = 39
    6805  10:31:15 sendto(94, "CONNECTED\nsession:ID:mmq1-40144-"..., 53, 0, NULL, 0) = 53
    6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000) = 1 ([{fd=94, revents=POLLIN}])
    6805  10:31:15 recvfrom(94, "SUBSCRIBE\nactivemq.prefetchSize:"..., 8192, 0, NULL, NULL) = 138
    6805  10:31:15 sendto(94, "RECEIPT\nreceipt-id:39ef0e071a549"..., 55, 0, NULL, 0) = 55
    6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
    6805  10:32:15 <... poll resumed> )     = 0 (Timeout)
    6805  10:32:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
    6805  10:33:15 <... poll resumed> )     = 0 (Timeout)
    6805  10:33:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
    6805  10:34:15 <... poll resumed> )     = 0 (Timeout)

    In the output above I stripped lines that were not operations
    directly on the socket. The poll Timeouts continue on ... with
    nothing in between.
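
    In other words, the transport thread appears to be doing something
    like the following (my reconstruction from the strace, not the
    actual TcpTransport code):

    /* Reconstruction of how soTimeout ends up being used: as the poll
     * timeout.  A return of 0 just means "no data within 60s" -- it says
     * nothing about whether the peer is alive -- so a connection whose
     * remote host vanished without a FIN/RST stays ESTABLISHED and the
     * loop spins on timeouts forever unless something (e.g. TCP
     * keepalive) actively probes it. */
    #include <poll.h>
    #include <unistd.h>

    ssize_t read_command(int fd, char *buf, size_t len)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLERR };

        for (;;) {
            int n = poll(&pfd, 1, 60000);    /* soTimeout used here */
            if (n > 0)
                return read(fd, buf, len);   /* data, EOF (0) or error */
            if (n == 0)
                continue;                    /* timeout: just poll again */
            return -1;                       /* poll error */
        }
    }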

    [r...@mmq1 tmp]# lsof -p 6755 | grep mmq1
    java 6755 root 85u IPv6 1036912 TCP mmq1.eng.e-dialog.com:61613 (LISTEN)
    java 6755 root 92u IPv6 1038039 TCP mmq1.eng.e-dialog.com:61613->10.0.13.230:46542 (ESTABLISHED)
    java 6755 root 94u IPv6 1036997 TCP mmq1.eng.e-dialog.com:61613->mmd2.eng.e-dialog.com:41743 (ESTABLISHED)

    The connection to mmd2 is the one whose host is gone. The one to
    10.0.13.230 is up and active. When I kill -9 the process on
    10.0.13.230, I see this in the logs:

    2010-04-13 17:13:55,322 | DEBUG | Transport failed: java.io.EOFException | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///10.0.13.230:45463
    java.io.EOFException
            at java.io.DataInputStream.readByte(Unknown Source)
            at org.apache.activemq.transport.stomp.StompWireFormat.readLine(StompWireFormat.java:186)
            at org.apache.activemq.transport.stomp.StompWireFormat.unmarshal(StompWireFormat.java:94)
            at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:211)
            at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:203)
            at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:186)
            at java.lang.Thread.run(Unknown Source)
    2010-04-13 17:13:55,325 | DEBUG | Stopping connection: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
    2010-04-13 17:13:55,325 | DEBUG | Stopping transport tcp:///10.0.13.230:45463 | org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task
    2010-04-13 17:13:55,326 | DEBUG | Stopped transport: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
    2010-04-13 17:13:55,327 | DEBUG | Cleaning up connection resources: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
    2010-04-13 17:13:55,327 | DEBUG | remove connection id: ID:mmq1-58415-1271193024658-2:3 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
    2010-04-13 17:13:55,328 | DEBUG | masterb1 removing consumer: ID:mmq1-58415-1271193024658-2:3:-1:1 for destination: queue://Producer/TESTING/weight/three/ | org.apache.activemq.broker.region.AbstractRegion | ActiveMQ Task

    Which is what I want to happen when the host goes down.

    It seems to me that something should be noticing that the
    connection is really gone. Maybe this is more of a kernel issue.
    I would think that when the poll times out, it would trigger the
    connection to move out of the ESTABLISHED state and get closed.

    We are using Linux, kernel version 2.6.18, but I've seen this
    same issue on a range of different 2.6 versions.
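
    The only mechanism I know of that makes the kernel notice a
    vanished peer on an otherwise idle connection is TCP keepalive;
    roughly, per socket on Linux (the timer values below are just
    examples):

    /* Sketch of enabling and tuning keepalive probes per socket on Linux.
     * Once enabled, the kernel probes the idle connection and tears it
     * down when the peer stops answering, which finally moves it out of
     * ESTABLISHED. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int enable_keepalive(int fd)
    {
        int on = 1, idle = 60, intvl = 10, cnt = 5;

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
        /* start probing after 60s idle, probe every 10s, declare the
         * peer dead after 5 unanswered probes */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
        return 0;
    }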

    -Josh



    On 04/14/2010 09:38 AM, Josh Carlson wrote:

        Thanks Gary for the, as usual, helpful information.

        It looks like the broker may be suffering from exactly the same
        problem we encountered when implementing client-side failover.
        Namely, when the master broker went down, a subsequent read on
        the socket by the client would hang (well, actually take a very
        long time to fail/time out). In that case our TCP connection was
        ESTABLISHED, and looking at the broker I see the same thing after
        the client host goes away (the connection is ESTABLISHED). We
        fixed this issue in our client by setting the socket option
        SO_RCVTIMEO on the connection to the broker.
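
        Concretely, that client-side fix amounts to something like this
        sketch:

        /* Receive timeout on the socket to the broker, so a blocked read()
         * fails with EAGAIN/EWOULDBLOCK after 'seconds' instead of hanging
         * until TCP itself finally gives up on the dead peer. */
        #include <sys/socket.h>
        #include <sys/time.h>

        int set_recv_timeout(int fd, long seconds)
        {
            struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
            return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
        }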

        I noted that the broker appears to do the same thing with the
        TCP transport option soTimeout. It looks like when this is set
        it winds up as a call to java.net.Socket.setSoTimeout when the
        socket is getting initialized. I have not done any socket
        programming in Java, but my assumption is that SO_TIMEOUT maps
        to both SO_RCVTIMEO and SO_SNDTIMEO in the C world.

        I was hopeful about this option, but when I set it in my
        transport connector:

        <transportConnector name="stomp"
        uri="stomp://mmq1:61613?soTimeout=60000"/>

        the timeout does not occur. I actually ran my test case about
        15 hours ago, and the broker still has an ESTABLISHED connection
        to the dead client, with a message dispatched to it.

        Am I misunderstanding what soTimeout is for? I can see in
        org.apache.activemq.transport.tcp.TcpTransport.initialiseSocket
        that setSoTimeout is getting called unconditionally. So what I'm
        wondering is whether it is actually calling it with a 0 value
        despite the way I set up my transport connector. Setting this to
        0 would explain why it apparently never times out, whereas in
        our client case it eventually did time out (because we were not
        setting the option at all before).




        On 04/14/2010 05:15 AM, Gary Tully wrote:

            The re-dispatch is triggered by the TCP connection dying;
            netstat can help with the diagnosis here. Check the
            connection state of the broker port after the client host
            is rebooted. If the connection is still active, possibly in
            a TIME_WAIT state, you may need to configure some additional
            timeout options on the broker side.

            On 13 April 2010 19:43, Josh Carlson <jcarl...@e-dialog.com> wrote:

                I am using client acknowledgements with a prefetch size
                of 1 with no message expiration policy. When a consumer
                subscribes to a queue I can see that the message gets
                dispatched correctly. If the process gets killed before
                retrieving and acknowledging the message, I see the
                message getting re-dispatched (correctly). I expected
                the same behaviour if the host running the process gets
                rebooted or crashes. However, after a reboot I can see
                that the message is stuck in the dispatched state to the
                consumer that is long gone. Is there a way I can get
                messages re-dispatched when a host running consumer
                processes gets rebooted? How does it detect the case
                when a process dies (even with SIGKILL)?

                I did notice that if I increase my prefetch size and
                enqueue another message after the reboot, ActiveMQ will
                re-dispatch the original message. However, with a
                prefetch size of one the message never seems to get
                re-dispatched.




-- http://blog.garytully.com

            Open Source Integration
            http://fusesource.com

