It's certainly a pain to diagnose.

On Thu, Feb 5, 2015 at 3:19 PM, Kris G. Lindgren <klindg...@godaddy.com> wrote:
> Is Mirantis going to have someone at the ops mid-cycle? We were talking about this in the operators channel today, and it seemed like pretty much everyone who was active has problems with RabbitMQ, whether from maintenance, failovers, or transient issues, and has to "restart the world" after RabbitMQ hiccups to get things to recover. I am thinking that if the issue is this prevalent, it would be nice for people who have either "figured it out" or have something that is working to discuss their setups. We noticed that Mirantis has a number of patches to oslo.messaging that fix RabbitMQ-specific problems, so I was hoping someone could come and talk about what Mirantis has done there to make things better, whether it is "there yet", and if not, what still needs to be done.
>
> We use clustered RabbitMQ behind a load balancer, and honestly this config is "better" on paper, but in practice it is nothing short of a nightmare. Any maintenance done on RabbitMQ (restarts, patching, etc.) or on the load balancer seems to cause clients to not notice that they are no longer correctly connected to the RabbitMQ server, and they will sit happily, doing nothing, until they are restarted. We had similar problems when listing all of the RabbitMQ servers in the configuration as well. So far my experience has been that any maintenance touching RabbitMQ requires a restart of every service that communicates over RPC, to avoid hard-to-troubleshoot (i.e. silent) RPC errors.
>
> In my experience RabbitMQ is pretty much the #1 cause of issues in our environment, and I think other operators would agree. Anything that makes Rabbit + OpenStack more stable would be very welcome.
>
> ____________________________________________
>
> Kris Lindgren
> Senior Linux Systems Engineer
> GoDaddy, LLC.
>
>
> On 1/20/15, 8:33 AM, "Andrew Woodward" <xar...@gmail.com> wrote:
>
> >So this is exactly what we (@mirantis) ran into while working on the HA story in Fuel / Mirantis OpenStack.
> >
> >The short message is that without heartbeat keepalives, Rabbit is unable to properly keep track of partially open connections, resulting in consumers (not senders) believing that they have a live connection to Rabbit when in fact they don't.
> >
> >Summary of the parts needed for Rabbit HA:
> >* Rabbit heartbeats (https://review.openstack.org/#/c/146047/); the oslo.messaging team is working to merge this and is well aware it is a critical need for Rabbit HA (a config sketch follows below).
> >* rabbit_hosts with a list of all Rabbit nodes. haproxy should be avoided except for services that don't support rabbit_hosts as a list of servers; there is further work needed to make haproxy behave properly in HA.
> >* Consumer cancel notifications (CCN).
> >* RabbitMQ greater than 3.3.0.
> >
> >Optional:
> >* Rip failed nodes out of the Mnesia DB (a command sketch follows below). We found that Rabbit's node-down discovery was slower than we wanted (minutes), and we can force an election sooner by ripping the failed node out of Mnesia once Pacemaker tells us the node is down; we have a master/slave-type mechanism in our Pacemaker script to perform this.
> >
> >The long message on Rabbit connections:
> >
> >Through a quite long process we found that, due to the way Rabbit uses connections from Erlang, it won't close connections itself; instead, Rabbit (can) send a consumer cancel notification. The consumer, upon receiving this message, is supposed to hang up and reconnect. Otherwise the connection is only reaped by the Linux kernel when the TCP connection timeout is reached (2 hours).
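> >That 2-hour figure is just the kernel's default TCP keepalive window (assuming keepalives are enabled on the socket at all). As a quick illustration, the stock Linux defaults are visible with:
> >
> >    sysctl net.ipv4.tcp_keepalive_time      # 7200 seconds (2 hours) of idle before probing starts
> >    sysctl net.ipv4.tcp_keepalive_intvl     # 75 seconds between probes
> >    sysctl net.ipv4.tcp_keepalive_probes    # 9 failed probes before the socket is reset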
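> >As for the config sketch promised above, assuming the heartbeat review merges with the option names it currently carries (heartbeat_timeout_threshold and heartbeat_rate; treat those names, and crudini being installed, as assumptions), the nova.conf pieces look roughly like:
> >
> >    # juno-era releases keep these in [DEFAULT]; later releases move them
> >    # to the [oslo_messaging_rabbit] section
> >    crudini --set /etc/nova/nova.conf DEFAULT rabbit_hosts rabbit01:5672,rabbit02:5672,rabbit03:5672
> >    crudini --set /etc/nova/nova.conf DEFAULT rabbit_ha_queues true
> >    # declare a connection dead after 60s, probing twice per window
> >    crudini --set /etc/nova/nova.conf DEFAULT heartbeat_timeout_threshold 60
> >    crudini --set /etc/nova/nova.conf DEFAULT heartbeat_rate 2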
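> >And the "rip it out of Mnesia" step is, in essence, the following (the node name is hypothetical, and in our setup the Pacemaker agent runs this rather than a human):
> >
> >    # on any surviving cluster member, once Pacemaker declares rabbit02 dead
> >    rabbitmqctl forget_cluster_node rabbit@rabbit02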
> >Publishers are less of a problem: they notice the next time they attempt to send a message to the queue (because it is not acknowledged) and tend to hang up and reconnect on their own.
> >
> >What you will observe after removing a Rabbit node is that, on a compute node, roughly a third of the Rabbit connections re-establish to the remaining Rabbit node(s), while the others leave sockets open to the down server (observable with netstat, strace, and lsof).
> >
> >Fixes that don't work well:
> >* Turning down TCP timeouts (LD_PRELOAD or system-wide). While this shortens the 2-hour recovery, anything lower than 15 minutes leads to frequent false disconnects and tends towards bad behavior.
> >* Rabbit behind haproxy. This further masks the partial-connection problem. Although we stopped using it, it might be better now with heartbeats enabled.
> >* A script that checks for partial connections on the Rabbit server and forcibly closes them. A partial solution, and actually the approach that got the job done best besides heartbeats, but it sometimes killed innocent connections for us.
> >
> >Heartbeats fix this by running a ping/ack in a separate channel and thread. This gives the consumer a response from Rabbit that confirms the connections have not gone away via stale sockets. Combined with CCN, it works as expected under multiple failure conditions, and the Rabbit consumers can be healthy within one minute.
> >
> >On Mon, Jan 19, 2015 at 2:55 PM, Gustavo Randich <gustavo.rand...@gmail.com> wrote:
> >> In the meantime, I'm using this horrendous script inside compute nodes to check for rabbitmq connectivity. It uses the 'set_host_enabled' RPC call, which in my case is innocuous.
> >
> >This will still result in partial connections if you don't do CCN.
> >
> >> #!/bin/bash
> >> # unique marker so we can find our own test message in the compute log
> >> UUID=$(cat /proc/sys/kernel/random/uuid)
> >> RABBIT=$(grep -Po '(?<=rabbit_host = ).+' /etc/nova/nova.conf)
> >> HOSTX=$(hostname)
> >> # publish a harmless set_host_enabled RPC call onto this node's queue
> >> python -c "
> >> import pika
> >> connection = pika.BlockingConnection(pika.ConnectionParameters(\"$RABBIT\"))
> >> channel = connection.channel()
> >> channel.basic_publish(exchange='nova', routing_key=\"compute.$HOSTX\",
> >>     properties=pika.BasicProperties(content_type='application/json'),
> >>     body='{ \"version\": \"3.0\", \"_context_request_id\": \"$UUID\", \\
> >>     \"_context_roles\": [\"KeystoneAdmin\", \"KeystoneServiceAdmin\", \"admin\"], \\
> >>     \"_context_user_id\": \"XXX\", \\
> >>     \"_context_project_id\": \"XXX\", \\
> >>     \"method\": \"set_host_enabled\", \\
> >>     \"args\": {\"enabled\": true} }')
> >> connection.close()"
> >> # if nova-compute consumed the message, its request id appears in the log
> >> sleep 2
> >> tail -1000 /var/log/nova/nova-compute.log | grep -q $UUID || { echo "WARNING: nova-compute not consuming RabbitMQ messages. Last message: $UUID"; exit 1; }
> >> echo "OK"
> >>
> >> On Thu, Jan 15, 2015 at 9:48 PM, Sam Morrison <sorri...@gmail.com> wrote:
> >>>
> >>> We've had a lot of issues with Icehouse related to RabbitMQ. Basically the change from openstack.rpc to oslo.messaging broke things. These are now fixed in oslo.messaging version 1.5.1; there is still an issue with heartbeats, and that patch is making its way through the review process now.
> >>>
> >>> https://review.openstack.org/#/c/146047/
> >>>
> >>> Cheers,
> >>> Sam
> >>>
> >>> On 16 Jan 2015, at 10:55 am, sridhar basam <sridhar.ba...@gmail.com> wrote:
> >>>
> >>> If you are using HA queues, use a version of RabbitMQ greater than 3.3.0.
> >>> There was a change in that version where consumption on queues was automatically re-enabled when a master election for a queue happened. Previous versions only informed clients that they had to re-consume on a queue; it was the client's responsibility to restart consumption.
> >>>
> >>> Make sure you enable TCP keepalives, tuned to a low enough value, in case you have a firewall device between your Rabbit server and its consumers.
> >>>
> >>> Monitor consumers on your Rabbit infrastructure using 'rabbitmqctl list_queues name messages consumers'. The consumer count on fanout queues will depend on the number of services of each type you have in your environment.
> >>>
> >>> Sri
> >>>
> >>> On Jan 15, 2015 6:27 PM, "Michael Dorman" <mdor...@godaddy.com> wrote:
> >>>>
> >>>> Here is the bug I've been tracking related to this for a while. I haven't really kept up to speed with it, so I don't know the current status.
> >>>>
> >>>> https://bugs.launchpad.net/nova/+bug/856764
> >>>>
> >>>> From: Kris Lindgren <klindg...@godaddy.com>
> >>>> Date: Thursday, January 15, 2015 at 12:10 PM
> >>>> To: Gustavo Randich <gustavo.rand...@gmail.com>, OpenStack Operators <openstack-operators@lists.openstack.org>
> >>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq connectivity
> >>>>
> >>>> During the Atlanta ops meeting this topic came up, and I specifically mentioned adding a "no-op" or healthcheck ping to the rabbitmq machinery in both nova and neutron. The devs in the room looked at me like I was crazy, but it was exactly so that we could catch the issues you describe. I am also interested if anyone knows of a lightweight call that could be used to verify/confirm rabbitmq connectivity. I haven't been able to devote time to dig into it, mainly because when one client is having issues, you will notice other clients having similar silent errors, and a restart of all the things is the easiest fix, for us at least.
> >>>> ____________________________________________
> >>>>
> >>>> Kris Lindgren
> >>>> Senior Linux Systems Engineer
> >>>> GoDaddy, LLC.
> >>>>
> >>>> From: Gustavo Randich <gustavo.rand...@gmail.com>
> >>>> Date: Thursday, January 15, 2015 at 11:53 AM
> >>>> To: "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
> >>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq connectivity
> >>>>
> >>>> Just to add one more background scenario: we also had similar problems trying to load balance rabbitmq via an F5 BIG-IP LTM, and for that reason we don't use it now. Our installation is a single rabbitmq instance with no intermediaries (albeit network switches). We use Folsom and Icehouse, with the problem perceived more on Icehouse nodes.
> >>>>
> >>>> We are already monitoring message queue size, but we would like to pinpoint in semi-realtime the specific hosts/racks/network paths experiencing the "stale connection", before a user complains about an operation being stuck, and even on hosts with no pending operations that are already "disconnected"; that way we could also diagnose possible network causes and avoid massive service restarting. (One rough way to spot these is sketched below.)
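> >>>> As a sketch of what I mean (untested; it assumes the default AMQP port 5672 and no TLS in between):
> >>>>
> >>>>     # on the compute node: sockets this host believes are established to rabbit
> >>>>     ss -tn state established '( dport = :5672 )'
> >>>>
> >>>>     # on the rabbit server: connections rabbit itself is still tracking
> >>>>     rabbitmqctl list_connections peer_host peer_port state
> >>>>
> >>>>     # a socket present on the compute side with no matching peer in the
> >>>>     # rabbitmqctl output is one of the stale, half-open connections
> >>>>     # described earlier in the thread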
> >>>> So, for now, if someone knows of a cheap and quick OpenStack operation that triggers a message interchange between rabbitmq and nova-compute, plus a way of checking the result, that would be great.
> >>>>
> >>>> On Thu, Jan 15, 2015 at 1:45 PM, Kris G. Lindgren <klindg...@godaddy.com> wrote:
> >>>>>
> >>>>> We did have an issue using Celery on an internal application that we wrote, but I believe it was fixed after much failover testing and code changes. We also use Logstash via rabbitmq and haven't noticed any issues there either.
> >>>>>
> >>>>> So this seems to be just OpenStack/Oslo related.
> >>>>>
> >>>>> We have tried a number of different configurations, and all of them had their issues. We started out listing all the members of the cluster on the rabbit_hosts line. This worked most of the time without issue, until we would restart one of the servers; then it seemed like the clients wouldn't figure out they were disconnected and reconnect to the next host.
> >>>>>
> >>>>> In an attempt to solve that, we moved to using haproxy to present a VIP that we configured on the rabbit_hosts line. This created issues with long-lived connections being disconnected, and a bunch of other issues. In our production environment we moved to load-balanced rabbitmq, but using a real load balancer, and we don't have the weird disconnect issues. However, any time we reboot or take down a rabbitmq host, pull a member from the cluster, or there is a network disruption, we have issues.
> >>>>>
> >>>>> Thinking the best course of action is to move rabbitmq onto its own box and leave it alone.
> >>>>>
> >>>>> Does anyone have a rabbitmq setup that works well and doesn't have random issues when pulling nodes for maintenance?
> >>>>> ____________________________________________
> >>>>>
> >>>>> Kris Lindgren
> >>>>> Senior Linux Systems Engineer
> >>>>> GoDaddy, LLC.
> >>>>>
> >>>>> From: Joe Topjian <j...@topjian.net>
> >>>>> Date: Thursday, January 15, 2015 at 9:29 AM
> >>>>> To: "Kris G. Lindgren" <klindg...@godaddy.com>
> >>>>> Cc: "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
> >>>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq connectivity
> >>>>>
> >>>>> Hi Kris,
> >>>>>
> >>>>>> Our experience is pretty much the same on anything that is using rabbitmq - not just nova-compute.
> >>>>>
> >>>>> Just to clarify: have you experienced this outside of OpenStack (or Oslo)?
> >>>>>
> >>>>> We've seen similar issues with rabbitmq and OpenStack. We used to run rabbit through haproxy and tried a myriad of options like setting no timeouts, very very long timeouts, etc., but would always eventually see issues like those described. (The sort of stanza we ran is sketched below.)
> >>>>>
> >>>>> Last month, we reconfigured all OpenStack components to use the `rabbit_hosts` option with all nodes in our cluster listed. So far this has worked well, though I probably just jinxed myself. :)
> >>>>>
> >>>>> We still have other services (like Sensu) using the same rabbitmq cluster and accessing it through haproxy. We've never had any issues there.
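> >>>>> For the curious, that stanza looked roughly like the following (addresses and timeouts are illustrative, from memory, not a recommendation):
> >>>>>
> >>>>>     cat >> /etc/haproxy/haproxy.cfg <<'EOF'
> >>>>>     listen rabbitmq
> >>>>>         bind 192.0.2.10:5672
> >>>>>         mode tcp
> >>>>>         option clitcpka            # TCP keepalives toward the clients
> >>>>>         option srvtcpka            # ...and toward the rabbit nodes
> >>>>>         timeout client 3h          # the "very very long timeouts"
> >>>>>         timeout server 3h
> >>>>>         server rabbit01 192.0.2.11:5672 check inter 5s rise 2 fall 3
> >>>>>         server rabbit02 192.0.2.12:5672 check inter 5s rise 2 fall 3
> >>>>>     EOF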
> >>>>> What's also strange is that I have another OpenStack deployment (from Folsom to Icehouse) with just a single rabbitmq server installed directly on the cloud controller (meaning: no nova-compute). I never have any rabbit issues in that cloud.
> >
> >--
> >Andrew
> >Mirantis
> >Ceph community
_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators