Thank you Andrew.

On 2015/05/05 08:03, Andrew Beekhof wrote:
>> On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya <bdobre...@mirantis.com> wrote:
>>
>>> Hello,
>>
>> Hello, Zhou
>>
>>> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long
>>> after power failure. I have a running HA environment, then I reset the
>>> power of all the machines at the same time. I observe that after reboot
>>> it usually takes 10 minutes for the RabbitMQ cluster to appear running
>>> in master-slave mode in pacemaker. If I power off all the 3 controllers
>>> and only start 2 of them, the downtime can sometimes be as long as 20
>>> minutes.
>>
>> Yes, this is a known issue [0]. Note, there were many bugfixes, like
>> [1], [2], [3], merged for the MQ OCF script, so you may want to backport
>> them as well by following the guide [4]
>>
>> [0] https://bugs.launchpad.net/fuel/+bug/1432603
>> [1] https://review.openstack.org/#/c/175460/
>> [2] https://review.openstack.org/#/c/175457/
>> [3] https://review.openstack.org/#/c/175371/
>> [4] https://review.openstack.org/#/c/170476/
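For reference, backporting one of those reviews onto a local stable/6.0
checkout would look roughly like the sketch below. This is not the official
procedure from [4]; the fetch URL and the patchset suffix "/1" are my
guesses, so check the Gerrit review page for the exact ref.

# Sketch: cherry-pick the MQ OCF fix from review 175460 onto stable/6.0.
git clone https://github.com/stackforge/fuel-library
cd fuel-library
git checkout -b mq-backports origin/stable/6.0
# Gerrit publishes changes as refs/changes/<last-2-digits>/<change>/<patchset>
git fetch https://review.openstack.org/stackforge/fuel-library \
    refs/changes/60/175460/1
git cherry-pick FETCH_HEAD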
> Is there a reason you're using a custom OCF script instead of the
> upstream [a] one?
> Please have a chat with David (the maintainer, in CC) if there is something
> you believe is wrong with it.
>
> [a]
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

I'm using the OCF script from the Fuel project, specifically from the "6.0"
stable branch [alpha]. Compared with the upstream OCF code, the main
difference is that the Fuel RabbitMQ OCF is a master-slave resource. The
Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access
when the RabbitMQ cluster is not ready. Having read the code, I believe the
upstream OCF should be OK to use as well, but it might not fit into the Fuel
project. As far as I have tested, the Fuel OCF script is good, except that
sometimes the full reassemble time is long, and as I found out, that is
mostly because the Fuel MySQL Galera OCF script keeps pacemaker from
promoting the RabbitMQ resource, as I mentioned in the previous emails.

Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
master-slave RabbitMQ. I see Vladimir and Sergey worked on the original Fuel
blueprint "RabbitMQ cluster" [beta].

[alpha] https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
[beta] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker

>>> I did a little investigation and found out there are some possible
>>> causes.
>>>
>>> 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
>>> Pacemaker
>>>
>>> The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
>>> MySQL-wss fails to start after power failure, and pacemaker would wait
>>> 475s before retrying to start it. The problem is that pacemaker divides
>>> resource state transitions into batches. Since RabbitMQ is a
>>> master-slave resource, I assume that starting all the slaves and
>>> promoting the master are put into two different batches. If,
>>> unfortunately, starting all the RabbitMQ slaves is put in the same
>>> batch as the MySQL start, then even if the RabbitMQ slaves and all
>>> other resources are ready, pacemaker will not continue but just wait
>>> for the MySQL timeout.
>>
>> Could you please elaborate on what the same/different batches are for MQ
>> and DB? Note, there are MQ clustering logic flow charts available here
>> [5], and we're planning to release a dedicated technical bulletin for
>> this.
>>
>> [5] http://goo.gl/PPNrw7
>>
>>> I can reproduce this by hard powering off all the controllers and
>>> starting them again. It's more likely to trigger the MySQL failure this
>>> way. Then I observe that if there is one cloned mysql instance not
>>> starting, the whole pacemaker cluster gets stuck and does not emit any
>>> log. On the host of the failed instance, I can see a mysql resource
>>> agent process calling the sleep command. If I kill that process,
>>> pacemaker comes back alive and the RabbitMQ master gets promoted. In
>>> fact, this long timeout is blocking every resource from state
>>> transition in pacemaker.
>>>
>>> This may be a known problem of pacemaker, and there are some
>>> discussions on the Linux-HA mailing list [2]. It might not be fixed in
>>> the near future. It seems that, in general, it's bad to have long
>>> timeouts in state transition actions (start/stop/promote/demote). There
>>> may be another way to implement the MySQL-wss resource agent: use a
>>> short start timeout and monitor the wss cluster state using the monitor
>>> action.
>>
>> This is very interesting, thank you! I believe all commands in the MySQL
>> RA OCF script should be wrapped with timeout -SIGTERM or -SIGKILL as
>> well, as we did for the MQ RA OCF. And there should not be any sleep
>> calls. I created a bug for this [6].
>>
>> [6] https://bugs.launchpad.net/fuel/+bug/1449542
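To illustrate the timeout wrapping idea, a minimal sketch follows. The
function name and the mysql invocation are made up for illustration and are
not taken from the actual Fuel RA:

# Run a potentially hanging command with a hard deadline: send SIGTERM
# after the given number of seconds, then SIGKILL 10 seconds later if the
# command is still alive (GNU coreutils timeout).
run_with_deadline() {
    local deadline="$1"; shift
    timeout --signal=TERM --kill-after=10 "${deadline}" "$@"
}

# e.g. in the monitor action, instead of a bare (possibly hanging) call:
run_with_deadline 30 mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"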
>>> I also find a fix to improve the MySQL start timeout [3]. It shortens
>>> the timeout to 300s. At the time I am sending this email, I can not
>>> find it in the stable/6.0 branch. Maybe the maintainer needs to
>>> cherry-pick it to stable/6.0?
>>>
>>> [1] https://bugs.launchpad.net/fuel/+bug/1441885
>>> [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
>>> [3] https://review.openstack.org/#/c/171333/
>>>
>>>
>>> 2. RabbitMQ Resource Agent Breaks the Existing Cluster
>>>
>>> Reading the code of the RabbitMQ resource agent, I find it does the
>>> following to start the RabbitMQ master-slave cluster.
>>> On all the controllers:
>>> (1) Start the Erlang beam process
>>> (2) Start the RabbitMQ app (if this fails, reset the mnesia DB and
>>> cluster state)
>>> (3) Stop the RabbitMQ app but do not stop the beam process
>>>
>>> Then in pacemaker, all the RabbitMQ instances are in slave state. After
>>> pacemaker determines the master, it does the following.
>>> On the to-be-master host:
>>> (4) Start the RabbitMQ app (if this fails, reset the mnesia DB and
>>> cluster state)
>>> On the slave hosts:
>>> (5) Start the RabbitMQ app (if this fails, reset the mnesia DB and
>>> cluster state)
>>> (6) Join the RabbitMQ cluster of the master host
>>
>> Yes, something like that. As I mentioned, there were several bug fixes
>> in the 6.1 dev, and you can also check the MQ clustering flow charts.
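To make the sequence above concrete, here is roughly what it means in
rabbitmqctl terms. This is a simplified sketch: the real OCF script wraps
each step in extra bookkeeping, rabbitmqctl only accepts join_cluster while
the app is stopped (so steps (5) and (6) are effectively swapped here), and
"rabbit@node-1" stands for whichever node pacemaker promotes.

# Phase 1, on every controller (all become pacemaker slaves):
rabbitmq-server -detached     # (1) start the Erlang beam process
rabbitmqctl start_app         # (2) start the RabbitMQ app
                              #     (on failure the RA resets mnesia)
rabbitmqctl stop_app          # (3) stop the app, keep beam running

# Phase 2, on the node pacemaker promotes:
rabbitmqctl start_app                      # (4) become the cluster seed

# Phase 2, on each remaining node (the app is still stopped):
rabbitmqctl join_cluster rabbit@node-1     # (6) join the master's cluster
rabbitmqctl start_app                      # (5) then start the app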
>>> As far as I can understand, this process is to make sure the master
>>> determined by pacemaker is the same as the master determined in the
>>> RabbitMQ cluster. If there is no existing cluster, it's fine. If it is
>>> run after
>>
>> Not exactly. There is no master in a mirrored MQ cluster. We define the
>> rabbit_hosts configuration option from Oslo.messaging, which ensures all
>> queue masters will be spread around all of the MQ nodes in the long run.
>> And we use a master abstraction only for the Pacemaker RA clustering
>> layer. Here, a "master" is the MQ node that the rest of the MQ nodes
>> join.
>>
>>> power failure and recovery, it introduces a new problem.
>>
>> We do erase the node master attribute in the CIB for such cases. This
>> should not bring problems into the master election logic.
>>
>>> After power recovery, if some of the RabbitMQ instances reach step (2)
>>> roughly at the same time (within 30s, which is hard coded in RabbitMQ)
>>> as the original RabbitMQ master instance, they form the original
>>> cluster again and then shut down. The other instances would have to
>>> wait for 30s before they report a failure waiting for tables, and they
>>> are reset to standalone clusters.
>>>
>>> In the RabbitMQ documentation [4], it is also mentioned that if we shut
>>> down the RabbitMQ master, a new master is elected from the rest of the
>>> slaves. If we
>>
>> (Note, the RabbitMQ documentation mentions *queue* masters and slaves,
>> which are not the case for the Pacemaker RA clustering abstraction
>> layer.)
>>
>>> continue to shut down nodes in step (3), we reach a point where the
>>> last node is the RabbitMQ master, and pacemaker is not aware of it. I
>>> can see there is code bookkeeping a "rabbit-start-time" attribute in
>>> pacemaker to record the longest-lived instance to help pacemaker
>>> determine the master, but it does not cover the case mentioned above.
>>
>> We made an assumption that the node with the highest MQ uptime should
>> know the most about the recent cluster state, so other nodes must join
>> it. The RA OCF does not work with queue masters directly.
>>
>>> A recent patch [5] checks the existing "rabbit-master" attribute, but
>>> it does not cover the above case either.
>>>
>>> So in step (4), pacemaker determines a different master, which was a
>>> RabbitMQ slave last time. It would wait for its original RabbitMQ
>>> master for 30s and fail, then it gets reset to a standalone cluster.
>>> Here we get several different clusters, so in steps (5) and (6), they
>>> are likely to report errors in the log saying timeout waiting for
>>> tables, or fail to merge the mnesia database schema, and then those
>>> instances get reset. You can easily reproduce the case by hard
>>> resetting the power of all the controllers.
>>>
>>> As you can see, if you are unlucky, there would be several rounds of
>>> "30s timeout and reset" before you finally get a healthy RabbitMQ
>>> cluster.
>>
>> The full MQ cluster reassemble logic is far from the perfect state,
>> indeed. This might erase all mnesia files, hence any custom entities,
>> like users or vhosts, would be removed as well. Note, we do not
>> configure durable queues for OpenStack, so there is nothing to care
>> about here - the full cluster downtime assumes there will be no AMQP
>> messages stored at all.
>>
>>> I find three possible solutions.
>>>
>>> A. Using the rabbitmqctl force_boot option [6]
>>> It skips the 30s wait and the cluster reset, and just assumes the
>>> current node is the master and continues to operate. This is feasible
>>> because the original RabbitMQ master would discard its local state and
>>> sync with the new master after it joins a new cluster [7]. So we can be
>>> sure that after steps (4) and (6), the pacemaker-determined master
>>> instance is started unconditionally, it will be the same as the
>>> RabbitMQ master, and all operations run without the 30s timeout. I find
>>> this option is only available in newer RabbitMQ releases, and updating
>>> RabbitMQ might introduce other compatibility problems.
>>
>> Yes, this option is only supported in the newest RabbitMQ versions. But
>> we definitely should look at how this could help.
>>
>>> B. Turn RabbitMQ into a cloned instance and use pause_minority instead
>>> of autoheal [8]
>>
>> Indeed, there are cases when MQ's autoheal can do nothing with existing
>> partitions and remains partitioned forever, for example:
>>
>> Masters: [ node-1 ]
>> Slaves: [ node-2 node-3 ]
>> root@node-1:~# rabbitmqctl cluster_status
>> Cluster status of node 'rabbit@node-1' ...
>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
>>  {running_nodes,['rabbit@node-1']},
>>  {cluster_name,<<"rabbit@node-2">>},
>>  {partitions,[]}]
>> ...done.
>> root@node-2:~# rabbitmqctl cluster_status
>> Cluster status of node 'rabbit@node-2' ...
>> [{nodes,[{disc,['rabbit@node-2']}]}]
>> ...done.
>> root@node-3:~# rabbitmqctl cluster_status
>> Cluster status of node 'rabbit@node-3' ...
>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
>>  {running_nodes,['rabbit@node-3']},
>>  {cluster_name,<<"rabbit@node-2">>},
>>  {partitions,[]}]
>>
>> So we should test the pause_minority value as well.
>> But I strongly believe we should make MQ a multi-state clone to support
>> many masters, related bp [7]
>>
>> [7] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
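For what it's worth, trying pause_minority only needs one key in
rabbitmq.config. The sketch below assumes an otherwise empty config file at
the default path; Fuel generates this file from puppet templates, so in
practice the key would be merged into the existing settings instead:

# Switch partition handling from autoheal to pause_minority
# (Erlang terms; note the trailing dot on the last line).
cat > /etc/rabbitmq/rabbitmq.config <<'EOF'
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
EOF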
>>> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal
>>> with partitions in a manner similar to the pacemaker quorum mechanism.
>>> When there is a network partition, instances in the minority partition
>>> pause themselves automatically. Pacemaker does not have to track who is
>>> the RabbitMQ master, who lives longest, whom to promote... It just
>>> starts all the clones, done. This leads to a huge change in the
>>> RabbitMQ resource agent, and the stability and other impacts are yet to
>>> be tested.
>>
>> Well, we should not mix up the queue masters and the multi-clone master
>> for the MQ resource in pacemaker.
>> As I said, the pacemaker RA has nothing to do with queue masters. And we
>> introduced this "master" mostly in order to support the full cluster
>> reassemble case - there must be a node promoted, and the other nodes
>> should join it.
>>
>>> C. Creating a "force_load" file
>>> After reading the RabbitMQ source code, I find that the actual thing it
>>> does in solution A is just creating an empty file named "force_load" in
>>> the mnesia database dir; mnesia then thinks it was the last node shut
>>> down last time and boots itself as the master. This implementation has
>>> stayed the same from v3.1.4 to the latest RabbitMQ master branch. I
>>> think we can make use of this little trick. The change is adding just
>>> one line in the "try_to_start_rmq_app()" function.
>>>
>>> touch "${MNESIA_FILES}/force_load" && \
>>>   chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"
>>
>> This is a very good point, thank you.
>>
>>> [4] http://www.rabbitmq.com/ha.html
>>> [5] https://review.openstack.org/#/c/169291/
>>> [6] https://www.rabbitmq.com/clustering.html
>>> [7] http://www.rabbitmq.com/partitions.html#recovering
>>> [8] http://www.rabbitmq.com/partitions.html#automatic-handling
>>>
>>> Maybe you have better ideas on this. Please share your thoughts.
>>
>> Thank you for the thorough feedback! This was a really great job.
>>
>>> ----
>>> Best wishes!
>>> Zhou Zheng Sheng / Software Engineer
>>> Beijing AWcloud Software Co., Ltd.
>>
>> --
>> Best regards,
>> Bogdan Dobrelya,
>> Skype #bogdando_at_yahoo.com
>> Irc #bogdando

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev