> On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya <[email protected]> wrote:
>
>> Hello,
>
> Hello, Zhou
>
>>
>> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long
>> after a power failure. I have a running HA environment, then I reset
>> the power of all the machines at the same time. I observe that after
>> reboot it usually takes 10 minutes for the RabbitMQ cluster to appear
>> running in master-slave mode in pacemaker. If I power off all 3
>> controllers and only start 2 of them, the downtime can sometimes be as
>> long as 20 minutes.
>
> Yes, this is a known issue [0]. Note, there were many bugfixes, like
> [1], [2], [3], merged for the MQ OCF script, so you may want to try
> backporting them as well by following the guide [4].
>
> [0] https://bugs.launchpad.net/fuel/+bug/1432603
> [1] https://review.openstack.org/#/c/175460/
> [2] https://review.openstack.org/#/c/175457/
> [3] https://review.openstack.org/#/c/175371/
> [4] https://review.openstack.org/#/c/170476/
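
(For anyone trying to reproduce the slow recovery described above, a quick
way to watch it from the outside is to poll both the Pacemaker view and
RabbitMQ's own view of the cluster; these are plain pacemaker/rabbitmq CLI
calls, nothing Fuel-specific:

    # One-shot snapshot of the multi-state MQ resource as Pacemaker sees it.
    crm_mon -1 | grep -i -A 3 rabbitmq

    # Cross-check cluster membership from RabbitMQ's point of view on a node.
    rabbitmqctl cluster_status

Comparing the two during recovery makes it easy to see how long the cluster
sits without a promoted master.)
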
Is there a reason you're using a custom OCF script instead of the
upstream [a] one? Please have a chat with David (the maintainer, in CC)
if there is something you believe is wrong with it.

[a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

>
>>
>> I did a little investigation and found some possible causes.
>>
>> 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
>> Pacemaker
>>
>> The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
>> MySQL-wss fails to start after a power failure, and pacemaker waits
>> 475s before retrying to start it. The problem is that pacemaker divides
>> resource state transitions into batches. Since RabbitMQ is a
>> master-slave resource, I assume that starting all the slaves and
>> promoting the master are put into two different batches. If,
>> unfortunately, starting all RabbitMQ slaves is put in the same batch as
>> starting MySQL, then even if the RabbitMQ slaves and all other
>> resources are ready, pacemaker will not continue but just wait for the
>> MySQL timeout.
>
> Could you please elaborate on what the same/different batches are for MQ
> and DB? Note, there are MQ clustering logic flow charts available here
> [5] and we're planning to release a dedicated technical bulletin for this.
>
> [5] http://goo.gl/PPNrw7
>
>>
>> I can reproduce this by hard powering off all the controllers and
>> starting them again. It's more likely to trigger a MySQL failure this
>> way. Then I observe that if there is one cloned mysql instance not
>> starting, the whole pacemaker cluster gets stuck and does not emit any
>> log. On the host of the failed instance, I can see a mysql resource
>> agent process calling the sleep command. If I kill that process, the
>> pacemaker comes back alive and the RabbitMQ master gets promoted. In
>> fact this long timeout is blocking every resource from state transition
>> in pacemaker.
>>
>> This may be a known problem of pacemaker and there are some discussions
>> on the Linux-HA mailing list [2]. It might not be fixed in the near
>> future. It seems that, in general, it's bad to have a long timeout in
>> state transition actions (start/stop/promote/demote). There may be
>> another way to implement the MySQL-wss resource agent: use a short
>> start timeout and monitor the wss cluster state using the monitor
>> action.
>
> This is very interesting, thank you! I believe all commands in the MySQL
> RA OCF script should be wrapped with timeout -SIGTERM or -SIGKILL as
> well, as we did for the MQ RA OCF. And there should not be any sleep
> calls. I created a bug for this [6].
>
> [6] https://bugs.launchpad.net/fuel/+bug/1449542
>
>>
>> I also found a fix to improve the MySQL start timeout [3]. It shortens
>> the timeout to 300s. At the time I am sending this email, I can not
>> find it in the stable/6.0 branch. Maybe the maintainer needs to
>> cherry-pick it to stable/6.0?
>>
>> [1] https://bugs.launchpad.net/fuel/+bug/1441885
>> [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
>> [3] https://review.openstack.org/#/c/171333/
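
(A minimal sketch of the "wrap blocking calls in timeout" idea mentioned
above; the helper names, probe command and limits here are made up for
illustration and are not the actual Fuel RA code. ocf_log, OCF_ERR_GENERIC
and OCF_SUCCESS come from the standard ocf-shellfuncs helpers:

    # Give the probe 60s, then SIGTERM; escalate to SIGKILL 10s later,
    # so a hung mysqld can never stall the whole transition batch.
    check_mysql_alive() {
        timeout --signal=TERM --kill-after=10 60 \
            mysql --defaults-file=/etc/mysql/debian.cnf -e 'SELECT 1;'
    }

    mysql_monitor() {
        if ! check_mysql_alive; then
            ocf_log err "mysql liveness probe timed out or failed"
            return "$OCF_ERR_GENERIC"
        fi
        return "$OCF_SUCCESS"
    }

The point is simply that every external call inside the RA gets a bounded
runtime, so the agent itself can keep a short start/monitor timeout.)
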
>>
>>
>> 2. RabbitMQ Resource Agent Breaks Existing Cluster
>>
>> Reading the code of the RabbitMQ resource agent, I find it does the
>> following to start the RabbitMQ master-slave cluster.
>> On all the controllers:
>> (1) Start the Erlang beam process
>> (2) Start the RabbitMQ app (if it fails, reset mnesia DB and cluster state)
>> (3) Stop the RabbitMQ app but do not stop the beam process
>>
>> Then in pacemaker, all the RabbitMQ instances are in slave state. After
>> pacemaker determines the master, it does the following.
>> On the to-be-master host:
>> (4) Start the RabbitMQ app (if it fails, reset mnesia DB and cluster state)
>> On the slave hosts:
>> (5) Start the RabbitMQ app (if it fails, reset mnesia DB and cluster state)
>> (6) Join the RabbitMQ cluster of the master host
>>
>
> Yes, something like that. As I mentioned, there were several bug fixes
> in the 6.1 dev, and you can also check the MQ clustering flow charts.
>
>> As far as I can understand, this process is to make sure the master
>> determined by pacemaker is the same as the master determined in the
>> RabbitMQ cluster. If there is no existing cluster, it's fine. If it is
>> run after
>
> Not exactly. There is no master in a mirrored MQ cluster. We define the
> rabbit_hosts configuration option from Oslo.messaging, which ensures all
> queue masters will be spread across all of the MQ nodes in the long run.
> And we use a master abstraction only for the Pacemaker RA clustering
> layer. Here, a "master" is the MQ node which the rest of the MQ nodes
> join.
>
>> power failure and recovery, it introduces a new problem.
>
> We do erase the node master attribute in the CIB for such cases. This
> should not bring problems into the master election logic.
>
>>
>> After power recovery, if some of the RabbitMQ instances reach step (2)
>> roughly at the same time (within 30s, which is hard coded in RabbitMQ)
>> as the original RabbitMQ master instance, they form the original
>> cluster again and then shut down. The other instances have to wait for
>> 30s before they report a failure waiting for tables, and get reset to a
>> standalone cluster.
>>
>> In the RabbitMQ documentation [4], it is also mentioned that if we shut
>> down the RabbitMQ master, a new master is elected from the rest of the
>> slaves. If we
>
> (Note, the RabbitMQ documentation mentions *queue* masters and slaves,
> which are not the case for the Pacemaker RA clustering abstraction
> layer.)
>
>> continue to shut down nodes in step (3), we reach a point where the
>> last node is the RabbitMQ master, and pacemaker is not aware of it. I
>> can see there is code keeping a "rabbit-start-time" attribute in
>> pacemaker to record the longest-lived instance and help pacemaker
>> determine the master, but it does not cover the case mentioned above.
>
> We made an assumption that the node with the highest MQ uptime should
> know the most about the recent cluster state, so other nodes must join
> it. The RA OCF does not work with queue masters directly.
>
>> A recent patch [5] checks the existing "rabbit-master" attribute, but
>> it does not cover the above case either.
>>
>> So in step (4), pacemaker determines a different master which was a
>> RabbitMQ slave last time. It would wait for its original RabbitMQ
>> master for 30s and fail, then it gets reset to a standalone cluster.
>> Here we get several different clusters, so in steps (5) and (6), it is
>> likely to report errors in the log saying timeout waiting for tables,
>> or fail to merge the mnesia database schema, and then those instances
>> get reset. You can easily reproduce this case by hard resetting the
>> power of all the controllers.
>>
>> As you can see, if you are unlucky, there would be several "30s timeout
>> and reset" rounds before you finally get a healthy RabbitMQ cluster.
>
> The full MQ cluster reassemble logic is far from a perfect state,
> indeed. This might erase all mnesia files, hence any custom entities,
> like users or vhosts, would be removed as well. Note, we do not
> configure durable queues for OpenStack, so there is nothing to care
> about here - the full cluster downtime assumes there will be no AMQP
> messages stored at all.
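
(If it helps while debugging this, the per-node bookkeeping mentioned
above can be inspected by hand; the attribute name is the one quoted in
this thread, and the exact lifetime/section may differ between RA
versions:

    # Query the transient "rabbit-start-time" attribute on each controller.
    for node in node-1 node-2 node-3; do
        crm_attribute --node "$node" --lifetime reboot \
                      --name rabbit-start-time --query
    done

    # The node Pacemaker actually promoted is visible in the resource view.
    crm_mon -1 | grep -i master

That makes it easier to tell whether Pacemaker's choice of master matches
the node that really was the last one standing.)
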
>>
>> I find three possible solutions.
>>
>> A. Using the rabbitmqctl force_boot option [6]
>> It skips waiting for 30s and resetting the cluster, and just assumes
>> the current node is the master and continues to operate. This is
>> feasible because the original RabbitMQ master would discard its local
>> state and sync with the new master after it joins a new cluster [7]. So
>> we can be sure that after steps (4) and (6), the pacemaker-determined
>> master instance is started unconditionally, it will be the same as the
>> RabbitMQ master, and all operations run without the 30s timeout. I find
>> this option is only available in newer RabbitMQ releases, and updating
>> RabbitMQ might introduce other compatibility problems.
>
> Yes, this option is only supported for the newest RabbitMQ versions. But
> we definitely should look at how this could help.
>
>>
>> B. Turn RabbitMQ into a cloned instance and use pause_minority instead
>> of autoheal [8]
>
> Indeed, there are cases when MQ's autoheal can do nothing with existing
> partitions and remains partitioned forever, for example:
>
> Masters: [ node-1 ]
> Slaves: [ node-2 node-3 ]
> root@node-1:~# rabbitmqctl cluster_status
> Cluster status of node 'rabbit@node-1' ...
> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
>  {running_nodes,['rabbit@node-1']},
>  {cluster_name,<<"rabbit@node-2">>},
>  {partitions,[]}]
> ...done.
> root@node-2:~# rabbitmqctl cluster_status
> Cluster status of node 'rabbit@node-2' ...
> [{nodes,[{disc,['rabbit@node-2']}]}]
> ...done.
> root@node-3:~# rabbitmqctl cluster_status
> Cluster status of node 'rabbit@node-3' ...
> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
>  {running_nodes,['rabbit@node-3']},
>  {cluster_name,<<"rabbit@node-2">>},
>  {partitions,[]}]
>
> So we should test the pause_minority value as well.
> But I strongly believe we should make MQ a multi-state clone to support
> many masters, related bp [7]
>
> [7] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
>
>> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal
>> with partitions in a manner similar to the pacemaker quorum mechanism.
>> When there is a network partition, instances in the minority partition
>> pause themselves automatically. Pacemaker does not have to track who is
>> the RabbitMQ master, who lives longest, who to promote... It just
>> starts all the clones, done. This leads to a huge change in the
>> RabbitMQ resource agent, and the stability and other impacts are still
>> to be tested.
>
> Well, we should not mix up the queue masters and the multi-state clone
> master for the MQ resource in pacemaker.
> As I said, the pacemaker RA has nothing to do with queue masters. And we
> introduced this "master" mostly in order to support the full cluster
> reassemble case - there must be a node promoted and the other nodes
> should join it.
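
(For reference, the RabbitMQ side of solution B boils down to a one-line
rabbitmq.config change; the path below is the usual Debian/Ubuntu
location and may differ in a Fuel deployment:

    %% /etc/rabbitmq/rabbitmq.config -- Erlang terms, note the trailing dot
    [
      {rabbit, [
        {cluster_partition_handling, pause_minority}
      ]}
    ].

    # After restarting rabbitmq-server, confirm the running value:
    rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'

The hard part, as noted above, is reworking the resource agent from a
master-slave resource into a plain clone, not the RabbitMQ setting
itself.)
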
>> >> touch "${MNESIA_FILES}/force_load" && \ >> chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load" > > This is a very good point, thank you. > >> >> [4] http://www.rabbitmq.com/ha.html >> [5] https://review.openstack.org/#/c/169291/ >> [6] https://www.rabbitmq.com/clustering.html >> [7] http://www.rabbitmq.com/partitions.html#recovering >> [8] http://www.rabbitmq.com/partitions.html#automatic-handling >> >> Maybe you have better ideas on this. Please share your thoughts. > > Thank you for a thorough feedback! This was a really great job. > >> >> ---- >> Best wishes! >> Zhou Zheng Sheng / ??? Software Engineer >> Beijing AWcloud Software Co., Ltd. >> > > > -- > Best regards, > Bogdan Dobrelya, > Skype #bogdando_at_yahoo.com > Irc #bogdando > > __________________________________________________________________________ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: [email protected]?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev __________________________________________________________________________ OpenStack Development Mailing List (not for usage questions) Unsubscribe: [email protected]?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
