Thank you, Andrew.

on 2015/05/05 08:03, Andrew Beekhof wrote:
>> On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya <bdobre...@mirantis.com> wrote:
>>
>>> Hello,
>> Hello, Zhou
>>
>>> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long
>>> after a power failure. I have a running HA environment, and I reset the
>>> power of all the machines at the same time. I observe that after reboot it
>>> usually takes 10 minutes for the RabbitMQ cluster to appear running in
>>> master-slave mode in pacemaker. If I power off all 3 controllers and
>>> only start 2 of them, the downtime can sometimes be as long as 20 minutes.
>> Yes, this is a known issue [0]. Note, there were many bugfixes, like
>> [1],[2],[3], merged for the MQ OCF script, so you may want to try to
>> backport them as well, following the guide [4].
>>
>> [0] https://bugs.launchpad.net/fuel/+bug/1432603
>> [1] https://review.openstack.org/#/c/175460/
>> [2] https://review.openstack.org/#/c/175457/
>> [3] https://review.openstack.org/#/c/175371/
>> [4] https://review.openstack.org/#/c/170476/
> Is there a reason you’re using a custom OCF script instead of the upstream[a] 
> one?
> Please have a chat with David (the maintainer, in CC) if there is something 
> you believe is wrong with it.
>
> [a] 
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

I'm using the OCF script from the Fuel project, specifically from the
"6.0" stable branch [alpha].

Compared with the upstream OCF code, the main difference is that the Fuel
RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more
bookkeeping, for example, blocking client access when the RabbitMQ cluster
is not ready. After reading the code, I believe the upstream OCF should be
OK to use as well, but it might not fit into the Fuel project. As
far as I have tested, the Fuel OCF script is good, except that sometimes the
full reassemble time is long. As I found out, this is mostly because the
Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ
resource, as I mentioned in the previous emails.

Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
master-slave RabbitMQ. I see that Vladimir and Sergey worked on the original
Fuel blueprint "RabbitMQ cluster" [beta].

[alpha]
https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
[beta]
https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker

>>> I did a little investigation and found some possible causes.
>>>
>>> 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
>>> Pacemaker
>>>
>>> The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
>>> MySQL-wss fails to start after a power failure, and pacemaker waits
>>> 475s before retrying the start. The problem is that pacemaker divides
>>> resource state transitions into batches. Since RabbitMQ is a master-slave
>>> resource, I assume that starting all the slaves and promoting the master
>>> are put into two different batches. If, unfortunately, starting all the
>>> RabbitMQ slaves is put in the same batch as starting MySQL, then even if
>>> the RabbitMQ slaves and all other resources are ready, pacemaker will not
>>> continue but just waits for the MySQL timeout.
>> Could you please elaborate on what the same/different batches for MQ
>> and DB are? Note, there are MQ clustering logic flow charts available here
>> [5] and we're planning to release a dedicated technical bulletin for this.
>>
>> [5] http://goo.gl/PPNrw7
>>
>>> I can reproduce this by hard powering off all the controllers and starting
>>> them again. A MySQL failure is more likely to be triggered this way. Then
>>> I observe that if there is one cloned mysql instance not starting, the
>>> whole pacemaker cluster gets stuck and does not emit any log. On the
>>> host of the failed instance, I can see a mysql resource agent process
>>> calling the sleep command. If I kill that process, pacemaker comes
>>> back alive and the RabbitMQ master gets promoted. In fact this long timeout
>>> is blocking every resource from state transition in pacemaker.
>>>
>>> This may be a known problem of pacemaker, and there are some discussions
>>> on the Linux-HA mailing list [2]. It might not be fixed in the near future.
>>> It seems that in general it is bad to have a long timeout in state
>>> transition actions (start/stop/promote/demote). There may be another way to
>>> implement the MySQL-wss resource agent: use a short start timeout and
>>> monitor the wss cluster state using the monitor action.
>> This is very interesting, thank you! I believe all commands in the MySQL RA
>> OCF script should also be wrapped with timeout -SIGTERM or -SIGKILL,
>> as we did for the MQ RA OCF. And there should not be any sleep calls. I
>> created a bug for this [6].
>>
>> [6] https://bugs.launchpad.net/fuel/+bug/1449542
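
A minimal sketch of the timeout wrapping suggested above, assuming GNU
coreutils timeout is available. The helper name run_with_timeout and the
5 second kill grace period are illustrative assumptions and are not taken
from the Fuel scripts.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a hard time limit.
# $1 is the limit in seconds, the rest is the command to run.
# SIGTERM is sent when the limit is hit, SIGKILL 5 seconds later
# if the command is still alive.
run_with_timeout() {
    limit="$1"
    shift
    timeout -s TERM -k 5 "$limit" "$@"
    rc=$?
    # GNU timeout exits with 124 when the limit was reached.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

With a wrapper like this, a hung call inside the resource agent fails
after a bounded time instead of holding the start action until the
pacemaker operation timeout expires.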
>>
>>> I also found a fix to improve the MySQL start timeout [3]. It shortens the
>>> timeout to 300s. At the time of sending this email, I cannot find it in
>>> the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
>>> stable/6.0?
>>>
>>> [1] https://bugs.launchpad.net/fuel/+bug/1441885
>>> [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
>>> [3] https://review.openstack.org/#/c/171333/
>>>
>>>
>>> 2. RabbitMQ Resource Agent Breaks Existing Cluster
>>>
>>> Reading the code of the RabbitMQ resource agent, I find it does the
>>> following to start the RabbitMQ master-slave cluster.
>>> On all the controllers:
>>> (1) Start Erlang beam process
>>> (2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
>>> (3) Stop RabbitMQ App but do not stop the beam process
>>>
>>> Then in pacemaker, all the RabbitMQ instances are in slave state. After
>>> pacemaker determines the master, it does the following.
>>> On the to-be-master host:
>>> (4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
>>> On the slave hosts:
>>> (5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
>>> (6) Join RabbitMQ cluster of the master host
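
The steps listed above can be sketched roughly as follows. This is not the
code of the Fuel resource agent, only an illustration of the described
flow using standard rabbitmqctl subcommands. The function names, the
RABBITMQCTL variable and the rabbit@<host> node naming are assumptions.

```shell
#!/bin/sh
# Command used to talk to the broker; overridable for dry runs.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# Steps (2), (4), (5): start the RabbitMQ app; if that fails, wipe the
# local mnesia/cluster state and start again as a standalone node.
start_app_or_reset() {
    $RABBITMQCTL start_app && return 0
    # The app must be stopped before a reset; ignore errors here
    # because it may already be stopped.
    $RABBITMQCTL stop_app || :
    $RABBITMQCTL force_reset || return 1
    $RABBITMQCTL start_app
}

# Step (6): join the cluster of the node promoted by pacemaker.
# $1 is the host name of that node.
join_master() {
    master="$1"
    $RABBITMQCTL stop_app || return 1
    $RABBITMQCTL join_cluster "rabbit@${master}" || return 1
    $RABBITMQCTL start_app
}
```

A slave host would call start_app_or_reset and then join_master with the
host name of the promoted node, for example join_master node-1.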
>>>
>> Yes, something like that. As I mentioned, there were several bug fixes
>> in the 6.1 dev, and you can also check the MQ clustering flow charts.
>>
>>> As far as I can understand, this process is to make sure the master
>>> determined by pacemaker is the same as the master determined in RabbitMQ
>>> cluster. If there is no existing cluster, it's fine. If it is run after
>>
>> Not exactly. There is no master in a mirrored MQ cluster. We define the
>> rabbit_hosts configuration option from Oslo.messaging. This ensures that
>> all queue masters will be spread across all of the MQ nodes in the long run.
>> And we use a master abstraction only for the Pacemaker RA clustering layer.
>> Here, a "master" is the MQ node that the rest of the MQ nodes join.
>>
>>> power failure and recovery, it introduces a new problem.
>> We do erase the node master attribute in CIB for such cases. This should
>> not bring problems into the master election logic.
>>
>>> After power recovery, if some of the RabbitMQ instances reach step (2)
>>> at roughly the same time (within 30s, which is hard coded in RabbitMQ) as
>>> the original RabbitMQ master instance, they form the original cluster
>>> again and then shut down. The other instances have to wait for 30s
>>> before they report a failure waiting for tables, and are reset to a
>>> standalone cluster.
>>>
>>> In the RabbitMQ documentation [4], it is also mentioned that if we shut
>>> down the RabbitMQ master, a new master is elected from the remaining
>>> slaves. If we
>> (Note, the RabbitMQ documentation mentions *queue* masters and slaves,
>> which are not the case for the Pacemaker RA clustering abstraction layer.)
>>
>>> continue to shut down nodes in step (3), we reach a point where the last
>>> node is the RabbitMQ master, and pacemaker is not aware of it. I can see
>>> there is code that keeps a "rabbit-start-time" attribute in
>>> pacemaker to record the longest-lived instance to help pacemaker
>>> determine the master, but it does not cover the case mentioned above.
>> We made an assumption that the node with the highest MQ uptime should
>> know the most about the recent cluster state, so other nodes must join it.
>> The RA OCF does not work with queue masters directly.
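
The "highest uptime wins" assumption can be illustrated with a small
sketch. The input format (one "<node> <start time as epoch seconds>" pair
per line) and the function name are assumptions made for this example.
The real agent keeps the value in the "rabbit-start-time" pacemaker
attribute, and how it reads that attribute is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: read "<node> <start-epoch>" lines on stdin and
# print the node with the earliest start time, i.e. the longest uptime.
# Ties are broken by node name so the result is deterministic.
pick_longest_lived() {
    sort -k2,2n -k1,1 | head -n 1 | cut -d' ' -f1
}
```

For example, feeding it the start times of three nodes prints the name of
the node that has been up the longest, which is the one the other nodes
would be expected to join.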
>>
>>> A recent patch [5] checks the existing "rabbit-master" attribute, but it
>>> does not cover the above case either.
>>>
>>> So in step (4), pacemaker determines a different master, which was a
>>> RabbitMQ slave last time. It waits for its original RabbitMQ master
>>> for 30s and fails, then it gets reset to a standalone cluster. Here we
>>> get several different clusters, so in steps (5) and (6) it is likely to
>>> report errors in the log saying it timed out waiting for tables or failed
>>> to merge the mnesia database schema, and then those instances get reset.
>>> You can easily reproduce the case by hard resetting the power of all the
>>> controllers.
>>>
>>> As you can see, if you are unlucky, there would be several "30s timeout
>>> and reset" before you finally get a healthy RabbitMQ cluster.
>> The full MQ cluster reassemble logic is far from perfect,
>> indeed. It might erase all mnesia files, hence any custom entities,
>> like users or vhosts, would be removed as well. Note, we do not
>> configure durable queues for OpenStack, so there is nothing to care about
>> here - the full cluster downtime assumes there will be no AMQP messages
>> stored at all.
>>
>>> I find three possible solutions.
>>> A. Using rabbitmqctl force_boot option [6]
>>> It skips the 30s wait and the cluster reset, and just assumes the
>>> current node is the master and continues to operate. This is feasible
>>> because the original RabbitMQ master would discard its local state and
>>> sync with the new master after it joins a new cluster [7]. So we can be
>>> sure that after steps (4) and (6), the pacemaker-determined master
>>> instance is started unconditionally, and it will be the same as the
>>> RabbitMQ master, and all operations run without the 30s timeout. I find
>>> this option is only available in newer RabbitMQ releases, and updating
>>> RabbitMQ might introduce other compatibility problems.
>> Yes, this option is only supported in the newest RabbitMQ versions. But we
>> definitely should look at how this could help.
>>
>>> B. Turn RabbitMQ into a cloned instance and use pause_minority instead of
>>> autoheal [8]
>> Indeed, there are cases when MQ's autoheal can do nothing with existing
>> partitions and the cluster remains partitioned forever, for example:
>>
>> Masters: [ node-1 ]
>> Slaves: [ node-2 node-3 ]
>> root@node-1:~# rabbitmqctl cluster_status
>> Cluster status of node 'rabbit@node-1' ...
>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
>> {running_nodes,['rabbit@node-1']},
>> {cluster_name,<<"rabbit@node-2">>},
>> {partitions,[]}]
>> ...done.
>> root@node-2:~# rabbitmqctl cluster_status
>> Cluster status of node 'rabbit@node-2' ...
>> [{nodes,[{disc,['rabbit@node-2']}]}]
>> ...done.
>> root@node-3:~# rabbitmqctl cluster_status
>> Cluster status of node 'rabbit@node-3' ...
>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
>> {running_nodes,['rabbit@node-3']},
>> {cluster_name,<<"rabbit@node-2">>},
>> {partitions,[]}]
>>
>> So we should test the pause_minority value as well.
>> But I strongly believe we should make MQ a multi-state clone to support
>> many masters; see the related bp [7].
>>
>> [7]
>> https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
>>
>>> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with
>>> partitions in a manner similar to the pacemaker quorum mechanism. When
>>> there is a network partition, instances in the minority partition pause
>>> themselves automatically. Pacemaker does not have to track who is the
>>> RabbitMQ master, who lives longest, who to promote... It just starts all
>>> the clones, done. This leads to a huge change in the RabbitMQ resource
>>> agent, and the stability and other impacts are yet to be tested.
>> Well, we should not confuse the queue masters with the multi-clone master
>> for the MQ resource in pacemaker.
>> As I said, the pacemaker RA has nothing to do with queue masters. And we
>> introduced this "master" mostly in order to support the full cluster
>> reassemble case - there must be a node promoted and other nodes should join.
>>
>>> C. Creating a "force_load" file
>>> After reading the RabbitMQ source code, I find that what solution A
>>> actually does is just create an empty file named "force_load" in the
>>> mnesia database dir. Then mnesia thinks it was the last node shut down
>>> last time and boots itself as the master. This implementation has stayed
>>> the same from v3.1.4 to the latest RabbitMQ master branch. I think we
>>> can make use of this little trick. The change is adding just one line in
>>> the "try_to_start_rmq_app()" function.
>>>
>>> touch "${MNESIA_FILES}/force_load" && \
>>>  chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"
>> This is a very good point, thank you.
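
The one-liner above could be wrapped in a small helper with basic error
handling. This is only a sketch: the function name mark_force_load, its
arguments, and skipping chown when not running as root are assumptions,
and the rabbitmq:rabbitmq default simply mirrors the original snippet.

```shell
#!/bin/sh
# Hypothetical helper around the force_load trick.
# $1: mnesia directory (MNESIA_FILES in the resource agent)
# $2: owner as user:group, defaults to rabbitmq:rabbitmq
mark_force_load() {
    dir="$1"
    owner="${2:-rabbitmq:rabbitmq}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # Create the empty marker file that mnesia looks for at boot.
    touch "$dir/force_load" || return 1
    # Changing ownership needs root; skip it otherwise.
    if [ "$(id -u)" -eq 0 ]; then
        chown "$owner" "$dir/force_load" || return 1
    fi
}
```

In the resource agent this would be called as
mark_force_load "${MNESIA_FILES}" right before starting the RabbitMQ app.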
>>
>>> [4] http://www.rabbitmq.com/ha.html
>>> [5] https://review.openstack.org/#/c/169291/
>>> [6] https://www.rabbitmq.com/clustering.html
>>> [7] http://www.rabbitmq.com/partitions.html#recovering
>>> [8] http://www.rabbitmq.com/partitions.html#automatic-handling
>>>
>>> Maybe you have better ideas on this. Please share your thoughts.
>> Thank you for the thorough feedback! This was really great work.
>>
>>> ----
>>> Best wishes!
>>> Zhou Zheng Sheng / ???  Software Engineer
>>> Beijing AWcloud Software Co., Ltd.
>>>
>>
>> -- 
>> Best regards,
>> Bogdan Dobrelya,
>> Skype #bogdando_at_yahoo.com
>> Irc #bogdando
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev