> On 12 Nov 2015, at 10:44 PM, Vladimir Kuklin <vkuk...@mirantis.com> wrote:
>
> Hi, Andrew
>
> > Ah good, I understood it correctly then :)
> > I would be interested in your opinion of how the other agent does the
> > bootstrapping (ie. without notifications or master/slave).
>
> > That makes sense, the part I’m struggling with is that it sounds like the
> > other agent shouldn’t work at all.
> > Yet we’ve used it extensively and not experienced these kinds of hangs.
>
> Regarding other scripts - I am not aware of any other scripts that actually
> handle a cloned rabbitmq server. I may be mistaken, of course. So if you are
> aware of scripts that succeed in creating a rabbitmq cluster which actually
> survives 1-node or all-node failure scenarios and reassembles the cluster
> automatically - please let us know.
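For context, a minimal sketch (not taken from either agent) of the kind of
rabbitmqctl sequence an agent has to drive to reassemble a cluster
automatically after a failure; the node names and the choice of seed node are
hypothetical:

    #!/bin/sh
    # Sketch only: rejoin a restarted node to the cluster, or force-boot the
    # designated seed node after an all-node failure. Names are examples.
    SEED_NODE="rabbit@node-1"          # node chosen to bootstrap the cluster
    THIS_NODE="rabbit@$(hostname -s)"

    rabbitmqctl stop_app               # mnesia must be stopped before (re)joining

    if [ "$THIS_NODE" = "$SEED_NODE" ]; then
        # After an all-node failure Mnesia waits for the last node that stopped;
        # force_boot tells this node to start without waiting for any peer.
        rabbitmqctl force_boot
    else
        # An already-clustered node just gets an "already a member" error,
        # which this sketch simply ignores.
        rabbitmqctl join_cluster "$SEED_NODE" || true
    fi

    rabbitmqctl start_app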
The one I linked to in my original reply does:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

> > Changing the state isn’t ideal but there is precedent, the part that has me
> > concerned is the error codes coming out of notify.
> > Apart from producing some log messages, I can’t think how it would produce
> > any recovery.
> >
> > Unless you’re relying on the subsequent monitor operation to notice the
> > error state.
> > I guess that would work but you might be waiting a while for it to notice.
>
> Yes, we are relying on subsequent monitor operations. We also have several
> OCF check levels to catch the case where one node does not have the rabbitmq
> application started properly (btw, there was a strange bug where we had to
> wait for several non-zero checks to fail to get the resource to restart:
> http://bugs.clusterlabs.org/show_bug.cgi?id=5243).

It appears I misunderstood your bug the first time around :-(
Do you still have logs of this occurring?

> I now remember why we did notify errors - for error logging, I guess.
>
>
> On Thu, Nov 12, 2015 at 1:30 AM, Andrew Beekhof <abeek...@redhat.com> wrote:
>
> > On 11 Nov 2015, at 11:35 PM, Vladimir Kuklin <vkuk...@mirantis.com> wrote:
> >
> > Hi, Andrew
> >
> > Let me answer your questions.
> >
> > This agent is active/active but actually marks one of the nodes as a
> > 'pseudo'-master which is used as a target for other nodes to join. We also
> > check which node is the master and use that in the monitor action to check
> > whether this node is clustered with this 'master' node. When we do cluster
> > bootstrap, we need to decide which node to mark as the master node. Then,
> > when it starts (actually, promotes), we can finally pick its name through
> > the notification mechanism and ask other nodes to join this cluster.
>
> Ah good, I understood it correctly then :)
> I would be interested in your opinion of how the other agent does the
> bootstrapping (ie. without notifications or master/slave).
>
> > Regarding disconnect_node+forget_cluster_node this is quite simple - we
> > need to eject the node from the cluster. Otherwise it remains in the list
> > of cluster nodes and a lot of cluster actions, e.g. list_queues, will hang
> > forever, as will the forget_cluster_node action itself.
>
> That makes sense, the part I’m struggling with is that it sounds like the
> other agent shouldn’t work at all.
> Yet we’ve used it extensively and not experienced these kinds of hangs.
>
> > We also handle this case whenever a node leaves the cluster. If you
> > remember, I wrote an email to the Pacemaker ML regarding getting
> > notifications on the node unjoin event '[openstack-dev]
> > [Fuel][Pacemaker][HA] Notifying clones of offline nodes'.
>
> Oh, I recall that now.
>
> > So we went another way and added a dbus daemon listener that does the same
> > when a node leaves the corosync cluster (we know that this is a little bit
> > racy, but the disconnect+forget action pair is idempotent).
> >
> > Regarding notification commands - we changed the behaviour to one that
> > fitted our use cases better and passed our destructive tests. It could be
> > Pacemaker-version dependent, so I agree we should consider changing this
> > behaviour. But so far it has worked for us.
>
> Changing the state isn’t ideal but there is precedent, the part that has me
> concerned is the error codes coming out of notify.
> Apart from producing some log messages, I can’t think how it would produce
> any recovery.
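To make that ejection step concrete, a minimal sketch of the disconnect+forget
pair as it might be run from a surviving node. The node name is a hypothetical
example; disconnect_node is an Erlang built-in, so it is reached via
rabbitmqctl eval rather than a dedicated subcommand:

    #!/bin/sh
    # Sketch only: eject a departed node so that commands like list_queues
    # stop hanging on it. "rabbit@node-2" is a hypothetical example.
    DEAD_NODE="rabbit@node-2"

    # Drop the Erlang distribution link to the dead node first.
    rabbitmqctl eval "disconnect_node('${DEAD_NODE}')."

    # Then remove it from the cluster metadata; repeating this for an
    # already-forgotten node only produces an error, which is ignored here.
    rabbitmqctl forget_cluster_node "$DEAD_NODE" || true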
> > Unless you’re relying on the subsequent monitor operation to notice the
> > error state.
> > I guess that would work but you might be waiting a while for it to notice.
> >
> >
> > On Wed, Nov 11, 2015 at 2:12 PM, Andrew Beekhof <abeek...@redhat.com> wrote:
> >
> > > On 11 Nov 2015, at 6:26 PM, bdobre...@mirantis.com wrote:
> > >
> > > Thank you Andrew.
> > > Answers below.
> > > >>>
> > > Sounds interesting, can you give any comment about how it differs from
> > > the other[i] upstream agent?
> > > Am I right that this one is effectively A/P and won't function without
> > > some kind of shared storage?
> > > Any particular reason you went down this path instead of full A/A?
> > >
> > > [i]
> > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
> > > <<<
> > > It is based on multistate clone notifications. It requires nothing shared
> > > but the Corosync info base (CIB), where all Pacemaker resources are stored
> > > anyway. And it is fully A/A.
> >
> > Oh! So I should skip the A/P parts before "Auto-configuration of a cluster
> > with a Pacemaker"?
> > Is the idea that the master mode is for picking a node to bootstrap the
> > cluster?
> >
> > If so I don’t believe that should be necessary provided you specify
> > ordered=true for the clone.
> > This allows you to assume in the agent that your instance is the only one
> > currently changing state (by starting or stopping).
> > I notice that rabbitmq.com explicitly sets this to false… any particular
> > reason?
> >
> >
> > Regarding the pcs command to create the resource, you can simplify it to
> > the following, provided you also update the stop/start/notify/promote/demote
> > timeouts in the agent’s metadata:
> >
> > pcs resource create --force --master p_rabbitmq-server \
> >   ocf:rabbitmq:rabbitmq-server-ha \
> >   erlang_cookie=DPMDALGUKEOMPTHWPYKC node_port=5672 \
> >   op monitor interval=30 timeout=60 \
> >   op monitor interval=27 role=Master timeout=60 \
> >   op monitor interval=103 role=Slave timeout=60 OCF_CHECK_LEVEL=30 \
> >   meta notify=true ordered=false interleave=true \
> >   master-max=1 master-node-max=1
> >
> >
> > Lines 1602, 1565, 1621, 1632, 1657, and 1678 have the notify command
> > returning an error.
> > Was this logic tested? Because pacemaker does not currently support/allow
> > notify actions to fail.
> > IIRC pacemaker simply ignores them.
> >
> > Modifying the resource state in notifications is also highly unusual.
> > What was the reason for that?
> >
> > I notice that on node down, this agent makes disconnect_node and
> > forget_cluster_node calls.
> > The other upstream agent does not, do you have any information about the
> > bad things that might happen as a result?
> >
> > Basically I’m looking for what each option does differently/better with a
> > view to converging on a single implementation.
> > I don’t much care in which location it lives.
> >
> > I’m CC’ing the other upstream maintainer, it would be good if you guys
> > could have a chat :-)
> >
> > > All running rabbit nodes may process AMQP connections. Master state is
> > > only an initial point for the cluster, at which other slaves may join it.
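As an illustration of that last point about notify, a minimal sketch of the
conventional shape of an OCF notify handler: it only logs and always returns
success, leaving recovery to the next monitor. The function name is
hypothetical; the OCF_RESKEY_CRM_meta_notify_* variables are the standard ones
Pacemaker exports to notify actions.

    #!/bin/sh
    # Sketch only: a notify handler that logs and never reports failure,
    # since Pacemaker ignores non-zero exit codes from notify anyway.
    : "${OCF_ROOT:=/usr/lib/ocf}"
    . "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"

    rmq_notify() {
        # Standard variables Pacemaker exports for every notify call.
        local op_type="${OCF_RESKEY_CRM_meta_notify_type}"         # pre | post
        local op="${OCF_RESKEY_CRM_meta_notify_operation}"         # start|stop|promote|demote
        local active="${OCF_RESKEY_CRM_meta_notify_active_uname}"  # currently active nodes

        ocf_log info "notify: ${op_type}-${op}, active: ${active}"

        # Recovery is left to the subsequent monitor; returning an error
        # here would simply be ignored.
        return $OCF_SUCCESS
    }

Whether the agent then relies on OCF_CHECK_LEVEL-aware monitors (as in the pcs
command above) to notice and repair the error state is the real design
question being discussed here.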
> > > Note, here you can find the event flow charts as well [0]
> > > [0] https://www.rabbitmq.com/pacemaker.html
> > > Regards,
> > > Bogdan
>
> --
> Yours Faithfully,
> Vladimir Kuklin,
> Fuel Library Tech Lead,
> Mirantis, Inc.
> +7 (495) 640-49-04
> +7 (926) 702-39-68
> Skype kuklinvv
> 35bk3, Vorontsovskaya Str.
> Moscow, Russia,
> www.mirantis.com
> www.mirantis.ru
> vkuk...@mirantis.com