Re: [openstack-dev] [Cinder] A possible solution for HA Active-Active

Philipp Marek Wed, 05 Aug 2015 00:13:05 -0700

> >Well, is it already decided that Pacemaker would be chosen to provide HA in
> >Openstack? There's been a talk "Pacemaker: the PID 1 of Openstack" IIRC.
> >
> >I know that Pacemaker's been pushed aside in an earlier ML post, but IMO
> >there's already *so much* been done for HA in Pacemaker that Openstack
> >should just use it.
> >
> >All HA nodes need to participate in a Pacemaker cluster - and if one node
> >loses its connection, all services will get stopped automatically (by
> >Pacemaker) - or the node gets fenced.
> >
> >
> >No need to invent some sloppy scripts to do exactly the tasks (badly!) that
> >the Linux HA Stack has been providing for quite a few years.
> So just a piece of information, but Yahoo (the company I work for, with VMs
> in the tens of thousands, bare metal in much more than that...) hasn't
> used Pacemaker, and in all honesty this is the first project (OpenStack)
> I have heard of that needs such a solution. I feel that we really should
> be building our services better so that they can be A-A vs having to depend
> on another piece of software to get around our 'sloppiness' (for lack of a
> better word).
> 
> Nothing against Pacemaker personally... IMHO it just doesn't feel like we
> are doing this right if we need such a product in the first place.
Well, Pacemaker is *the* Linux HA Stack.

So, before trying to achieve similar goals by self-written scripts (and
having to re-discover all the gotchas involved), it would be much better to
learn from previous experiences - even if they are not one's own.

Pacemaker has, e.g., the concept of clones[1] - these define services that run
multiple instances within a cluster. And behold! the instances get some
Pacemaker-internal unique id[2], which can be used to do sharding.

Yes, that still means that upon service or node crash the failed instance
has to be started on some other node; but as that'll typically be up and
running already, the startup time should be in the range of seconds.
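
To make the sharding idea a bit more concrete, here is a rough sketch (not
a tested implementation) of how a cloned service could decide which items
it owns. It assumes the resource agent hands its environment on to the
service, so that OCF_RESKEY_CRM_meta_clone (the instance number) and
OCF_RESKEY_CRM_meta_clone_max (the configured number of instances) are
visible. The function names and the "vol-..." keys are made up for
illustration.

```python
import hashlib
import os


def clone_identity(env=os.environ):
    """Return (instance_number, total_instances) as set by Pacemaker.

    Assumption: the resource agent passes these variables on to the service.
    """
    instance = int(env["OCF_RESKEY_CRM_meta_clone"])
    total = int(env["OCF_RESKEY_CRM_meta_clone_max"])
    return instance, total


def shard_of(key, total):
    """Map a key to a shard number in [0, total) with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total


def owns(key, env=os.environ):
    """True if this clone instance is responsible for the given key."""
    instance, total = clone_identity(env)
    return shard_of(key, total) == instance


if __name__ == "__main__":
    # Hypothetical environment for instance 1 of 3.
    fake_env = {
        "OCF_RESKEY_CRM_meta_clone": "1",
        "OCF_RESKEY_CRM_meta_clone_max": "3",
    }
    for volume in ("vol-a", "vol-b", "vol-c"):
        print(volume, owns(volume, fake_env))
```

Because the instance numbers are not guaranteed to be an unbroken sequence
(see [2]), a shard stays unserved while its instance is not running anywhere.
That is the window Pacemaker closes by restarting the failed instance on
another node.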

We'd instantly get
* a supervisor to start/stop/restart/fence/monitor the service(s)
* node/service failure detection
* only small changes needed in the services
* and all that in tested software that's available in all distributions,
and that already has its own test suite...

If we decide that this solution won't fulfill all our expectations, fine -
let's use something else.

But I don't think it makes *any* sense to try to redo some (existing)
High-Availability code in some quickly written scripts, just because it
looks easy - there are quite a few traps for the unwary.

Ad 1:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-clone.html
Ad 2: OCF_RESKEY_CRM_meta_clone; that's not guaranteed to be an unbroken
sequence, though.

