> >Well, is it already decided that Pacemaker would be chosen to provide HA in
> >Openstack? There's been a talk "Pacemaker: the PID 1 of Openstack" IIRC.
> >
> >I know that Pacemaker's been pushed aside in an earlier ML post, but IMO
> >there's already *so much* been done for HA in Pacemaker that Openstack
> >should just use it.
> >
> >All HA nodes need to participate in a Pacemaker cluster - and if one node
> >loses connection, all services will get stopped automatically (by
> >Pacemaker) - or the node gets fenced.
> >
> >
> >No need to invent some sloppy scripts to do exactly the tasks (badly!) that
> >the Linux HA Stack has been providing for quite a few years.
> So just a piece of information, but yahoo (the company I work for, with vms
> in the tens of thousands, baremetal in the much more than that...) hasn't
> used pacemaker, and in all honesty this is the first project (openstack)
> that I have heard that needs such a solution. I feel that we really should
> be building our services better so that they can be A-A vs having to depend
> on another piece of software to get around our 'sloppiness' (for lack of a
> better word).
>
> Nothing against pacemaker personally... IMHO it just doesn't feel like we
> are doing this right if we need such a product in the first place.

Well, Pacemaker is *the* Linux HA Stack.
So, before trying to achieve similar goals with self-written scripts (and
having to rediscover all the gotchas involved), it would be much better to
learn from previous experience - even if it is not one's own.

Pacemaker has e.g. the concept of clones[1] - these define services that
run multiple instances within a cluster. And behold! The instances get a
Pacemaker-internal unique id[2], which can be used to do sharding.

Yes, that still means that upon a service or node crash the failed instance
has to be started on some other node; but as that node will typically be up
and running already, the startup time should be in the range of seconds.

We'd instantly get
 * a supervisor to start/stop/restart/fence/monitor the service(s)
 * node/service failure detection
 * only small changes needed in the services
 * and all that in tested software that's available in all
   distributions, and that already has its own testsuite...

If we decide that this solution won't fulfill all our expectations, fine -
let's use something else.

But I don't think it makes *any* sense to try to redo some (existing)
High-Availability code in some quickly written scripts, just because it
looks easy - there are quite a few traps for the unwary.

Ad 1: http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-clone.html
Ad 2: OCF_RESKEY_CRM_meta_clone; that's not guaranteed to be an unbroken
      sequence, though.
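
To make the sharding idea a bit more concrete, here is a rough sketch (not
anything that exists in OpenStack today) of how a service could decide which
work items belong to its clone instance. It assumes the resource agent hands
OCF_RESKEY_CRM_meta_clone on to the service, and that the clone-max meta
attribute is likewise visible as OCF_RESKEY_CRM_meta_clone_max; the helper
names owns() and instance_from_env() are made up for illustration.

```python
import os
import zlib


def owns(item_key: str, clone_id: int, clone_max: int) -> bool:
    """Return True if the clone instance `clone_id` should handle `item_key`.

    crc32 is used because it is stable across processes and hosts,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(item_key.encode("utf-8")) % clone_max == clone_id


def instance_from_env(env=os.environ):
    """Read (clone_id, clone_max) from the environment.

    Assumption: the resource agent exports both variables to the service.
    If clone_max is missing we fall back to a single shard.
    """
    clone_id = int(env["OCF_RESKEY_CRM_meta_clone"])
    clone_max = int(env.get("OCF_RESKEY_CRM_meta_clone_max", "1"))
    return clone_id, clone_max


if __name__ == "__main__":
    # Hypothetical environment for the third of three clone instances.
    fake_env = {
        "OCF_RESKEY_CRM_meta_clone": "2",
        "OCF_RESKEY_CRM_meta_clone_max": "3",
    }
    cid, cmax = instance_from_env(fake_env)
    mine = [k for k in (f"vm-{i}" for i in range(10)) if owns(k, cid, cmax)]
    print(f"instance {cid}/{cmax} handles: {mine}")
```

Since the ids are not guaranteed to be an unbroken sequence (see Ad 2), a
shard whose instance is currently down simply stays unserved until Pacemaker
has restarted that instance somewhere; the modulo is taken over clone-max,
not over the number of instances that happen to be running.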