Saw this message a bit later; I tried to break it down and respond below.
On 10/19/15 2:24 AM, Ronald van Zantvoort wrote:
> On 19/10/15 11:18, Ronald van Zantvoort wrote:
>> On 16/10/15 00:21, ilya wrote:
>>> I noticed several attempts to address the issue with KVM HA in Jira and
>>> the Dev ML. As we all know, there are many ways to solve the same problem;
>>> on our side, we've given it some thought as well, and it's on our to-do list.
>>>
>>> Specifically, a mail thread "KVM HA is broken, let's fix it"
>>> JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>>> JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8643
>>>
>>> We propose the following solution, which in our understanding should cover
>>> all use cases and provide a fencing mechanism.
>>>
>>> NOTE: The proposed IPMI fencing is just a script. If you are using HP
>>> hardware with iLO, it could be an iLO executable with specific parameters.
>>> In theory this can be *any* action script, not just IPMI.
>>>
>>> Please take a few minutes to read this through, to avoid duplicate efforts...
>>>
>>> Proposed FS below:
>>> ----------------
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing
>>
>> Hi Ilya, thanks for the design; I've put a comment in 8943, here it is
>> verbatim as my 5c in the discussion:
>
> Well, that completely clobbered the readability LOL
>
> Let's try again, but see
> https://issues.apache.org/jira/browse/CLOUDSTACK-8943 for the better markup ;)
>
> [~ilya.mailing.li...@gmail.com]: Thanks for the design document. I can't
> comment in Confluence, so here goes:
>
> * When to fence; [~sweller]: Of course you're right that it should be
> highly unlikely that your storage completely disappears from the cluster.
> Be that as it may, as you yourself note, first of all, if you're using
> NFS without HA that likelihood increases manyfold.
> Secondly, defining it as an unlikely disastrous event seems no reason not
> to take it into account; making it a catastrophic event by 'fencing' all
> affected hypervisors will serve no one, as it would be unexpected and unwelcome.
> * The entire concept of fencing exists to absolutely ensure state,
> specifically the state of the block devices and their data.
> [~shadowsor]: For that same reason it's not reasonable to 'just assume'
> the VMs are gone. There are a ton of failure domains that could cause an
> agent to disconnect from the manager while its VMs keep running, and
> there's nothing stopping CloudStack from starting the same VM twice on
> the same block devices, with disastrous results. That's why you *need* to
> *know* the VMs are *very definitely* not running anymore, which is
> exactly what fencing is supposed to do.
> * For this, IPMI fencing is a nice and very often used option, absolutely
> ensuring a hypervisor has died, and ergo its running VMs with it. It
> will, however, not fix the case of the mass-rebooting hypervisors (and
> will quite likely make it even more of an adventure if not addressed properly).
>
> Now, with all that in mind, I'd like to make the following comments
> regarding [~ilya.mailing.li...@gmail.com]'s design.
>
> * First, the IPMI implementation: there is IMHO no need to define IPMI
> (Executable, Start, Stop, Reboot, Blink, Test). IPMI is a protocol, and
> all of these are standard commands. For example, the venerable `ipmitool`
> gives you `chassis power on|off|reset|status`, `chassis identify`, etc.,
> which will work against any IPMI device; only the authentication details
> (user, password, protocol) differ. There's bound to be a library that
> does this without resorting to (possibly numerous) different (versions
> of) external binaries.
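Agreed on the protocol point. To make it concrete, here's a rough sketch (mine, not from the FS; Python, with made-up host/credential parameter names) of how one generic helper could cover all the proposed actions through plain `ipmitool`, instead of defining a separate executable per action:

```python
# Hypothetical sketch: map the proposed per-vendor actions onto the
# standard ipmitool chassis commands. Only authentication details
# (host, user, password) would vary per machine.
IPMI_ACTIONS = {
    "status": ["chassis", "power", "status"],
    "start":  ["chassis", "power", "on"],
    "stop":   ["chassis", "power", "off"],
    "reboot": ["chassis", "power", "reset"],
    "blink":  ["chassis", "identify"],
}

def ipmi_command(host, user, password, action):
    """Build the ipmitool argv for a standard chassis action."""
    return (["ipmitool", "-I", "lanplus", "-H", host,
             "-U", user, "-P", password]
            + IPMI_ACTIONS[action])

# The result could then be handed to subprocess.run(), or replaced
# outright by a native IPMI library to avoid external binaries.
```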
> * Secondly, you're assuming that hypervisors can access the IPMIs of
> their cluster/pod peers; although I'm not against this assumption per se,
> I'm also not convinced we're serving everybody by forcing that assumption
> to be true; some kind of IPMI agent/proxy comes to mind, or even
> relegating the task back to the manager or some SystemVM. Also bear in
> mind that you need access to those IPMIs to ensure cluster functionality,
> so a failure domain should go into maintenance state if any of the fence
> devices can't be reached.
>
> * Thirdly, your proposed testing algorithm needs more discussion; after
> all, it goes straight to the fundamental reasons for *why* to fence a
> host, and that's a lot more than just 'these disks still get writes'. In
> fact, by the time you're checking this, you're probably already assuming
> something's very wrong with the hypervisor, so why not just fence it
> then? The decision to fence should lie with the first notification that
> something is (very) wrong with the hypervisor, and only limited attempts
> should be made to recover it. Say it can't reach its storage and that
> gets you your HA actions; why check the disks first? Try to get the
> storage back up, say 3 times or for 90 seconds or so, then fence the
> fucker and HA the VMs immediately after confirmation. In fact, that's
> exactly what it's doing now, with the side note that confirmation can
> only reasonably follow after the hypervisor is done rebooting.
>
> * Finally, as mentioned, you're not solving the 'oh look, my storage is
> gone, let's fence' * (N) problem; in the case of a failing NFS:
> ** Every host will start IPMI-resetting every other hypervisor; by then
> there's a good chance every hypervisor in all connected clusters is
> rebooting, leaving a state where there are no hypervisors left in the
> cluster to fence others; that in turn should drop the cluster into
> maintenance state, which will set off even more bells & whistles.
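Before going on to the cascade scenario, the 'limited attempts, then fence' flow above could be sketched roughly like this; `check_storage` and `fence` are hypothetical stand-ins I made up for illustration, not CloudStack APIs:

```python
import time

def try_then_fence(check_storage, fence, retries=3, delay=30):
    """Retry a health check a few times, then fence the host.

    retries * delay roughly matches the '3 times or 90 seconds'
    suggestion. Returns True if storage recovered, False if the
    host was fenced (after which the VMs can be HA'd).
    """
    for attempt in range(retries):
        if check_storage():
            return True          # storage is back; no fencing needed
        if attempt < retries - 1:
            time.sleep(delay)    # wait before the next attempt
    fence()                      # out of attempts: fence, then HA the VMs
    return False
```

The point is that the fencing decision is driven by the original failure notification, with a small bounded recovery window, rather than by a separate disk-write probe afterwards.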
> ** They'll come back, find the NFS still gone, and continue resetting
> each other like there's no tomorrow.
> ** Support staff, already panicking over the NFS/network outage, now have
> to deal with entire clusters of hypervisors in perpetual reboot, as well
> as clusters that are completely unreachable because there's no one left
> to check state; all this while the outage might simply require reverting
> some inadvertent network ACL snafu.
>
> Although I well understand [~sweller]'s concerns regarding agent
> complexity in this regard, quorum is the standard way of solving that
> problem. On the other hand, once the agents start talking to each other
> and the manager over some standard messaging API/bus, this problem might
> well be solved for you; adding, say, Gossip or Paxos or any other
> clustering/quorum protocol shouldn't be that hard, considering the amount
> of Java software out there already doing just that.
> ** Another idea would be to introduce some other kind of storage
> monitoring, for example by a SystemVM or something.
> ** If you insist on the 'clusters fence themselves' paradigm, you could
> also introduce a constraint that a node is only allowed to fence others
> if it is itself healthy; ergo, if it doesn't have all its storages
> available, it doesn't get to fence others whose storage isn't available.
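Those last two ideas compose naturally, for what it's worth. A minimal sketch of the combined gate, assuming a peer vote count is available from whatever gossip/quorum layer gets picked (all names here are hypothetical):

```python
def may_fence(self_storages_ok, peer_votes, cluster_size):
    """Gate a fencing action on the node's own health and on quorum.

    A node that has lost its own storage never fences peers, and
    fencing additionally requires agreement from a strict majority of
    the cluster, so a lone partitioned node can't reset everyone else.
    """
    if not self_storages_ok:
        return False                 # unhealthy nodes don't fence others
    quorum = cluster_size // 2 + 1
    return peer_votes >= quorum      # act only with majority agreement
```

With a shared-storage failure taking out the whole cluster, no node is healthy and no quorum forms, so nobody fences anybody and the mass-reboot cascade above can't start.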