Saw this message a bit later; I tried to break it down and respond inline.

On 10/19/15 2:24 AM, Ronald van Zantvoort wrote:
> On 19/10/15 11:18, Ronald van Zantvoort wrote:
>> On 16/10/15 00:21, ilya wrote:
>>> I noticed several attempts to address the issue with KVM HA in Jira and
>>> the Dev ML. As we all know, there are many ways to solve the same problem;
>>> on our side, we've given it some thought as well - and it's on our to-do
>>> list.
>>>
>>> Specifically a mail thread "KVM HA is broken, let's fix it"
>>> JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>>> JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8643
>>>
>>> We propose the following solution that in our understanding should cover
>>> all use cases and provide a fencing mechanism.
>>>
>>> NOTE: The proposed IPMI fencing is just a script. If you are using HP
>>> hardware with ILO, it could be an ILO executable with specific
>>> parameters. In theory this can be *any* action script, not just IPMI.
>>>
>>> Please take a few minutes to read this through, to avoid duplicate
>>> efforts...
>>>
>>>
>>> Proposed FS below:
>>> ----------------
>>>
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing
>>>
>>>
>>>
>>
>>
>> Hi Ilya, thanks for the design; I've put a comment in 8943, here it is
>> verbatim as my 5c in the discussion:
>>
> 
> Well, that completely clobbered up the readability LOL
> 
> Let's try again, but see
> https://issues.apache.org/jira/browse/CLOUDSTACK-8943 for the better
> markup ;)
> 
> [~ilya.mailing.li...@gmail.com]: Thanks for the design document. I can't
> comment in Confluence, so here goes:
> 
> * When to fence; [~sweller]: Of course you're right that it should be
> highly unlikely that your storage completely disappears from the
> cluster. Be that as it may, as you yourself note, first of all, if
> you're using NFS without HA that likelihood increases manyfold.
> Secondly, defining it as an unlikely disastrous event is no reason not
> to take it into account; making it a catastrophic event by 'fencing'
> all affected hypervisors will not serve anyone, as it would be
> unexpected and unwelcome.
> * The entire concept of fencing exists to absolutely ensure state,
> specifically the state of the block devices and their data.
> [~shadowsor]: For that same reason it's not reasonable to 'just assume'
> the VMs are gone. There are plenty of failure scenarios that could
> cause an agent to disconnect from the manager while the same VMs keep
> running, and there's nothing stopping CloudStack from starting the same
> VM twice on the same block devices, with disastrous results. That's why
> you *need* to *know* the VMs are *very definitely* not running anymore,
> which is exactly what fencing is supposed to do.
> * For this, IPMI fencing is a nice and very often used option,
> absolutely ensuring a hypervisor has died, and ergo its running VMs. It
> will however not fix the case of the mass-rebooting hypervisors (and
> will quite likely make it even more of an adventure if not addressed
> properly).
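
To make the double-start risk above concrete, here is a minimal sketch of
the kind of gate I have in mind before any HA restart. The class and
interface names are made up for illustration and are not existing
CloudStack code:

// Hypothetical sketch, not existing CloudStack code: an HA restart is only
// allowed once a fence action has positively confirmed the old host is down;
// a disconnected agent alone is never enough, since the same VM could still
// be running and writing to the same block devices.
public class HaRestartGate {

    public interface Fencer {
        /** Returns true only when the host is confirmed powered off. */
        boolean fenceAndConfirm(String hostId);
    }

    private final Fencer fencer;

    public HaRestartGate(Fencer fencer) {
        this.fencer = fencer;
    }

    /** True only if the VM may safely be started on another host. */
    public boolean safeToRestart(String hostId, boolean agentDisconnected) {
        if (!agentDisconnected) {
            return false; // host still looks healthy, nothing to do
        }
        return fencer.fenceAndConfirm(hostId);
    }
}
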
> 
> 
> Now, with all that in mind, I'd like to make the following comments
> regarding [~ilya.mailing.li...@gmail.com]'s design.
> 
> * First, on the IPMI implementation: there is IMHO no need to define
> IPMI (Executable, Start, Stop, Reboot, Blink, Test). IPMI is a
> protocol, and all of these are standard commands. For example, the
> venerable `ipmitool` gives you `chassis power on/off/status/reset`,
> `chassis identify`, etc., which will work on any IPMI device; only
> authentication details (user, password, protocol) differ. There's bound
> to be some library that does it without having to resort to (possibly
> numerous) different (versions of) external binaries.
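
On that ipmitool point, a rough sketch of how thin such a wrapper could be
if we simply shell out to ipmitool. The class below is illustrative only,
and just wraps the standard chassis power subcommands:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Illustrative only: wraps the standard `ipmitool chassis power <action>`
// calls, so only the BMC address and credentials vary per host.
public class IpmiFence {
    private final String bmcHost;
    private final String user;
    private final String password;

    public IpmiFence(String bmcHost, String user, String password) {
        this.bmcHost = bmcHost;
        this.user = user;
        this.password = password;
    }

    private int run(String action) throws IOException, InterruptedException {
        List<String> cmd = Arrays.asList("ipmitool", "-I", "lanplus",
                "-H", bmcHost, "-U", user, "-P", password,
                "chassis", "power", action);
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor();
    }

    /** Hard power-off; returns true when ipmitool exits cleanly. */
    public boolean powerOff() throws IOException, InterruptedException {
        return run("off") == 0;
    }

    /** Power-status query; doubles as a cheap reachability test. */
    public boolean status() throws IOException, InterruptedException {
        return run("status") == 0;
    }
}
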
> 
> * Secondly, you're assuming that hypervisors can access the IPMIs of
> their cluster/pod peers; although I'm not against this assumption per
> se, I'm also not convinced we're serving everybody by forcing that
> assumption to be true; some kind of IPMI agent/proxy comes to mind, or
> even relegating the task back to the manager or some SystemVM. Also
> bear in mind that you need access to those IPMIs to ensure cluster
> functionality, so a failure domain should be put in maintenance state
> if any of its fence devices can't be reached.
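
Along the same lines, a small sketch of that reachability constraint,
building on the hypothetical IpmiFence wrapper above: if any fence device
in a failure domain stops answering a power-status query, the domain should
go into maintenance rather than be trusted for HA.

import java.util.List;

// Sketch only: a failure domain is only usable for HA while every one of
// its fence devices still answers a power-status query.
public class FenceReachabilityCheck {
    public static boolean allReachable(List<IpmiFence> fencers) {
        for (IpmiFence f : fencers) {
            try {
                if (!f.status()) {
                    return false; // BMC unreachable or query rejected
                }
            } catch (Exception e) {
                return false;
            }
        }
        return true;
    }
}
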
> 
> * Thirdly, your proposed testing algorithm needs more discussion;
> after all, it goes straight to the fundamental reasons for *why* to
> fence a host, and that's a lot more than just 'these disks still get
> writes'. In fact, by the time you're checking this, you're probably
> already assuming something's very wrong with the hypervisor, so why not
> just fence it then? The decision to fence should lie with the first
> notification that something is (very) wrong with the hypervisor, and
> only limited attempts should be made to recover it. Say it can't reach
> its storage and that triggers your HA actions; why check the disks
> first? Try to get the storage back up, say 3 times or for 90 seconds or
> so, then fence the host and HA the VMs immediately after confirmation.
> In fact, that's exactly what it's doing now, with the side note that
> confirmation can only reasonably follow after the hypervisor is done
> rebooting.
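
The retry-then-fence flow described above could be as simple as the sketch
below; the timings and the supplier/runnable hooks are placeholders, not a
proposal for concrete CloudStack interfaces:

import java.util.function.BooleanSupplier;

// Illustrative only: retry the storage check a few times, then fence, and
// only hand the VMs to HA after the fence is confirmed.
public class RetryThenFence {
    public static void handle(BooleanSupplier storageReachable,
                              BooleanSupplier fenceAndConfirm,
                              Runnable restartVmsElsewhere)
            throws InterruptedException {
        final int attempts = 3;         // "3 times, or for 90 seconds or so"
        final long waitMillis = 30_000; // 3 x ~30s gives roughly 90s total

        for (int i = 0; i < attempts; i++) {
            if (storageReachable.getAsBoolean()) {
                return;                 // storage came back, no fencing needed
            }
            Thread.sleep(waitMillis);
        }
        if (fenceAndConfirm.getAsBoolean()) {
            restartVmsElsewhere.run();  // HA only after confirmed power-off
        }
    }
}
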
> 
> * Finally, as mentioned, you're not solving the 'oh look, my storage
> is gone, let's fence' x N problem; in the case of a failing NFS:
>   ** Every host will start IPMI-resetting every other hypervisor; by
> then there's a good chance every hypervisor in all connected clusters
> is rebooting, leaving a state where there are no hypervisors left in
> the cluster to fence others; that in turn should push the cluster into
> maintenance state, which will set off even more bells & whistles.
>   ** They'll come back, find the NFS still gone, and continue resetting
> each other like there's no tomorrow.
>   ** Support staff, already panicking over the NFS/network outage, now
> have to deal with entire clusters of hypervisors in perpetual reboot,
> as well as clusters which are completely unreachable because there's no
> one left to check state; all this while the outage might simply require
> reverting some inadvertent network ACL snafu.
> Although I well understand [~sweller]'s concerns regarding agent
> complexity in this regard, quorum is the standard way of solving that
> problem. On the other hand, once the agents start talking to each other
> and the manager over some standard messaging API/bus, this problem
> might well be solved for you; adding, say, Gossip or Paxos or any other
> clustering/quorum protocol shouldn't be that hard considering the
> amount of Java software out there already doing just that.
>   ** Another idea would be to introduce some other kind of storage
> monitoring, for example by a SystemVM or something.
>   ** If you'll insist on the 'clusters fence themselves' paradigm, you
> could maybe also introduce a constraint that a node is only allowed to
> fence others if it is itself healthy; ergo, if it doesn't have all its
> storages available, it doesn't get to fence others whose storage isn't
> available.
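
For what it's worth, the 'only fence others if you are healthy yourself'
constraint from that last point might look roughly like this; the storage
check is a placeholder, not an existing CloudStack interface:

import java.util.Collection;

// Sketch only: a node may fence a peer only while all of its own primary
// storage pools are reachable; a node with degraded storage stays out of it.
public class SelfHealthFencePolicy {

    public interface StorageCheck {
        boolean reachable(String storagePoolId);
    }

    private final StorageCheck check;

    public SelfHealthFencePolicy(StorageCheck check) {
        this.check = check;
    }

    public boolean mayFence(Collection<String> myStoragePoolIds) {
        for (String pool : myStoragePoolIds) {
            if (!check.reachable(pool)) {
                return false; // our own storage is degraded: do not fence peers
            }
        }
        return true;
    }
}
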
> 
