Thanks so much for the info - we'll look at that line as well! I'll let you know when we create a PR for our changes, in case you want to review them for your environment.
-----Original Message-----
From: Andrija Panic [mailto:andrija.pa...@gmail.com]
Sent: Tuesday, February 20, 2018 5:16 PM
To: dev <dev@cloudstack.apache.org>
Subject: Re: HA issues

That is good to hear (no NFS issues causing agent disconnects). I assume
you are using a "normal" NFS solution with proper HA and no ZFS (kernel
panics etc.), but either way, be aware of this one:

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161

We used to comment out this line, because we did have some issues with
the communication link, and commenting it out saved our a$$ a few times :)

Cheers

On 20 February 2018 at 20:50, Sean Lair <sl...@ippathways.com> wrote:
> Hi Andrija
>
> We are currently running XenServer in production. We are working on
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces, and that is
> how it discovered how broken VM HA is in 4.9.3. Initially our patches
> fixed VM HA, but that just caused VMs to get started on two hosts
> during failure testing. The libvirt lockd has solved that issue thus
> far.
>
> The short answer to your question is :-) that we were not having
> problems with agent disconnects in a production environment. It was
> our testing/QA that revealed the issues. Our NFS has been stable so
> far - no issues with the agent crashing/stopping that weren't
> initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -----Original Message-----
> From: Andrija Panic [mailto:andrija.pa...@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev <dev@cloudstack.apache.org>
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd...) - so, did
> you manage to understand what can cause the agent disconnect in most
> cases, for you specifically? Is there any software (CloudStack) root
> cause (disregarding e.g. networking issues etc.)?
>
> Just our examples, which you should probably not have:
>
> We had a CEPH cluster running (with ACS), and there any exception in
> librbd would crash the JVM and the agent, but this has been mostly
> fixed. Now we get e.g. an agent disconnect when ACS tries to delete a
> volume on CEPH and for some reason does not succeed within 30 minutes
> (volume deletion fails) - then libvirt gets completely stuck (even
> "virsh list" doesn't work), so the agent gets disconnected
> eventually.
>
> It would be good to get rid of agent disconnections in general,
> obviously :) so that is why I'm asking (you are on NFS, so I would
> like to hear about your experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting
> > fenced and coming up on two hosts because of VM HA. However, we
> > updated some of the logic for VM HA and turned on libvirtd's
> > locking mechanism. Now we are working great w/o IPMI. The locking
> > stops the VMs from starting elsewhere, and everything recovers very
> > nicely when the host starts responding again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it
> > may work alongside IPMI just fine - it would just affect the
> > fencing. However, we *currently* prefer how we are doing it now,
> > because if the agent stops responding but the host is still up, the
> > VMs continue running and no actual downtime is incurred.
> > Even when VM HA attempts to power on the VMs on another host, it
> > just fails the power-up and the VMs continue to run on the "agent
> > disconnected" host. The host goes into alarm state and our NOC can
> > look into what is wrong with the agent on the host. If IPMI was
> > enabled, it sounds like it would power off the host (fence) and
> > force downtime for us even if the VMs were actually running OK -
> > and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do
> > some testing before then.
> >
> > -----Original Message-----
> > From: Marcus [mailto:shadow...@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI
> > configured, nor host HA enabled, correct? In this case, the correct
> > thing to do is nothing. If CloudStack cannot guarantee the VM state
> > (as is the case with an unreachable hypervisor), it should do
> > nothing, for fear of causing a split brain and corrupting the VM
> > disk (VM running on two hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or
> > any other cluster manager) is not configured to, or cannot,
> > guarantee state, then things will simply lock up - in this case
> > your HA VM on your broken hypervisor will not run elsewhere. This
> > has been the case for a long time with CloudStack: HA would only
> > start a VM after the original hypervisor agent came back and
> > reported that no VM was running.
> >
> > The new feature, from what I gather, simply adds the possibility of
> > CloudStack being able to reach out and shut down the hypervisor to
> > guarantee state. At that point it can start the VM elsewhere. If
> > something fails in that process (IPMI unreachable, for example, or
> > bad credentials), you're still going to be stuck with a VM not
> > coming back.
> >
> > It's the nature of the thing. I'd be wary of any HA solution that
> > does not reach out and guarantee state via host or storage fencing
> > before starting a VM elsewhere, as it would be making assumptions.
> > It's entirely possible that a VM might be unreachable or unable to
> > access its storage for a short while, a new instance of the VM is
> > started elsewhere, and then the original VM comes back.
> >
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <n...@li.nux.ro> wrote:
> >
> > > Hi Rohit,
> > >
> > > I've reinstalled and tested. Still no go with VM HA.
> > >
> > > What I did was to kernel panic that particular HV ("echo c >
> > > /proc/sysrq-trigger" <- this is a proper way to simulate a
> > > crash). What happened next is the HV got marked as "Alert", the
> > > VM on it was marked as "Running" the whole time, and it was not
> > > migrated to another HV. Once the panicked HV had booted back, the
> > > VM rebooted and became available.
> > >
> > > I'm running CentOS 7 mgmt + HVs and NFS primary and secondary
> > > storage. The VM has an HA-enabled service offering. Host HA and
> > > OOBM configuration were not touched.
> > >
> > > Full log: http://tmp.nux.ro/W3s-management-server.log
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > > > From: "Rohit Yadav" <rohit.ya...@shapeblue.com>
> > > > To: "dev" <dev@cloudstack.apache.org>
> > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > Subject: Re: HA issues
> > >
> > > > I performed VM HA sanity checks and was not able to reproduce
> > > > any regression against two KVM CentOS7 hosts in a cluster.
> > > >
> > > > Without the "Host HA" feature, I deployed a few HA-enabled VMs
> > > > on a KVM host2 and killed it (powered it off). After a few
> > > > minutes of CloudStack trying to work out why the host (KVM
> > > > agent) timed out, CloudStack kicked off the investigators, which
> > > > eventually led the KVM fencers to act; the VM HA job kicked in
> > > > to start those few VMs on host1, and the KVM host2 was put into
> > > > the "Down" state.
> > > >
> > > > - Rohit
> > > >
> > > > <https://cloudstack.apache.org>
> > > >
> > > > ________________________________
> > > >
> > > > rohit.ya...@shapeblue.com
> > > > www.shapeblue.com
> > > > 53 Chandos Place, Covent Garden, London WC2N 4HS, UK @shapeblue
> > > >
> > > > From: Rohit Yadav
> > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Hi Lucian,
> > > >
> > > > The "Host HA" feature is entirely different from VM HA; however,
> > > > they may work in tandem, so please stop using the terms
> > > > interchangeably, as it may cause the community to believe a
> > > > regression has been introduced.
> > > >
> > > > The "Host HA" feature currently ships with only one "Host HA"
> > > > provider, for KVM, and it is strictly tied to out-of-band
> > > > management (IPMI for fencing, i.e. power off, and recovery, i.e.
> > > > reboot) and NFS (as primary storage). (We also have a provider
> > > > for the simulator, but that's for coverage/testing purposes.)
> > > >
> > > > Therefore, "Host HA" for KVM (+NFS) currently works only when
> > > > OOBM is enabled. The framework allows interested parties to
> > > > write their own HA providers for a hypervisor, which can use a
> > > > different strategy/mechanism for fencing/recovery of hosts
> > > > (including writing a non-IPMI-based OOBM plugin) and a host/disk
> > > > activity checker that is not NFS-based.
> > > >
> > > > The "Host HA" feature ships disabled by default and does not
> > > > interfere with VM HA. However, when enabled and configured
> > > > correctly, it is a known limitation that when it is unable to
> > > > successfully perform recovery or fencing tasks, it may not
> > > > trigger VM HA. We can discuss how to handle such cases
> > > > (thoughts?). "Host HA" will try a couple of times to recover
> > > > and, failing to do so, will eventually trigger a host fencing
> > > > task. If it is unable to fence the host, it will indefinitely
> > > > attempt to fence it (the host state will be stuck at the fencing
> > > > state in the cloud.ha_config table, for example) and alerts will
> > > > be sent to the admin, who can manually intervene to handle such
> > > > situations (if you have email/SMTP enabled, you should see alert
> > > > emails).
> > > >
> > > > We can discuss how to improve this and find a workaround for the
> > > > case you've hit - thanks for sharing.
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Nux! <n...@li.nux.ro>
> > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Ok, reinstalled and re-tested.
> > > >
> > > > What I've learned:
> > > >
> > > > - HA only works now if OOB is configured; the old way of doing
> > > > HA no longer applies. This can be good and bad - not everyone
> > > > has IPMI.
> > > >
> > > > - HA only works if IPMI is reachable. I've pulled the cord on a
> > > > HV and HA failed to do its thing, leaving me with a HV down
> > > > along with all the VMs running there. That's bad.
> > > > I've opened this ticket for it:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > >
> > > > Let me know if you need any extra info or stuff to test.
> > > >
> > > > Regards,
> > > > Lucian
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > >> From: "Nux!" <n...@li.nux.ro>
> > > >> To: "dev" <dev@cloudstack.apache.org>
> > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > >> Subject: Re: HA issues
> > > >
> > > >> I'll reinstall my setup and try again, just to be sure I'm
> > > >> working on a clean slate.
> > > >>
> > > >> --
> > > >> Sent from the Delta quadrant using Borg technology!
> > > >>
> > > >> Nux!
> > > >> www.nux.ro
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Rohit Yadav" <rohit.ya...@shapeblue.com>
> > > >>> To: "dev" <dev@cloudstack.apache.org>
> > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > >>> Subject: Re: HA issues
> > > >>
> > > >>> Hi Lucian,
> > > >>>
> > > >>> If you're talking about the new Host HA feature
> > > >>> (KVM+NFS+IPMI), please refer to the following docs:
> > > >>>
> > > >>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > >>>
> > > >>> We'll need you to look at the logs, and perhaps create a JIRA
> > > >>> ticket with the logs and details? If you saw an IPMI-based
> > > >>> reboot, then Host HA indeed tried to recover, i.e. reboot,
> > > >>> the host; once Host HA has done its work, it will schedule HA
> > > >>> for the VMs as soon as the recovery operation succeeds (we
> > > >>> have simulator- and KVM-based Marvin tests for such
> > > >>> scenarios).
> > > >>>
> > > >>> Can you see it making an attempt to schedule VM HA in the
> > > >>> logs, or any failure?
> > > >>>
> > > >>> - Rohit
> > > >>>
> > > >>> <https://cloudstack.apache.org>
> > > >>>
> > > >>> ________________________________
> > > >>> From: Nux! <n...@li.nux.ro>
> > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > >>> To: dev
> > > >>> Subject: [4.11] HA issues
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I see there's a new HA engine for KVM and IPMI support, which
> > > >>> is really nice; however, it seems hit and miss.
> > > >>> I have created an instance with an HA offering and kernel
> > > >>> panicked one of the hypervisors - after a while the server
> > > >>> was rebooted, probably via IPMI, but the instance never moved
> > > >>> to a running hypervisor, and even after the original
> > > >>> hypervisor came back it was still left in the Stopped state.
> > > >>> Are there any extra things I need to set up to have proper
> > > >>> HA?
> > > >>>
> > > >>> Regards,
> > > >>> Lucian
> > > >>>
> > > >>> --
> > > >>> Sent from the Delta quadrant using Borg technology!
> > > >>>
> > > >>> Nux!
> > > >>> www.nux.ro
> > > >>>
> > > >>> rohit.ya...@shapeblue.com
> > > >>> www.shapeblue.com
> > > >>> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> > >
> > > @shapeblue
>
>
> --
> Andrija Panić

--
Andrija Panić