Hi Sean, (we have 2 threads interleaving on the libvirt lockd...) - so, did you manage to understand what causes the Agent Disconnect in most cases, for you specifically? Is there any software (CloudStack) root cause (disregarding e.g. networking issues etc.)?
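Side note: one quick thing to check when chasing these (just a sketch, the timeout value is arbitrary) is whether libvirtd itself still answers - a wedged libvirtd will sooner or later take the agent down with it:

    # if virsh hangs, libvirtd is stuck and the agent will disconnect eventually
    timeout 10 virsh list --all || echo "libvirtd unresponsive"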
Just our examples, which you probably don't have: we had a CEPH cluster running (with ACS), and there any exception in librbd would crash the JVM and the agent, but this has mostly been fixed. Now we get e.g. an agent disconnect when ACS tries to delete a volume on CEPH and for some reason doesn't succeed within 30 minutes (the volume deletion fails) - then libvirt gets completely stuck (even "virsh list" doesn't work)... so the agent gets disconnected eventually. It would be good to get rid of agent disconnections in general, obviously :) and that is why I'm asking (you are on NFS, so I would like to hear about your experience here).

Thanks

On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:

> We were in the same situation as Nux.
>
> In our test environment we hit the issue with VMs not getting fenced and
> coming up on two hosts because of VM HA. However, we updated some of the
> logic for VM HA and turned on libvirtd's locking mechanism. Now we are
> working great w/o IPMI. The locking stops the VMs from starting elsewhere,
> and everything recovers very nicely when the host starts responding again.
>
> We are on 4.9.3 and haven't started testing with 4.11 yet, but it may work
> alongside IPMI just fine - it would just affect the fencing. However, we
> *currently* prefer how we are doing it now, because if the agent stops
> responding but the host is still up, the VMs continue running and no
> actual downtime is incurred. Even when VM HA attempts to power on the VMs
> on another host, it just fails the power-up and the VMs continue to run on
> the "agent disconnected" host. The host goes into alarm state and our NOC
> can look into what is wrong with the agent on the host. If IPMI was
> enabled, it sounds like it would power off the host (fence) and force
> downtime for us even if the VMs were actually running OK and just the
> agent was unreachable.
>
> I plan on submitting our updates via a pull request at some point. But I
> can also send the updated code to anyone that wants to do some testing
> before then.
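For reference, the libvirtd locking Sean mentions is lockd. A minimal sketch of the setup - the lockspace directory is my assumption, and it must live on storage shared by all hosts (e.g. the NFS primary), otherwise the locks protect nothing across hosts:

    # /etc/libvirt/qemu.conf
    lock_manager = "lockd"

    # /etc/libvirt/qemu-lockd.conf - indirect leases; every host must see
    # the same directory, so put it on shared storage
    file_lockspace_dir = "/var/lib/libvirt/lockd/files"

    # then, on each host:
    systemctl enable --now virtlockd
    systemctl restart libvirtd

With this in place, a second host trying to start the same VM fails to acquire the disk lease instead of silently corrupting the disk.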
> -----Original Message-----
> From: Marcus [mailto:shadow...@gmail.com]
> Sent: Friday, February 16, 2018 11:27 AM
> To: dev@cloudstack.apache.org
> Subject: Re: HA issues
>
> From your other emails it sounds as though you do not have IPMI
> configured, nor host HA enabled, correct? In this case, the correct thing
> to do is nothing. If CloudStack cannot guarantee the VM state (as is the
> case with an unreachable hypervisor), it should do nothing, for fear of
> causing a split brain and corrupting the VM disk (VM running on two hosts).
>
> Clustering and fencing is a tricky proposition. When CloudStack (or any
> other cluster manager) is not configured to or cannot guarantee state,
> then things will simply lock up; in this case your HA VM on your broken
> hypervisor will not run elsewhere. This has been the case for a long time
> with CloudStack: HA would only start a VM after the original hypervisor
> agent came back and reported that no VM is running.
>
> The new feature, from what I gather, simply adds the possibility of
> CloudStack being able to reach out and shut down the hypervisor to
> guarantee state. At that point it can start the VM elsewhere. If something
> fails in that process (IPMI unreachable, for example, or bad credentials),
> you're still going to be stuck with a VM not coming back.
>
> It's the nature of the thing. I'd be wary of any HA solution that does not
> reach out and guarantee state via host or storage fencing before starting
> a VM elsewhere, as it will be making assumptions. It's entirely possible
> that a VM might be unreachable or unable to access its storage for a short
> while, a new instance of the VM is started elsewhere, and then the
> original VM comes back.
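Concretely, the "reach out and guarantee state" step Marcus describes is a power fence before failover. Done by hand it would look roughly like this sketch (BMC address and credentials are placeholders):

    # fence the unreachable host via its BMC first...
    ipmitool -I lanplus -H 10.0.0.15 -U ADMIN -P secret chassis power off
    # ...and only once the power-off is confirmed is it safe to start the
    # affected VMs on another host

If the power-off cannot be confirmed, the safe move is exactly what Marcus says: do nothing and alert.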
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <n...@li.nux.ro> wrote:
> >
> > Hi Rohit,
> >
> > I've reinstalled and tested. Still no go with VM HA.
> >
> > What I did was to kernel panic that particular HV ("echo c >
> > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > What happened next is the HV got marked as "Alert", the VM on it was
> > marked as "Running" the whole time, and it was not migrated to another
> > HV. Once the panicked HV had booted back, the VM rebooted and became
> > available.
> >
> > I'm running CentOS 7 mgmt + HVs and NFS primary and secondary storage.
> > The VM has an HA-enabled service offering.
> > Host HA and OOBM configuration were not touched.
> >
> > Full log http://tmp.nux.ro/W3s-management-server.log
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> > ----- Original Message -----
> > > From: "Rohit Yadav" <rohit.ya...@shapeblue.com>
> > > To: "dev" <dev@cloudstack.apache.org>
> > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > Subject: Re: HA issues
> > >
> > > I performed VM HA sanity checks and was not able to reproduce any
> > > regression against two KVM CentOS7 hosts in a cluster.
> > >
> > > Without the "Host HA" feature, I deployed a few HA-enabled VMs on a
> > > KVM host2 and killed it (powered off). After a few minutes of
> > > CloudStack attempting to find out why the host (kvm agent) timed out,
> > > CloudStack kicked off investigators, which eventually led the KVM
> > > fencers to work; the VM HA job kicked in to start those few VMs on
> > > host1, and the KVM host2 was put into the "Down" state.
> > >
> > > - Rohit
> > >
> > > rohit.ya...@shapeblue.com
> > > www.shapeblue.com
> > > 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> > > @shapeblue
> > >
> > > ________________________________
> > > From: Rohit Yadav
> > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > > Hi Lucian,
> > >
> > > The "Host HA" feature is entirely different from VM HA; however, they
> > > may work in tandem, so please stop using the terms interchangeably, as
> > > it may cause the community to believe a regression has been caused.
> > >
> > > The "Host HA" feature currently ships with only a "Host HA" provider
> > > for KVM that is strictly tied to out-of-band management (IPMI for
> > > fencing, i.e. power off, and recovery, i.e. reboot) and NFS (as
> > > primary storage). (We also have a provider for the simulator, but
> > > that's for coverage/testing purposes.)
> > >
> > > Therefore, "Host HA" for KVM (+NFS) currently works only when OOBM is
> > > enabled. The framework allows interested parties to write their own HA
> > > providers for a hypervisor, using a different strategy/mechanism for
> > > fencing/recovery of hosts (including writing a non-IPMI based OOBM
> > > plugin) and a host/disk activity checker that is non-NFS based.
> > >
> > > The "Host HA" feature ships disabled by default and does not cause any
> > > interference with VM HA. However, when enabled and configured
> > > correctly, it is a known limitation that when it is unable to
> > > successfully perform recovery or fencing tasks it may not trigger VM
> > > HA. We can discuss how to handle such cases (thoughts?). "Host HA"
> > > will try a couple of times to recover a host and, failing to do so, it
> > > will eventually trigger a host fencing task. If it's unable to fence a
> > > host, it will indefinitely attempt to fence it (the host will be stuck
> > > in the fencing state in the cloud.ha_config table, for example) and
> > > alerts will be sent to the admin, who can do some manual intervention
> > > to handle such situations (if you've got email/SMTP enabled, you
> > > should see alert emails).
> > >
> > > We can discuss how to improve this and work around the case you've
> > > hit; thanks for sharing.
> > >
> > > - Rohit
> > >
> > > ________________________________
> > > From: Nux! <n...@li.nux.ro>
> > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > > Ok, reinstalled and re-tested.
> > >
> > > What I've learned:
> > >
> > > - HA only works now if OOB is configured; the old way of HA no longer
> > > applies - this can be good and bad, not everyone has IPMIs
> > >
> > > - HA only works if IPMI is reachable. I've pulled the cord on a HV and
> > > HA failed to do its thing, leaving me with a HV down along with all
> > > the VMs running there. That's bad.
> > > I've opened this ticket for it:
> > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > >
> > > Let me know if you need any extra info or stuff to test.
> > >
> > > Regards,
> > > Lucian
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > >> From: "Nux!" <n...@li.nux.ro>
> > >> To: "dev" <dev@cloudstack.apache.org>
> > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > >> Subject: Re: HA issues
> > >>
> > >> I'll reinstall my setup and try again, just to be sure I'm working on
> > >> a clean slate.
> > >>
> > >> --
> > >> Sent from the Delta quadrant using Borg technology!
> > >>
> > >> Nux!
> > >> www.nux.ro
> > >>
> > >> ----- Original Message -----
> > >>> From: "Rohit Yadav" <rohit.ya...@shapeblue.com>
> > >>> To: "dev" <dev@cloudstack.apache.org>
> > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > >>> Subject: Re: HA issues
> > >>>
> > >>> Hi Lucian,
> > >>>
> > >>> If you're talking about the new Host HA feature (with KVM+NFS+IPMI),
> > >>> please refer to the following docs:
> > >>>
> > >>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
> > >>>
> > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > >>>
> > >>> We'll need you to look at the logs and perhaps create a JIRA ticket
> > >>> with the logs and details. If you saw an IPMI-based reboot, then
> > >>> Host HA indeed tried to recover, i.e. reboot, the host; once Host HA
> > >>> has done its work it will schedule HA for the VMs as soon as the
> > >>> recovery operation succeeds (we have simulator- and KVM-based Marvin
> > >>> tests for such scenarios).
> > >>>
> > >>> Can you see it making an attempt to schedule VM HA in the logs, or
> > >>> any failure?
> > >>>
> > >>> - Rohit
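Since Host HA only engages once OOBM is configured, here is the rough enabling sequence, plus the state check Rohit mentions above. A sketch only: the API names are per 4.11, but the exact parameters and the KVM provider name are my assumptions, so double-check against the docs linked above:

    # in cloudmonkey:
    configure outofbandmanagement hostid=<host-uuid> driver=ipmitool address=<bmc-ip> port=623 username=ADMIN password=<secret>
    enable outofbandmanagementforhost hostid=<host-uuid>
    configure haforhost hostid=<host-uuid> provider=kvmhaprovider
    enable haforhost hostid=<host-uuid>

    # on the management server, to see where Host HA thinks each host is
    # (a host that cannot be fenced stays in the fencing state until an
    # admin steps in):
    mysql -u cloud -p -e 'select * from cloud.ha_config;'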
> > >>> ________________________________
> > >>> From: Nux! <n...@li.nux.ro>
> > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > >>> To: dev
> > >>> Subject: [4.11] HA issues
> > >>>
> > >>> Hi,
> > >>>
> > >>> I see there's a new HA engine for KVM and IPMI support, which is
> > >>> really nice; however, it seems hit and miss.
> > >>> I have created an instance with an HA offering and kernel panicked
> > >>> one of the hypervisors - after a while the server was rebooted,
> > >>> probably via IPMI, but the instance never moved to a running
> > >>> hypervisor, and even after the original hypervisor came back it was
> > >>> still left in the Stopped state.
> > >>> Are there any extra things I need to set up to have proper HA?
> > >>>
> > >>> Regards,
> > >>> Lucian
> > >>>
> > >>> --
> > >>> Sent from the Delta quadrant using Borg technology!
> > >>>
> > >>> Nux!
> > >>> www.nux.ro
> > >>>
> > >>> rohit.ya...@shapeblue.com
> > >>> www.shapeblue.com
> > >>> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> > >>> @shapeblue

--
Andrija Panić