Hi again Simon, thanks for these. We also had something committed (actually the whole RBD snap deletion logic on the CEPH side, which was initially missing): https://github.com/apache/cloudstack/pull/1230/commits - some related stuff was also handled there.
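For context, the core of that cleanup logic - a from-memory sketch against the rados-java bindings, not a verbatim copy of the committed code, with monHost/key/pool/name as placeholder parameters - goes roughly like this:

import java.util.List;

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class RbdDeleteSketch {
    // Unprotect and remove all snapshots of an RBD image before removing the
    // image itself - librbd refuses to delete an image that still has snapshots.
    public static void deleteImage(String monHost, String key, String pool, String name) throws Exception {
        Rados r = new Rados("admin");
        r.confSet("mon_host", monHost);
        r.confSet("key", key);
        r.connect();
        IoCTX io = r.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(name);
            try {
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                // close the image even if a snapshot op throws, so the agent
                // JVM does not keep a wedged librbd handle around
                rbd.close(image);
            }
            rbd.remove(name);
        } finally {
            // always destroy the rados IO context - leaking it on exceptions
            // was exactly the kind of thing that used to wedge the agent
            r.ioCtxDestroy(io);
        }
    }
}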
But what we have here is, afaik, a new case: a customer tries to delete a large volume on CEPH (4 TB in our case, or a bit smaller - this has happened a few times), and the deletion takes a long time (for whatever reason...). This happens during the actual delete command sent from MGMT to the AGENT, so it is not a "lazy delete" with a later purge thread. The deletion process itself times out after 30 minutes (1800 sec - I guess this is the default "wait" global parameter), and after this libvirt just hangs (kill -9 is the only way to restart libvirtd). I.e. the delete volume command is sent:

2018-02-14 15:20:53,032 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Trying to fetch storage pool 8457c284-cf5d-3979-b82e-32ea5efeb97b from libvirt
2018-02-14 15:20:53,032 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) Looking for libvirtd connection at: qemu:///system
2018-02-14 15:20:53,041 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully refreshed pool 8457c284-cf5d-3979-b82e-32ea5efeb97b Capacity: 235312757125120 Used: 44773027414768 Available: 99561505730560
2018-02-14 15:20:53,190 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Attempting to remove volume 84c12d6f-7536-429a-8994-1b860446b672 from pool 8457c284-cf5d-3979-b82e-32ea5efeb97b
2018-02-14 15:20:53,190 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of image cold-storage/84c12d6f-7536-429a-8994-1b860446b672 prior to removing the image
2018-02-14 15:20:53,202 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at mon.xxxxyyyy.local:6789
2018-02-14 15:20:53,216 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Fetching list of snapshots of RBD image cold-storage/84c12d6f-7536-429a-8994-1b860446b672
2018-02-14 15:20:53,224 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully unprotected and removed any snapshots of cold-storage/84c12d6f-7536-429a-8994-1b860446b672 Continuing to remove the RBD image
2018-02-14 15:20:53,228 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully closed rbd image and destroyed io context.
2018-02-14 15:20:53,229 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Instructing libvirt to remove volume 84c12d6f-7536-429a-8994-1b860446b672 from pool 8457c284-cf5d-3979-b82e-32ea5efeb97b

Then, 30 minutes later, it times out.
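(If that 1800 seconds really is the global "wait" setting - an assumption on my part, and note it does nothing for the underlying slow delete or the libvirt hang - one stopgap for these large-volume cases could be to raise it, e.g. via cloudmonkey:

  update configuration name=wait value=7200

and then restart the management server, since I don't think "wait" is picked up dynamically.)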
The timeout in the MGMT server log:

2018-02-14 15:50:53,030 WARN [c.c.a.m.AgentAttache] (catalina-exec-4:ctx-468724bd ctx-b9984210) (logid:d23624d1) Seq 16-3001086201689154455: *Timed out on Seq 16-3001086201689154455*: { Cmd , MgmtId: 90520740254323, via: 16(eq4-c2-2), Ver: v1, Flags: 100011, [{"org.apache.cloudstack.storage.command.DeleteCommand":{"data":{"org.apache.cloudstack.storage.to.VolumeObjectTO":{"uuid":"84c12d6f-7536-429a-8994-1b860446b672","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"8457c284-cf5d-3979-b82e-32ea5efeb97b","id":1,"poolType":"RBD","host":"mon.xxxxyyyy.local","path":"cold-storage","port":6789,"url":"RBD://mon.xxxxyyyy.local/cold-storage/?ROLE=Primary&STOREUUID=8457c284-cf5d-3979-b82e-32ea5efeb97b"}},"name":"PRDRMSSQL01-DATA-DR","size":1073741824000,"path":"84c12d6f-7536-429a-8994-1b860446b672","volumeId":13889,"accountId":722,"format":"RAW","provisioningType":"THIN","id":13889,"hypervisorType":"KVM"}},"wait":0}}] }

And then 3 minutes later we get the agent disconnected (virsh is stuck, even "virsh list" doesn't work). Nothing special in the libvirt logs...

After this the volume still exists on CEPH, but I believe it is later removed again via the purge thread in ACS (I don't remember deleting it manually) - which is actually very interesting: why does it (seem to) do an immediate volume deletion, when the volume is later removed again anyway (by the purge thread, I assume)?

Cheers

On 19 February 2018 at 12:55, Simon Weller <swel...@ena.com.invalid> wrote:

> Also these -
>
> https://github.com/myENA/cloudstack/pull/20/commits/1948ce5d24b87433ae9e8f4faebdfc20b56b751a
>
> https://github.com/myENA/cloudstack/pull/12/commits
>
> ________________________________
> From: Andrija Panic <andrija.pa...@gmail.com>
> Sent: Monday, February 19, 2018 5:23 AM
> To: dev
> Subject: Re: HA issues
>
> Hi Simon,
>
> a big thank you for this, will have our devs check this!
>
> Thanks!
>
> On 19 February 2018 at 09:02, Simon Weller <swel...@ena.com.invalid> wrote:
>
> > Andrija,
> >
> > We pushed quite a few PRs on the exception and lockup issues related to
> > Ceph in the agent.
> >
> > We have a PR for the deletion issue ("cleanup rbd image and rados context
> > even if exceptions are thrown in deletePhysicalDisk routine"). See if you
> > have it pulled into your release - https://github.com/myENA/cloudstack/pull/9
> >
> > - Si
> >
> > ________________________________
> > From: Andrija Panic <andrija.pa...@gmail.com>
> > Sent: Saturday, February 17, 2018 1:49 PM
> > To: dev
> > Subject: Re: HA issues
> >
> > Hi Sean,
> >
> > (we have 2 threads interleaving on the libvirt lockd...) - so, did you
> > manage to understand what can cause the agent disconnect in most cases
> > for you specifically? Is there any software (CloudStack) root cause
> > (disregarding e.g. networking issues etc.)?
> >
> > Just our examples, which you probably don't have:
> >
> > We had a CEPH cluster running (with ACS), and there any exception in librbd
> > would crash the JVM and the agent, but this has mostly been fixed.
> > Now we get, e.g., an
> > agent disconnect when ACS tries to delete a volume on CEPH (and
> > for some reason does not succeed within 30 minutes, so the volume deletion fails) -
> > then libvirt gets completely stuck (even "virsh list" doesn't work)... so the agent
> > gets disconnected eventually.
> >
> > It would be good to get rid of agent disconnections in general, obviously
> > :) so that is why I'm asking (you are on NFS, so I would like to see your
> > experience here).
> >
> > Thanks
> >
> > On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
> >
> > > We were in the same situation as Nux.
> > >
> > > In our test environment we hit the issue with VMs not getting fenced and
> > > coming up on two hosts because of VM HA. However, we updated some of the
> > > logic for VM HA and turned on libvirtd's locking mechanism. Now we are
> > > working great w/o IPMI. The locking stops the VMs from starting elsewhere,
> > > and everything recovers very nicely when the host starts responding again.
> > >
> > > We are on 4.9.3 and haven't started testing with 4.11 yet, but it may work
> > > alongside IPMI just fine - it would just affect the fencing.
> > > However, we *currently* prefer how we are doing it now, because if the
> > > agent stops responding but the host is still up, the VMs continue running
> > > and no actual downtime is incurred. Even when VM HA attempts to power on
> > > the VMs on another host, it just fails the power-up and the VMs continue to
> > > run on the "agent disconnected" host. The host goes into alarm state and
> > > our NOC can look into what is wrong with the agent on the host. If IPMI were
> > > enabled, it sounds like it would power off the host (fence) and force
> > > downtime for us even if the VMs were actually running OK and just the
> > > agent was unreachable.
> > >
> > > I plan on submitting our updates via a pull request at some point. But I
> > > can also send the updated code to anyone that wants to do some testing
> > > before then.
> > >
> > > -----Original Message-----
> > > From: Marcus [mailto:shadow...@gmail.com]
> > > Sent: Friday, February 16, 2018 11:27 AM
> > > To: dev@cloudstack.apache.org
> > > Subject: Re: HA issues
> > >
> > > From your other emails it sounds as though you do not have IPMI
> > > configured, nor host HA enabled, correct? In this case, the correct thing
> > > to do is nothing. If CloudStack cannot guarantee the VM state (as is the
> > > case with an unreachable hypervisor), it should do nothing, for fear of
> > > causing a split brain and corrupting the VM disk (VM running on two hosts).
> > >
> > > Clustering and fencing is a tricky proposition. When CloudStack (or any
> > > other cluster manager) is not configured to or cannot guarantee state, then
> > > things will simply lock up; in this case your HA VM on your broken
> > > hypervisor will not run elsewhere. This has been the case for a long time
> > > with CloudStack: HA would only start a VM after the original hypervisor
> > > agent came back and reported no VM was running.
> > >
> > > The new feature, from what I gather, simply adds the possibility of
> > > CloudStack being able to reach out and shut down the hypervisor to
> > > guarantee state. At that point it can start the VM elsewhere. If something
> > > fails in that process (IPMI unreachable, for example, or bad credentials),
> > > you're still going to be stuck with a VM not coming back.
> > >
> > > It's the nature of the thing.
> > > I'd be wary of any HA solution that does not
> > > reach out and guarantee state via host or storage fencing before starting a
> > > VM elsewhere, as it will be making assumptions. It's entirely possible a VM
> > > might be unreachable or unable to access its storage for a short while, a
> > > new instance of the VM is started elsewhere, and the original VM comes back.
> > >
> > > On Wed, Jan 17, 2018 at 9:02 AM Nux! <n...@li.nux.ro> wrote:
> > >
> > > > Hi Rohit,
> > > >
> > > > I've reinstalled and tested. Still no go with VM HA.
> > > >
> > > > What I did was to kernel panic that particular HV ("echo c >
> > > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > > What happened next is the HV got marked as "Alert", the VM on it was
> > > > marked as "Running" the whole time, and it was not migrated to another HV.
> > > > Once the panicked HV had booted back up, the VM rebooted and became available.
> > > >
> > > > I'm running CentOS 7 for mgmt + HVs, and NFS primary and secondary storage.
> > > > The VM has an HA-enabled service offering.
> > > > Host HA or OOBM configuration was not touched.
> > > >
> > > > Full log: http://tmp.nux.ro/W3s-management-server.log
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > > > From: "Rohit Yadav" <rohit.ya...@shapeblue.com>
> > > > > To: "dev" <dev@cloudstack.apache.org>
> > > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > > Subject: Re: HA issues
> > > > >
> > > > > I performed VM HA sanity checks and was not able to reproduce any regression
> > > > > against two KVM CentOS7 hosts in a cluster.
> > > > >
> > > > > Without the "Host HA" feature, I deployed a few HA-enabled VMs on KVM host2 and
> > > > > killed it (powered off). After a few minutes of CloudStack attempting to find out
> > > > > why the host (KVM agent) timed out, CloudStack kicked off the investigators, which
> > > > > eventually led the KVM fencers to do their work, and the VM HA job kicked in to
> > > > > start those few VMs on host1; KVM host2 was put into the "Down" state.
> > > > >
> > > > > - Rohit
> > > > >
> > > > > ________________________________
> > > > > From: Rohit Yadav
> > > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > > To: dev
> > > > > Subject: Re: HA issues
> > > > >
> > > > > Hi Lucian,
> > > > >
> > > > > The "Host HA" feature is entirely different from VM HA; however, they may work
> > > > > in tandem, so please stop using the terms interchangeably, as it may cause the
> > > > > community to believe a regression has been caused.
> > > > >
> > > > > The "Host HA" feature currently ships with only a "Host HA" provider for KVM that
> > > > > is strictly tied to out-of-band management (IPMI for fencing, i.e. power off, and
> > > > > recovery, i.e. reboot) and NFS (as primary storage). (We also have a provider
> > > > > for the simulator, but that's for coverage/testing purposes.)
> > > > >
> > > > > Therefore, "Host HA" for KVM (+NFS) currently works only when OOBM is enabled.
> > > > > The framework allows interested parties to write their own HA providers for a
> > > > > hypervisor, using a different strategy/mechanism for fencing/recovery of hosts
> > > > > (including writing a non-IPMI-based OOBM plugin) and a host/disk activity
> > > > > checker that is not NFS based.
> > > > >
> > > > > The "Host HA" feature ships disabled by default and does not cause any
> > > > > interference with VM HA. However, when enabled and configured correctly, it is
> > > > > a known limitation that when it is unable to successfully perform recovery or
> > > > > fencing tasks it may not trigger VM HA. We can discuss how to handle such cases
> > > > > (thoughts?). "Host HA" will try a couple of times to recover and, failing to do
> > > > > so, it will eventually trigger a host fencing task. If it's unable to fence a
> > > > > host, it will indefinitely attempt to fence the host (the host will be stuck at
> > > > > the fencing state in the cloud.ha_config table, for example) and alerts will be
> > > > > sent to the admin, who can do some manual intervention to handle such situations
> > > > > (if you've got email/SMTP enabled, you should see alert emails).
> > > > >
> > > > > We can discuss how to improve this and have a workaround for the case you've
> > > > > hit; thanks for sharing.
> > > > >
> > > > > - Rohit
> > > > >
> > > > > ________________________________
> > > > > From: Nux! <n...@li.nux.ro>
> > > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > > To: dev
> > > > > Subject: Re: HA issues
> > > > >
> > > > > Ok, reinstalled and re-tested.
> > > > >
> > > > > What I've learned:
> > > > >
> > > > > - HA only works now if OOB is configured; the old way of HA no longer applies -
> > > > > this can be good and bad, not everyone has IPMIs
> > > > >
> > > > > - HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed
> > > > > to do its thing, leaving me with a HV down along with all the VMs running
> > > > > there. That's bad.
> > > > > I've opened this ticket for it:
> > > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > > >
> > > > > Let me know if you need any extra info or stuff to test.
> > > > >
> > > > > Regards,
> > > > > Lucian
> > > > >
> > > > > --
> > > > > Sent from the Delta quadrant using Borg technology!
> > > > >
> > > > > Nux!
> > > > > www.nux.ro
> > > > >
> > > > > ----- Original Message -----
> > > > >> From: "Nux!" <n...@li.nux.ro>
> > > > >> To: "dev" <dev@cloudstack.apache.org>
> > > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > > >> Subject: Re: HA issues
> > > > >
> > > > >> I'll reinstall my setup and try again, just to be sure I'm working on a clean
> > > > >> slate.
> > > > >>
> > > > >> --
> > > > >> Sent from the Delta quadrant using Borg technology!
> > > > >>
> > > > >> Nux!
> > > > >> www.nux.ro
> > > > >>
> > > > >> ----- Original Message -----
> > > > >>> From: "Rohit Yadav" <rohit.ya...@shapeblue.com>
> > > > >>> To: "dev" <dev@cloudstack.apache.org>
> > > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > > >>> Subject: Re: HA issues
> > > > >>
> > > > >>> Hi Lucian,
> > > > >>>
> > > > >>> If you're talking about the new Host HA feature (with KVM+NFS+IPMI), please
> > > > >>> refer to the following docs:
> > > > >>>
> > > > >>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
> > > > >>>
> > > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > > >>>
> > > > >>> We'll need you to look at the logs - perhaps create a JIRA ticket with the
> > > > >>> logs and details? If you saw an IPMI-based reboot, then Host HA indeed tried
> > > > >>> to recover, i.e. reboot the host; once Host HA has done its work it will
> > > > >>> schedule HA for the VMs as soon as the recovery operation succeeds (we have
> > > > >>> simulator and KVM based Marvin tests for such scenarios).
> > > > >>>
> > > > >>> Can you see it making an attempt to schedule VM HA in the logs, or any failure?
> > > > >>>
> > > > >>> - Rohit
> > > > >>>
> > > > >>> ________________________________
> > > > >>> From: Nux! <n...@li.nux.ro>
> > > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > > >>> To: dev
> > > > >>> Subject: [4.11] HA issues
> > > > >>>
> > > > >>> Hi,
> > > > >>>
> > > > >>> I see there's a new HA engine for KVM and IPMI support, which is really nice;
> > > > >>> however it seems hit and miss.
> > > > >>> I have created an instance with an HA offering and kernel panicked one of the
> > > > >>> hypervisors - after a while the server was rebooted (via IPMI probably), but
> > > > >>> the instance never moved to a running hypervisor, and even after the original
> > > > >>> hypervisor came back it was still left in the Stopped state.
> > > > >>> Are there any extra things I need to set up to have proper HA?
> > > > >>>
> > > > >>> Regards,
> > > > >>> Lucian
> > > > >>>
> > > > >>> --
> > > > >>> Sent from the Delta quadrant using Borg technology!
> > > > >>>
> > > > >>> Nux!
> > > > >>> www.nux.ro
> > > >
> > > > --
> > > > Andrija Panić
> >
> > --
> > Andrija Panić

--
Andrija Panić
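P.S. Sean - regarding the libvirtd locking mechanism you mention below: for anyone else who wants to try it, I believe the rough setup is something like the following (untested on our side, values are just the libvirt defaults, so treat it as a sketch):

  # /etc/libvirt/qemu.conf
  lock_manager = "lockd"

  # /etc/libvirt/qemu-lockd.conf
  # auto_disk_leases makes virtlockd acquire a lease per disk automatically;
  # for cross-host protection the lockspace dir must live on shared storage
  # (e.g. a directory on the NFS primary storage) visible to all hosts
  auto_disk_leases = 1
  file_lockspace_dir = "/var/lib/libvirt/lockd/files"

then enable/restart virtlockd and libvirtd on every host (systemctl enable --now virtlockd; systemctl restart libvirtd). Corrections welcome if your setup differs.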