Seems to me that I'm about to issue something similar to: update cloud.vm_instance set ha = 0 where ha = 1...
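(If anyone really wants to go down that road, here is a minimal sketch of the idea - completely unsupported, it pokes the management database directly, and the column names (ha_enabled, instance_name, removed) as well as the pymysql dependency are my assumptions, so check them against your schema, stop the management server and back up the DB first.)

# Unsupported sketch: preview and flip the per-VM HA flag straight in the
# management database. Column names and the pymysql dependency are
# assumptions - verify against your own schema before running anything.
import pymysql

conn = pymysql.connect(host="localhost", user="cloud",
                       password="secret", database="cloud")
try:
    with conn.cursor() as cur:
        # Preview which (non-removed) VMs would be affected.
        cur.execute("SELECT id, instance_name FROM vm_instance "
                    "WHERE ha_enabled = 1 AND removed IS NULL")
        for vm_id, name in cur.fetchall():
            print(vm_id, name)
        # The actual change, left commented out on purpose:
        # cur.execute("UPDATE vm_instance SET ha_enabled = 0 "
        #             "WHERE ha_enabled = 1 AND removed IS NULL")
        # conn.commit()
finally:
    conn.close()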
Now seriously, I'm wondering, per the manual: if you define the HA host tag at the global config level and then have NO hosts carrying that tag, the MGMT server will not be able to start VMs on other hosts, since there are no hosts dedicated as an HA destination? Does this make sense? I guess the VMs would just be marked as Stopped in the GUI/database, but we'd be unable to start them... Stupid proposal, but... ?

On 16 February 2015 at 16:22, Logan Barfield <lbarfi...@tqhosting.com> wrote:
> Some sort of fencing independent of the management server is definitely needed. HA in general (particularly on KVM) is all kinds of unpredictable/buggy right now.
>
> I like the idea of having a switch that an admin can flip to stop HA. In fact I think a better job control system in general (e.g., being able to stop/restart/manually start tasks) would be awesome, if it's feasible.
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
>
> On Mon, Feb 16, 2015 at 10:05 AM, Wido den Hollander <w...@widodh.nl> wrote:
> >
> >
> > On 16-02-15 13:16, Andrei Mikhailovsky wrote:
> >> I had similar issues at least two or three times. The host agent would disconnect from the management server. The agent would not reconnect to the management server without manual intervention; however, it would happily continue running the VMs. The management server would initiate HA and fire up VMs which were already running on the disconnected host. I ended up with a handful of VMs and virtual routers running on two hypervisors, thus corrupting the disks and having all sorts of issues (((.
> >>
> >> I think there has to be a better way of dealing with this case, at least on an image level. Perhaps a host should keep some sort of lock file, or a file for every image where it would record a time stamp. Something like:
> >>
> >> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and
> >> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp
> >>
> >> Thus, f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's time stamp.
> >>
> >> The hypervisor should record the time stamp in this file while the VM is running, let's say every 5-10 seconds. If the timestamp is old, we can assume that the volume is no longer used by the hypervisor.
> >>
> >> When a VM is started, the timestamp file should be checked, and if the timestamp is recent, the VM should not start; otherwise, the VM should start and the timestamp file should be regularly updated.
> >>
> >> I am sure there are better ways of doing this, but at least this method would not allow two VMs running on different hosts to use the same volume and corrupt the data.
> >>
> >> In Ceph, as far as I remember, a new feature is being developed to provide a locking mechanism for an RBD image. Not sure if this will do the job?
> >>
> >
> > Something like this is still on my wishlist for Ceph/RBD, something along the lines of what you propose.
> >
> > For NFS we currently have this in place, but for Ceph/RBD we don't. It's a matter of code in the Agent and the investigators inside the Management Server, which decide whether HA should kick in.
> >
> > Wido
> >
> >> Andrei
> >>
> >> ----- Original Message -----
> >>
> >>> From: "Wido den Hollander" <w...@widodh.nl>
> >>> To: dev@cloudstack.apache.org
> >>> Sent: Monday, 16 February, 2015 11:32:13 AM
> >>> Subject: Re: Disable HA temporary ?
> >>
> >>> On 16-02-15 11:00, Andrija Panic wrote:
> >>>> Hi team,
> >>>>
> >>>> I just had a funny behaviour a few days ago - one of my hosts was under heavy load (some disk/network load) and it got disconnected from the MGMT server.
> >>>>
> >>>> Then the MGMT server started doing the HA thing, but without being able to make sure that the VMs on the disconnected host were really shut down (and they were NOT).
> >>>>
> >>>> So MGMT started some VMs again on other hosts, thus resulting in having 2 copies of the same VM using shared storage - so corruption happened on the disks.
> >>>>
> >>>> Is there a way to temporarily disable the HA feature on a global level, or anything similar?
> >>
> >>> Not that I'm aware of, but this is something I also ran into a couple of times.
> >>
> >>> It would indeed be nice if there were a way to stop the HA process completely as an Admin.
> >>
> >>> Wido
> >>
> >>>> Thanks
> >>>>
>

--
Andrija Panić
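P.S. To make the timestamp/lock-file idea from Andrei's mail above a bit more concrete, here is a rough sketch of how it could look on the hypervisor side. The file naming, the 10-second heartbeat interval and the 30-second staleness window are just my assumptions for illustration - nothing like this exists in the agent today:

# Sketch of the timestamp-file fencing idea from Andrei's mail.
# Interval, staleness window and paths are illustrative assumptions.
import time

HEARTBEAT_INTERVAL = 10   # seconds between refreshes while the VM runs
STALE_AFTER = 30          # silence longer than this => volume assumed free


def timestamp_path(volume_path):
    # Companion file next to the image, e.g. <uuid> -> <uuid>-timestamp
    return volume_path + "-timestamp"


def touch_heartbeat(volume_path):
    # Refresh the time stamp; the host actually running the VM does this.
    with open(timestamp_path(volume_path), "w") as f:
        f.write(str(time.time()))


def heartbeat_loop(volume_path):
    # Background loop on the running host while the VM is up.
    while True:
        touch_heartbeat(volume_path)
        time.sleep(HEARTBEAT_INTERVAL)


def volume_in_use(volume_path, stale_after=STALE_AFTER):
    # Checked before starting the VM elsewhere: is another host still writing?
    try:
        with open(timestamp_path(volume_path)) as f:
            last_beat = float(f.read().strip())
    except (OSError, ValueError):
        return False  # no/unreadable timestamp file -> treat the volume as free
    return (time.time() - last_beat) < stale_after


# On the would-be destination host, before starting the VM:
# if volume_in_use("/mnt/primary/f5ffa8b0-d852-41c8-a386-6efb8241e2e7"):
#     raise RuntimeError("volume still heartbeating on another host - refusing to start")

In practice you would also have to deal with clock skew between hosts and write the timestamp file atomically, but that's the general shape of the check.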