I have had similar issues at least two or three times. The host agent would 
disconnect from the management server and would not reconnect without manual 
intervention; however, it would happily continue running the vms. The 
management server would then initiate HA and fire up vms that were already 
running on the disconnected host. I ended up with a handful of vms and virtual 
routers running on two hypervisors at once, thus corrupting the disks and 
causing all sorts of issues ((( . 

I think there has to be a better way of dealing with this case, at least at 
the image level. Perhaps a host should keep some sort of lock file, or a file 
for every image in which it records a time stamp. Something like: 

f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and 
f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp 

Here, f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image and 
f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the file holding the image's 
time stamp. 

The hypervisor should record the time stamp in this file while the vm is 
running. Let's say every 5-10 seconds. If the timestamp is old, we can assume 
that the volume is no longer used by the hypervisor. 

When a vm is started, the timestamp file should be checked. If the timestamp 
is recent, the vm should not start; otherwise, the vm should start and the 
timestamp file should be updated regularly. 
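
A rough sketch of the idea in Python, just to illustrate it. Nothing here is 
CloudStack code: the function names, the 5 second interval and the 30 second 
staleness threshold are all assumptions of mine, and it assumes the timestamp 
file sits next to the image on the shared storage and that the hosts' clocks 
are in sync. 

```python
import os
import time

HEARTBEAT_INTERVAL = 5   # seconds between timestamp updates (assumed value)
STALE_AFTER = 30         # age in seconds after which a timestamp is "old" (assumed value)


def timestamp_path(image_path):
    """Path of the timestamp file kept next to the disk image."""
    return image_path + "-timestamp"


def write_heartbeat(image_path, now=None):
    """Record the current time for an image.

    The hypervisor would call this every HEARTBEAT_INTERVAL seconds
    while the vm using the image is running.
    """
    now = time.time() if now is None else now
    tmp = timestamp_path(image_path) + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(now))
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a reader never sees a half-written timestamp.
    os.replace(tmp, timestamp_path(image_path))


def image_in_use(image_path, now=None):
    """Return True if the image's timestamp is recent."""
    now = time.time() if now is None else now
    try:
        with open(timestamp_path(image_path)) as f:
            last = float(f.read().strip())
    except FileNotFoundError:
        return False   # no timestamp file: nobody has claimed the image
    except ValueError:
        return True    # unreadable timestamp: be conservative, refuse to start
    return now - last < STALE_AFTER


def may_start_vm(image_path, now=None):
    """Check to run before starting a vm on this image."""
    return not image_in_use(image_path, now)
```

Note that this only narrows the window. Two hosts could still both pass the 
check at the same moment, so it is not a replacement for proper fencing or a 
real lock on the storage side. 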

I am sure there are better ways of doing this, but at least this method should 
not allow two vms running on different hosts to use the same volume and corrupt 
the data. 

In Ceph, as far as I remember, a new feature is being developed to provide a 
locking mechanism for RBD images. Not sure if this will do the job? 

Andrei 

----- Original Message -----

> From: "Wido den Hollander" <w...@widodh.nl>
> To: dev@cloudstack.apache.org
> Sent: Monday, 16 February, 2015 11:32:13 AM
> Subject: Re: Disable HA temporary ?

> On 16-02-15 11:00, Andrija Panic wrote:
> > Hi team,
> >
> > I just had funny behaviour a few days ago - one of my hosts was under
> > heavy load (some disk/network load) and it got disconnected from the
> > MGMT server.
> >
> > Then the MGMT server started doing the HA thing, but without being able
> > to make sure that the VMs on the disconnected host were really shut down
> > (and they were NOT).
> >
> > So MGMT started some VMs again on other hosts, thus resulting in having
> > 2 copies of the same VM, using shared storage - so corruption happened
> > on the disk.
> >
> > Is there a way to temporarily disable the HA feature at the global
> > level, or anything similar ?

> Not that I'm aware of, but this is something I also ran into a couple
> of times.

> It would indeed be nice if there could be a way to stop the HA process
> completely as an Admin.

> Wido

> > Thanks
> >
