I'm now rebasing the VMsync work onto master; I'll send a merge request once I'm done.
Kelven

On 10/2/13 4:31 PM, "Chiradeep Vittal" <chiradeep.vit...@citrix.com> wrote:

>My bad. I thought this was merged into master, but it isn't.
>
>On 10/2/13 4:24 PM, "David Nalley" <da...@gnsa.us> wrote:
>
>>Why is the work happening in master?
>>
>>
>>On Wed, Oct 2, 2013 at 7:09 PM, Chiradeep Vittal
>><chiradeep.vit...@citrix.com> wrote:
>>> Perhaps as a result of this work:
>>> https://cwiki.apache.org/confluence/x/tYvlAQ
>>> I think Kelven is trying to separate the job state (starting, stopping)
>>> from the actual VM state.
>>>
>>> On 10/2/13 3:36 PM, "Darren Shepherd" <darren.s.sheph...@gmail.com>
>>> wrote:
>>>
>>>>Alex,
>>>>
>>>>In scheduleRestart(), when it calls _itMgr.advanceStop(), it used to
>>>>pass the VO; now it passes a UUID. So the VO the HA manager holds is
>>>>out of sync with the DB, the recorded previous state and update count
>>>>are wrong, and HA will just stop the VM in the worker.
>>>>
>>>>I really think the update-count approach is far too fragile. For
>>>>example, currently if you try to start a VM and it fails, the update
>>>>count will change, but the current code records the new update count,
>>>>so the next try has the updated count. I can see the following issue
>>>>(maybe there's some workaround for it): imagine you have a large
>>>>failure and the stuff really hits the fan. You have thousands of HA
>>>>jobs trying to run and things just keep failing, so to stop the churn
>>>>you shut down the mgmt stack to figure out what's up with the
>>>>infrastructure. There's a really good chance that you kill the mgmt
>>>>stack while a VM is in Starting. Now the ha_work update count is out
>>>>of sync with the current DB, and when you bring the mgmt stack back
>>>>up, it won't try to restart that VM.
>>>>
>>>>Maybe that situation is taken care of somehow, but I could probably
>>>>dream up another one. I think it is far simpler that when a user
>>>>starts a VM, you record in the vm_instance table, in a new column,
>>>>"should be running"; then when the HA worker processes the record, it
>>>>will always know the VM should be running. If the user does a stop,
>>>>you clear that column. This has the added benefit that when things
>>>>are bad and a user starts clicking restart/start, they won't mess
>>>>with HA. Maybe things have changed, but before, what I would see is
>>>>that we'd have an issue where VMs should be started but weren't. HA
>>>>was trying, but it kept failing. The user would log in, see their VM
>>>>was down, and click start. That would fail too (similar to how HA was
>>>>failing), so the VM would stay in Stopped; but since they touched the
>>>>VM, the update count changed, and HA wouldn't start it back up when
>>>>the infra worked again. So customers who proactively tried to do
>>>>something got penalized: their downtime was longer because CloudStack
>>>>wouldn't bring their VM back up like the other VMs.
>>>>
>>>>Darren
>>>
>
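
For illustration, here is a minimal sketch of the update-count guard Darren
is criticizing. All names (VmDao, VmRecord, HaWorkItem, recordedUpdateCount)
are hypothetical stand-ins, not the actual CloudStack HA classes:

    // Sketch of an optimistic update-count check: the HA worker only
    // restarts a VM if nothing has touched the row since the work item
    // was scheduled. Hypothetical names, not the real CloudStack code.
    interface VmDao { VmRecord findById(long id); }

    class VmRecord {
        long id;
        long updateCount;   // bumped on every state change, incl. failed starts
        String state;       // "Running", "Stopped", "Starting", ...
    }

    class HaWorkItem {
        long vmId;
        long recordedUpdateCount; // snapshot taken when the HA work was queued
    }

    class HaWorker {
        private final VmDao vmDao;
        HaWorker(VmDao vmDao) { this.vmDao = vmDao; }

        void process(HaWorkItem work) {
            VmRecord vm = vmDao.findById(work.vmId);
            if (vm.updateCount != work.recordedUpdateCount) {
                // The row changed since the work was queued: a user click,
                // a failed start attempt, or a mgmt-server shutdown while
                // the VM was in Starting all bump the count. HA backs off,
                // even if the VM is still down and should come back up --
                // the fragility Darren is pointing at.
                return;
            }
            startVm(vm); // counts match: safe to attempt the restart
        }

        void startVm(VmRecord vm) { /* place and start the VM on a host */ }
    }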
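And a sketch of the alternative Darren proposes: persist the user's intent
in a new vm_instance column and let HA converge on it. Again, the names
(VmIntentDao, IntentBasedHaWorker, shouldBeRunning) are made up for
illustration; this was a proposal in the thread, not shipped code:

    // User-facing start/stop record intent in the DB row, so the flag
    // survives failed start attempts and mgmt-server restarts.
    interface VmIntentDao {
        Vm findById(long id);
        void setShouldBeRunning(long id, boolean shouldRun);
    }

    class Vm {
        long id;
        String state;            // "Running", "Stopped", ...
        boolean shouldBeRunning; // the new column Darren suggests
    }

    class IntentBasedHaWorker {
        private final VmIntentDao dao;
        IntentBasedHaWorker(VmIntentDao dao) { this.dao = dao; }

        void userStart(long vmId) { dao.setShouldBeRunning(vmId, true); }
        void userStop(long vmId)  { dao.setShouldBeRunning(vmId, false); }

        // The worker converges on intent: it keeps retrying as long as
        // the VM should be running but isn't, no matter who touched the
        // row in the meantime.
        void process(long vmId) {
            Vm vm = dao.findById(vmId);
            if (vm.shouldBeRunning && !"Running".equals(vm.state)) {
                start(vm);
            }
        }

        void start(Vm vm) { /* attempt start; retried on the next HA pass */ }
    }

With intent stored in the row, a user's failed manual start no longer
changes what HA is supposed to do; the worker simply retries until the VM's
state matches the recorded intent.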