I'm now rebasing the VMsync work onto master; I'll send a merge request once I'm done.
Kelven

On 10/2/13 4:31 PM, "Chiradeep Vittal" <chiradeep.vit...@citrix.com> wrote:

>My bad. I thought this was merged into master, but it isn't.
>
>On 10/2/13 4:24 PM, "David Nalley" <da...@gnsa.us> wrote:
>
>>Why is the work happening in master?
>>
>>
>>On Wed, Oct 2, 2013 at 7:09 PM, Chiradeep Vittal
>><chiradeep.vit...@citrix.com> wrote:
>>> Perhaps as a result of this work:
>>> https://cwiki.apache.org/confluence/x/tYvlAQ
>>> I think Kelven is trying to separate the job state (starting, stopping)
>>> from the actual VM state.
>>>
>>> On 10/2/13 3:36 PM, "Darren Shepherd" <darren.s.sheph...@gmail.com>
>>> wrote:
>>>
>>>>Alex,
>>>>
>>>>In scheduleRestart(), when it calls _itMgr.advanceStop(), it used to
>>>>pass the VO; now it passes a UUID. So the VO the HA manager holds is
>>>>out of sync with the DB, the recorded previous state and update count
>>>>are wrong, and HA will just stop the VM in the worker.
>>>>
>>>>I really think the update-count approach is far too fragile. For
>>>>example, currently if you try to start a VM and it fails, the update
>>>>count will change, but the current code records the new update count,
>>>>so the next try has the updated count. I can see the following issue
>>>>(maybe there's some workaround for it): imagine you have a large
>>>>failure and the stuff really hits the fan. You have thousands of HA
>>>>jobs trying to run and things just keep failing, so to stop the churn
>>>>you shut down the mgmt stack to figure out what's up with the
>>>>infrastructure. There's a really good chance that you kill the mgmt
>>>>stack while a VM is in Starting. Now the ha_work update count is out
>>>>of sync with the current DB, and when you bring the mgmt stack back
>>>>up, it won't try to restart that VM.
>>>>
>>>>Maybe that situation is taken care of somehow, but I could probably
>>>>dream up another one. I think it is far simpler that when a user
>>>>starts a VM, you record in the vm_instance table, in a new column,
>>>>"should be running"; then when the HA worker processes the record, it
>>>>will always know the VM should be running. If the user does a stop,
>>>>you clear that column. This has the added benefit that when things
>>>>are bad and a user starts clicking restart/start, they won't mess
>>>>with HA. Maybe things have changed, but before, what I would see is
>>>>that we'd have an issue where VMs should be started but weren't. HA
>>>>was trying, but it kept failing. The user would log in, see their VM
>>>>was down, and click start. That would fail too (similar to how HA was
>>>>failing), so the VM would stay in Stopped; but since they touched the
>>>>VM, the update count changed, and HA wouldn't start it back up when
>>>>the infra worked again. So customers who proactively tried to do
>>>>something got penalized: their downtime was longer because CloudStack
>>>>wouldn't bring their VM back up like the other VMs.
>>>>
>>>>Darren
>>>
>
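
For illustration, here is a minimal sketch of the update-count guard Darren
is criticizing. All names (VmDao, VmRecord, HaWorkItem, recordedUpdateCount)
are hypothetical stand-ins, not the actual CloudStack HA classes:

    // Sketch of an optimistic update-count check: the HA worker only
    // restarts a VM if nothing has touched the row since the work item
    // was scheduled. Hypothetical names, not the real CloudStack code.
    interface VmDao { VmRecord findById(long id); }

    class VmRecord {
        long id;
        long updateCount;   // bumped on every state change, incl. failed starts
        String state;       // "Running", "Stopped", "Starting", ...
    }

    class HaWorkItem {
        long vmId;
        long recordedUpdateCount; // snapshot taken when the HA work was queued
    }

    class HaWorker {
        private final VmDao vmDao;
        HaWorker(VmDao vmDao) { this.vmDao = vmDao; }

        void process(HaWorkItem work) {
            VmRecord vm = vmDao.findById(work.vmId);
            if (vm.updateCount != work.recordedUpdateCount) {
                // The row changed since the work was queued: a user click,
                // a failed start attempt, or a mgmt-server shutdown while
                // the VM was in Starting all bump the count. HA backs off,
                // even if the VM is still down and should come back up --
                // the fragility Darren is pointing at.
                return;
            }
            startVm(vm); // counts match: safe to attempt the restart
        }

        void startVm(VmRecord vm) { /* place and start the VM on a host */ }
    }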
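And a sketch of the alternative Darren proposes: persist the user's intent
in a new vm_instance column and let HA converge on it. Again, the names
(VmIntentDao, IntentBasedHaWorker, shouldBeRunning) are made up for
illustration; this was a proposal in the thread, not shipped code:

    // User-facing start/stop record intent in the DB row, so the flag
    // survives failed start attempts and mgmt-server restarts.
    interface VmIntentDao {
        Vm findById(long id);
        void setShouldBeRunning(long id, boolean shouldRun);
    }

    class Vm {
        long id;
        String state;            // "Running", "Stopped", ...
        boolean shouldBeRunning; // the new column Darren suggests
    }

    class IntentBasedHaWorker {
        private final VmIntentDao dao;
        IntentBasedHaWorker(VmIntentDao dao) { this.dao = dao; }

        void userStart(long vmId) { dao.setShouldBeRunning(vmId, true); }
        void userStop(long vmId)  { dao.setShouldBeRunning(vmId, false); }

        // The worker converges on intent: it keeps retrying as long as
        // the VM should be running but isn't, no matter who touched the
        // row in the meantime.
        void process(long vmId) {
            Vm vm = dao.findById(vmId);
            if (vm.shouldBeRunning && !"Running".equals(vm.state)) {
                start(vm);
            }
        }

        void start(Vm vm) { /* attempt start; retried on the next HA pass */ }
    }

With intent stored in the row, a user's failed manual start no longer
changes what HA is supposed to do; the worker simply retries until the VM's
state matches the recorded intent.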