Alex,

In scheduleRestart(), the call to _itMgr.advanceStop() used to pass the VO; now it passes a UUID. Because advanceStop() no longer updates the VO in place, the copy the HA manager holds is out of sync with the DB, so the previous state and update count it records are wrong, and HA will just stop the VM in the worker.
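I'd guess the fix is just to reload the VO after the stop. Something like this; a rough sketch only, the exact names may not match what's in the code now:

    // In scheduleRestart(), after stopping via UUID, the local VO is stale.
    // Reload it so the HA work record reflects what is actually in the DB.
    _itMgr.advanceStop(vm.getUuid(), false);
    vm = _instanceDao.findById(vm.getId()); // refresh from the vm_instance table
    // vm.getState() and vm.getUpdated() now match the post-stop row, so the
    // previous state and update count recorded for the HA worker are correct.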
I really think the update-count approach is far too fragile. For example, right now if you try to start a VM and it fails, the update count changes; the current code records the new count, so the next try has the updated value. But I can see the following issue (maybe there's a workaround for it). Imagine you have a large failure and the stuff really hits the fan: thousands of HA jobs are trying to run and they just keep failing. To stop the churn, you shut down the mgmt stack to figure out what's up with the infrastructure. There's a really good chance you'd kill the mgmt stack while a VM was in Starting. Now the hawork update count is out of sync with the DB, and when you bring the mgmt stack back up, it won't try to restart that VM. Maybe that situation is taken care of somehow, but I could probably dream up another one.

I think it's far simpler that when a user starts a VM, you record it in a new column in the vm_instance table, something like "should be running". Then when the HA worker processes the record, it always knows the VM should be running. If the user does a stop, you clear that column (rough sketch in the P.S. below). This has the added benefit that when things are bad and a user starts clicking restart/start, they won't mess with HA.

Maybe things have changed, but what I used to see is this: we'd have an issue, so VMs should have been started but weren't, and HA was trying but kept failing. A user would log in, see their VM was down, and click start. That would fail too (for the same reason HA was failing), so the VM stayed stopped; but since they had touched the VM, the update count changed, and HA wouldn't start it back up once the infra was working again. So customers who proactively tried to do something got penalized with longer downtime, because CloudStack wouldn't bring their VM back up like the other VMs.

Darren
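P.S. Here's a rough sketch of the "should be running" idea, just to make it concrete. The column and accessor names are made up for illustration; none of this exists today:

    // User-initiated start: record the intent in vm_instance.
    vm.setShouldBeRunning(true);  // hypothetical new column
    _instanceDao.update(vm.getId(), vm);

    // User-initiated stop: clear the intent.
    vm.setShouldBeRunning(false);
    _instanceDao.update(vm.getId(), vm);

    // HA worker: no update-count comparison at all. If the user said the
    // VM should be running and it isn't, keep trying to restart it.
    VMInstanceVO target = _instanceDao.findById(work.getInstanceId());
    if (target.isShouldBeRunning()
            && target.getState() != VirtualMachine.State.Running) {
        restartVm(target);  // hypothetical helper; retries survive mgmt restarts
    }

The point is that the worker's decision depends only on the current row in the DB, so a mgmt stack restart, a failed user start, or any other churn can't make HA forget that the VM is supposed to be up.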