Hi Marc,
I like the idea. I think a locking service is needed in CloudStack not only to solve locking itself and get rid of the DB-based locks (which, if we can remove them, may help people migrate to MySQL clusters with an active-active setup, something that currently cannot be used because of the LOCK usage), but also to fix the issue of claim and ownership, i.e. which management server owns which resource (hosts, VMs, volumes, etc.). To keep CloudStack a turnkey/standalone solution, embedded ZooKeeper could be used for this purpose, and the new CA framework, where applicable, could be used to secure a cluster of management servers running the ZK plugin/service. This will also require refactoring the job manager/service layer to be locking-service aware. A general, pluggable locking-service manager could be implemented for this purpose, supporting plugins, with a default plugin that is (embedded) ZK based.

With the agent-management server model, CloudStack agents such as the KVM, SSVM and CPVM agents currently have only a single management server 'host' IP they connect to. With the introduction of the CA framework I had tried to change this to a list of hosts/IPs that an agent tries on disconnection (say, a management server shutdown), and as mentioned there is PR 2309 that further improves/introduces a way of balancing. To solve the balancing (claim + ownership) of agents across the cluster of management servers, a locking service such as ZK could help; the locking service/manager could also trigger events such as rebalancing of tasks. We may also explore Gossip and other ways of discovery, propagation and rebalancing of agents with the new locking service/manager. I'm excited to see your attempt at solving the problem.

- Rohit

________________________________
From: Marc-Aurèle Brothier <ma...@exoscale.ch>
Sent: Monday, December 18, 2017 7:26:21 PM
To: dev@cloudstack.apache.org
Subject: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs

Hi everyone,

Another point, another thread. Currently, when shutting down a management server (whose "stop()" methods are, as far as I know, not being called), the server could be in the middle of processing an async job task. This leads to a failed job, since the response won't be delivered to the correct management server even though the job might have succeeded on the agent. To overcome this limitation, driven by our weekly production upgrades, we added a pre-shutdown mechanism that works alongside HAProxy. The management server keeps an eye on a file "lb-agent" into which keywords can be written, following the HAProxy agent-check guide (https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-check). When it finds "maint", "stopped" or "drain", it stops these threads:

- AsyncJobManager._heartbeatScheduler: responsible for fetching and starting execution of AsyncJobs
- AlertManagerImpl._timer: responsible for sending capacity check commands
- StatsCollector._executor: responsible for scheduling stats commands

With those threads stopped, the management server has halted most of its scheduled tasks. The correct thing to do before shutting down the server would be to send "rebalance/reconnect" commands to all agents connected to that management server, to ensure that commands won't go through this server at all. HAProxy, for its part, stops sending API requests to the corresponding server with the help of this local agent check. If you want to cancel the maintenance shutdown, you can write "up/ready" in the file and the different schedulers will be restarted.
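To make the agent-check mechanism described above concrete, here is a minimal sketch of a poller that reads the "lb-agent" file and pauses or resumes the schedulers. It assumes Java 11+; the class name MaintenanceFileMonitor, the file location, the polling interval and the pauseSchedulers/resumeSchedulers hooks are illustrative assumptions, not the actual patch.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: poll the HAProxy agent-check file and pause or
// resume the management server's schedulers based on the keyword found.
public class MaintenanceFileMonitor {

    private static final Set<String> DRAIN_KEYWORDS = Set.of("maint", "stopped", "drain");
    private static final Set<String> RESUME_KEYWORDS = Set.of("up", "ready");

    private final Path agentFile = Paths.get("/var/lib/cloudstack/lb-agent"); // assumed location
    private final ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean draining = false;

    public void start() {
        poller.scheduleWithFixedDelay(this::checkFile, 10, 10, TimeUnit.SECONDS);
    }

    private void checkFile() {
        try {
            if (!Files.exists(agentFile)) {
                return;
            }
            String keyword = Files.readString(agentFile, StandardCharsets.UTF_8)
                    .trim().toLowerCase(Locale.ROOT);
            if (DRAIN_KEYWORDS.contains(keyword) && !draining) {
                draining = true;
                pauseSchedulers();   // e.g. the async-job heartbeat, alert timer and stats executor
            } else if (RESUME_KEYWORDS.contains(keyword) && draining) {
                draining = false;
                resumeSchedulers();  // maintenance cancelled, restart the schedulers
            }
        } catch (IOException e) {
            // Log and keep polling; a transient read error should not flip the state.
        }
    }

    // Placeholders for wiring into AsyncJobManager, AlertManagerImpl and StatsCollector.
    private void pauseSchedulers() { /* ... */ }
    private void resumeSchedulers() { /* ... */ }
}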
This is really more of an operational change around CloudStack for people doing live upgrades on a regular basis, so I'm unsure whether the community would want such a change in the code base. It also goes a bit in the opposite direction of the change removing the need for HAProxy (https://github.com/apache/cloudstack/pull/2309). If there is enough positive feedback for such a change, I will port it to match the upstream branch in a PR.

Kind regards,
Marc-Aurèle
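To make Rohit's suggestion at the top of the thread (a pluggable locking service with an embedded-ZooKeeper default) a bit more concrete, here is a minimal sketch using Apache Curator's InterProcessMutex recipe. The interface ClusterLockService, the class ZkClusterLockService and the /cloudstack/locks znode path are hypothetical illustrations, assuming Curator is on the classpath; this is not an existing CloudStack API.

import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical pluggable locking-service contract: a DB-based plugin could keep
// the current behaviour, while a ZooKeeper plugin becomes the default.
interface ClusterLockService {
    // Returns a handle that releases the lock when closed, or null if the lock
    // could not be acquired within the timeout.
    AutoCloseable acquire(String resourceId, long timeout, TimeUnit unit) throws Exception;
}

// Sketch of a ZooKeeper-backed plugin using Apache Curator.
class ZkClusterLockService implements ClusterLockService {
    private final CuratorFramework client;

    ZkClusterLockService(String zkConnectString) {
        client = CuratorFrameworkFactory.newClient(zkConnectString,
                new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    @Override
    public AutoCloseable acquire(String resourceId, long timeout, TimeUnit unit) throws Exception {
        // One znode per resource, e.g. /cloudstack/locks/host-42 (path layout is illustrative).
        InterProcessMutex mutex = new InterProcessMutex(client, "/cloudstack/locks/" + resourceId);
        if (!mutex.acquire(timeout, unit)) {
            return null;
        }
        return mutex::release;
    }
}

A caller could then claim a resource with try (AutoCloseable lock = lockService.acquire("host-42", 5, TimeUnit.SECONDS)) { ... }; because Curator locks are built on ephemeral znodes, the claim disappears automatically if the owning management server's ZK session dies, which is what would allow rebalancing events to be triggered.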