Hi Marc-Aurèle,

Personally, my utopia would be the ability to pass async jobs between
management servers. Rather than waiting an indeterminate time for a snapshot
to complete, the server being shut down would hand monitoring of the job over
to another management server.

I would LOVE for something like ZooKeeper to monitor the state of the
management servers, so that the other management servers could take over the
async jobs in the (unlikely) event that a management server becomes
unavailable.



Kind regards,

Paul Angus

paul.an...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS, UK
@shapeblue
  
 


-----Original Message-----
From: Marc-Aurèle Brothier [mailto:ma...@exoscale.ch] 
Sent: 18 December 2017 13:56
To: dev@cloudstack.apache.org
Subject: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs

Hi everyone,

Another point, another thread. Currently, when a management server is shut
down, it may be in the middle of processing an async job task, and as far as I
know none of the "stop()" methods are called. The job then fails because the
response is never delivered to the correct management server, even though the
job might have succeeded on the agent. To overcome this limitation, which
affects us because of our weekly production upgrades, we added a pre-shutdown
mechanism that works alongside HAProxy. The management server keeps an eye on
a file, "lb-agent", in which keywords can be written following the HAProxy
guide
(https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-check).
When it finds "maint", "stopped" or "drain", it stops these threads:
 - AsyncJobManager._heartbeatScheduler: responsible for fetching AsyncJobs and
starting their execution
 - AlertManagerImpl._timer: responsible for sending capacity check commands
 - StatsCollector._executor: responsible for scheduling stats commands
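
As a rough illustration of the keyword handling described above, here is a
minimal Java sketch. It is not the code we run: the enum "LbAgentAction" and
its method "fromFileContent" are hypothetical names, and splitting on
whitespace and commas is an assumption based on the HAProxy agent-check
format. Treating "up" and "ready" as the resume keywords is likewise an
assumption.

```java
import java.util.Locale;

// Hypothetical helper, not CloudStack code: maps the content of the
// "lb-agent" file to the action the management server should take.
enum LbAgentAction {
    PAUSE_SCHEDULERS,   // "maint", "stopped" or "drain" was found
    RESUME_SCHEDULERS,  // "up" or "ready" was found (assumed resume keywords)
    NONE;               // empty or unrecognised content: change nothing

    static LbAgentAction fromFileContent(String content) {
        if (content == null) {
            return NONE;
        }
        // Assumption: words are separated by whitespace or commas.
        String normalised = content.trim().toLowerCase(Locale.ROOT);
        for (String word : normalised.split("[\\s,]+")) {
            switch (word) {
                case "maint":
                case "stopped":
                case "drain":
                    return PAUSE_SCHEDULERS;
                case "up":
                case "ready":
                    return RESUME_SCHEDULERS;
                default:
                    break; // unknown word: look at the next one
            }
        }
        return NONE;
    }
}
```

The first recognised keyword wins, so a file containing "drain" pauses the
schedulers and any other text is ignored.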

With those threads stopped, the management server no longer runs most of its
scheduled tasks. The correct thing to do before shutting down the server would
be to also send "rebalance/reconnect" commands to all agents connected to that
management server, to ensure that no commands go through this server at all.

Here, HAProxy is responsible for no longer sending API requests to the
corresponding server, based on this local agent check.

If you want to cancel the maintenance shutdown, you can write "up" or "ready"
to the file, and the different schedulers will be restarted.
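
To show what stopping and restarting a scheduler could look like, here is a
minimal Java sketch built on the standard ScheduledExecutorService. The class
"PausableTask" is a hypothetical name and does not exist in CloudStack; the
real schedulers would need their own wiring.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper, not CloudStack code: a periodic task that can be
// paused for a pre-shutdown drain and resumed if the shutdown is cancelled.
final class PausableTask {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable task;
    private final long periodMillis;
    private ScheduledFuture<?> future; // null while the task is paused

    PausableTask(Runnable task, long periodMillis) {
        this.task = task;
        this.periodMillis = periodMillis;
    }

    // Start (or restart) the periodic execution; no-op if already running.
    synchronized void resume() {
        if (future == null) {
            future = executor.scheduleAtFixedRate(
                    task, 0, periodMillis, TimeUnit.MILLISECONDS);
        }
    }

    // Stop scheduling new runs; a run already in progress is not interrupted.
    synchronized void pause() {
        if (future != null) {
            future.cancel(false);
            future = null;
        }
    }

    synchronized boolean isRunning() {
        return future != null;
    }

    // Final shutdown: pause the task and release the executor thread.
    synchronized void shutdown() {
        pause();
        executor.shutdown();
    }
}
```

Calling pause() when a drain keyword appears and resume() when the shutdown is
cancelled leaves in-flight work alone, which is the point of the pre-shutdown
phase.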

This is really more of an operational change around CloudStack, for people
doing live upgrades on a regular basis, so I'm unsure whether the community
would want such a change in the code base. It goes somewhat in the opposite
direction to the change that removes the need for HAProxy:
https://github.com/apache/cloudstack/pull/2309

If there is enough positive feedback for such a change, I will port our
changes to the upstream branch and open a PR.

Kind regards,
Marc-Aurèle
