Now without spellchecking :)

This is not simple, e.g. for VMware. Each management server also acts as an 
agent proxy, so tasks against a particular ESX host will always be forwarded. 
The right answer would be to support a native “maintenance mode” for the 
management server. When put into such a mode, the management server should 
release all agents including the SSVM, block/redirect API calls and login 
requests, and finish all async jobs it originated.



On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy <serg...@hotmail.com> wrote:

This is not simple e.g. for VMware. Each management server also acts as an 
agent proxy so tasks against a particular ESX host will be always forwarded. 
That right answer will be to a native support for “maintenance mode” for 
management server. When entered to such mode the management server should 
release all agents including save, block/redirect API calls and login request 
and finish all a sync job it originated.

Sent from my iPhone

On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner 
<rafaelweingart...@gmail.com> wrote:

Ilya, still regarding the issue of the management server that is being shut
down: if other MSs, or maybe system VMs (I am not sure whether they are able
to do such tasks), can direct/redirect/send new jobs to this management
server (the one being shut down), the process might never end, because new
tasks are always being created for the management server that we want to
shut down. Is this scenario possible?

That is why I mentioned blocking port 8250 for the “graceful-shutdown”.

If this scenario is not possible, then everything is fine.


On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev 
<ilya.mailing.li...@gmail.com> wrote:

I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
it will be the longest

    "category": "Advanced",
    "description": "Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long",
    "name": "job.cancel.threshold.minutes",
    "value": "60"




On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner 
<rafaelweingart...@gmail.com> wrote:

Big +1 for this feature; I only have a few doubts.

* Regarding the tasks/jobs that management servers (MSs) execute: do these
tasks originate from requests that come to the MS, or is it possible for
requests received by one management server to be executed by another? I
mean, if I execute a request against MS1, will this request always be
executed/treated by MS1, or is it possible that this request is executed
by another MS (e.g. MS2)?

* I would suggest that after we block traffic coming from 8080/8443/8250
(we will need to block this as well, right?), we log the execution of
tasks. I mean, something saying: there are XXX tasks (enumerate tasks)
still being executed, we will wait for them to finish before shutting down.

* The timeout (60 minutes suggested) could be a global setting that we
load before executing the graceful-shutdown.
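
To illustrate the logging suggestion above, here is a minimal Python sketch of
the kind of status line that could be emitted while waiting. The function name
`remaining_jobs_message` and its parameters are hypothetical, and the timeout
is assumed to have been loaded from a global setting beforehand.

```python
def remaining_jobs_message(job_ids, timeout_minutes):
    """Build a log line listing the jobs the shutdown is still waiting on.

    job_ids is a list of pending async job ids (hypothetical input);
    timeout_minutes is assumed to come from a global setting.
    """
    if not job_ids:
        return "No tasks are still being executed; shutting down."
    listed = ", ".join(str(job_id) for job_id in job_ids)
    return (
        f"There are {len(job_ids)} task(s) ({listed}) still being executed; "
        f"waiting up to {timeout_minutes} minutes for them to finish "
        "before shutting down."
    )


if __name__ == "__main__":
    print(remaining_jobs_message([101, 102], 60))
```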

On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev 
<ilya.mailing.li...@gmail.com> wrote:

Use case:
In any environment, from time to time, an administrator needs to perform
maintenance. The current stop sequence of the cloudstack management server
ignores the fact that there may be long running async jobs and terminates
the process. This in turn can create a poor user experience and occasional
inconsistency in the cloudstack db.

This is especially painful in large environments where the user has
thousands of nodes and there is continuous patching happening around the
clock, which requires migration of workload from one node to another.

With that said - I've created a script that monitors the async job queue
for a given MS and waits for it to complete all jobs. More details are
posted below.

I'd like to introduce "graceful-shutdown" into the systemctl/service of
the cloudstack-management service.

The details of how it will work are below:

Workflow for graceful shutdown:
Using iptables/firewalld - block any connection attempts on 8080/8443 (we
can identify the ports dynamically)
Identify the MSID for the node; using the proper MSID, query the async_job
table for
1) any jobs that are still running (or job_status="0")
2) job_dispatcher not like "pseudoJobDispatcher"
3) job_init_msid=$my_ms_id

Monitor the async_job table for up to 60 minutes - until all async jobs for
the MSID are done, then proceed with shutdown
  If the script fails for any reason or is terminated, catch the exit via the
trap command and unblock 8080/8443
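
The workflow above could be sketched roughly as follows. This is an
illustrative Python outline, not the script mentioned in the message: the
callables `block_ports`, `unblock_ports`, `count_pending` and `stop_service`
are hypothetical hooks the caller would supply (e.g. wrapping iptables and a
database query), the SQL assumes the column names quoted above, and treating
a timeout as a failure that unblocks the ports is an assumption.

```python
import time

# Query mirroring the three conditions in the workflow. Table and column
# names come from the message; the exact schema is an assumption.
PENDING_JOBS_SQL = (
    "SELECT COUNT(*) FROM async_job "
    "WHERE job_status = 0 "
    "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
    "AND job_init_msid = %s"
)


def wait_for_jobs(count_pending, timeout_s=60 * 60, poll_s=30,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll count_pending() until it returns 0 or timeout_s elapses.

    Returns True if the queue drained, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_pending() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def graceful_shutdown(block_ports, unblock_ports, count_pending,
                      stop_service, **wait_kwargs):
    """Block new connections, drain async jobs, then stop the service."""
    block_ports()
    try:
        if not wait_for_jobs(count_pending, **wait_kwargs):
            # Assumption: a timeout counts as a failed shutdown attempt.
            raise TimeoutError("async jobs still running after timeout")
        stop_service()
    except BaseException:
        # Rough analogue of the shell trap: restore access on any failure
        # or interruption. SIGTERM would additionally need a signal handler.
        unblock_ports()
        raise
    return True
```

In this sketch `count_pending` would typically run `PENDING_JOBS_SQL` with the
node's MSID as the parameter and return the resulting count.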

Comments are welcome

Regards,
ilya




--
Rafael Weingärtner


