This is not simple, e.g. for VMware. Each management server also acts as an agent proxy, so tasks against a particular ESX host will always be forwarded. The right answer would be native support for a "maintenance mode" for the management server. When it enters such a mode, the management server should release all agents including save, block/redirect API calls and login requests, and finish all async jobs it originated.
Sent from my iPhone

> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <rafaelweingart...@gmail.com>
> wrote:
>
> Ilya, still regarding the issue of the management server that is being
> shut down: if other MSs or maybe system VMs (I am not sure whether they
> are able to do such tasks) can direct/redirect/send new jobs to this
> management server (the one being shut down), the process might never end
> because new tasks are always being created for the management server that
> we want to shut down. Is this scenario possible?
>
> That is why I mentioned blocking the port 8250 for the "graceful-shutdown".
>
> If this scenario is not possible, then everything is fine.
>
> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <ilya.mailing.li...@gmail.com>
> wrote:
>
>> I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
>> it will be the longest
>>
>> "category": "Advanced",
>> "description": "Time (in minutes) for async-jobs to be forcely
>> cancelled if it has been in process for long",
>> "name": "job.cancel.threshold.minutes",
>> "value": "60"
>>
>> On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner
>> <rafaelweingart...@gmail.com> wrote:
>>
>>> Big +1 for this feature; I only have a few doubts.
>>>
>>> * Regarding the tasks/jobs that management servers (MSs) execute: do
>>> these tasks originate from requests that come to the MS, or is it
>>> possible for requests received by one management server to be executed
>>> by another? I mean, if I execute a request against MS1, will this
>>> request always be executed/treated by MS1, or is it possible that this
>>> request is executed by another MS (e.g. MS2)?
>>>
>>> * I would suggest that after we block traffic coming from
>>> 8080/8443/8250 (we will need to block this as well, right?), we can log
>>> the execution of tasks. I mean, something saying: there are XXX tasks
>>> (enumerate tasks) still being executed, we will wait for them to finish
>>> before shutting down.
>>>
>>> * The timeout (60 minutes suggested) could be a global setting that we
>>> can load before executing the graceful-shutdown.
>>>
>>> On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev
>>> <ilya.mailing.li...@gmail.com> wrote:
>>>
>>>> Use case:
>>>> In any environment, from time to time, an administrator needs to
>>>> perform maintenance. The current stop sequence of the cloudstack
>>>> management server ignores the fact that there may be long-running
>>>> async jobs and terminates the process. This in turn can create a poor
>>>> user experience and occasional inconsistency in the cloudstack db.
>>>>
>>>> This is especially painful in large environments where the user has
>>>> thousands of nodes and there is continuous patching that happens
>>>> around the clock, which requires migration of workload from one node
>>>> to another.
>>>>
>>>> With that said, I've created a script that monitors the async job
>>>> queue for a given MS and waits for it to complete all jobs. More
>>>> details are posted below.
>>>>
>>>> I'd like to introduce "graceful-shutdown" into the systemctl/service
>>>> of the cloudstack-management service.
>>>>
>>>> The details of how it will work are below:
>>>>
>>>> Workflow for graceful shutdown:
>>>> Using iptables/firewalld - block any connection attempts on 8080/8443
>>>> (we can identify the ports dynamically)
>>>> Identify the MSID for the node; using the proper msid, query the
>>>> async_job table for
>>>> 1) any jobs that are still running (or job_status="0")
>>>> 2) job_dispatcher not like "pseudoJobDispatcher"
>>>> 3) job_init_msid=$my_ms_id
>>>>
>>>> Monitor this async_job table for 60 minutes - until all async jobs
>>>> for MSID are done, then proceed with shutdown
>>>> If failed for any reason or terminated, catch the exit via trap
>>>> command and unblock the 8080/8443
>>>>
>>>> Comments are welcome
>>>>
>>>> Regards,
>>>> ilya
>>>
>>> --
>>> Rafael Weingärtner
>
> --
> Rafael Weingärtner
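
For reference, the graceful-shutdown workflow quoted above could be sketched roughly as below. This is only an illustration under stated assumptions, not the script ilya mentions: the function names, the callback-based design, the 30-second poll interval, and the `%s` placeholder style are all hypothetical. Only the table name, the three query conditions, and the 60-minute limit come from the thread. Blocking/unblocking the ports and stopping the service are passed in as callables, since how they are done (iptables, firewalld, systemctl) is left open in the proposal.

```python
import time

# Built from the three conditions in the proposal. Table and column names
# come from the thread; the %s placeholder style is an assumption and
# depends on whichever database driver is actually used.
RUNNING_JOBS_SQL = (
    "SELECT COUNT(*) FROM async_job "
    "WHERE job_status = 0 "
    "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
    "AND job_init_msid = %s"
)


def wait_for_jobs(count_running, timeout_s=3600.0, poll_s=30.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until count_running() returns 0 or timeout_s has elapsed.

    Returns True if the queue drained, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_running() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def graceful_shutdown(block_ports, unblock_ports, count_running,
                      stop_service, **wait_kwargs):
    """Block new connections, wait for async jobs, then stop the service.

    If the wait times out or anything raises, the ports are unblocked
    again, which plays the role of the shell `trap` in the proposal.
    """
    block_ports()
    stopped = False
    try:
        if wait_for_jobs(count_running, **wait_kwargs):
            stop_service()
            stopped = True
        return stopped
    finally:
        if not stopped:
            unblock_ports()
```

In a real script, `count_running` would run `RUNNING_JOBS_SQL` with this node's msid and return the count. The timeout could be read from `job.cancel.threshold.minutes` instead of being hard-coded, as suggested earlier in the thread. Note that this sketch does nothing about port 8250 or jobs forwarded from other management servers, which is the open question raised above.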