Andrija,

This is the reason for this enhancement: snapshots, migrations and others are all async jobs, and should therefore be tracked in the async_job table under a specific MS. It is known that they may take a while to complete, and the last thing we want is to interrupt them.

Depending on the value you have set in Configurations, a job may time out but continue working in the background. That is, CloudStack will stop tracking the async job beyond a specific interval, but the CloudStack agent will push forward.

I don't see any harm in taking the server offline if there are no jobs being tracked. However, we should not stop the server if we identify any jobs that are still active. The user can decide to append the forceful shutdown after the graceful one if he feels like it. For example:

[shell] # service cloudstack-management graceful-shutdown; service cloudstack-management shutdown

For your issue, please check the value of "job.cancel.threshold.minutes":

"category": "Advanced",
"description": "Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long",
"name": "job.cancel.threshold.minutes",
"value": "60"

I propose that the graceful shutdown command use "job.cancel.threshold.minutes" as the maximum time to wait before giving up on the endeavor.

The only issue I'm on the fence about is blocking access to 8080/8443 if you have a single-node setup. There is a chance you may block access to CloudStack for over an hour, and that may not be what you intended. Perhaps we add a parameter in db.properties for "graceful.shutdown.block.api.server = true/false".

Regards,
ilya

On Wed, Apr 4, 2018 at 2:22 PM, Andrija Panic <andrija.pa...@gmail.com> wrote:

> One comment here (I had to shut down the whole DC for a few hours recently...):
> please at least consider the snapshotting process as a
> special case. It can take a few hours for a snapshot to complete (the copy
> process from Primary to Secondary Storage).
>
> In my recent unfortunate DC shutdown I did actually stop the MS (we also
> have a script to identify running async jobs, so we stop it once safe), but
> any running qemu-img processes (we use KVM) need to be killed manually
> (Ansible) after the MS is stopped, etc.
>
> I can assume most jobs can take a reasonably long time to complete, but
> snapshots are probably the biggest exception, as they can take an extremely
> long time to complete...
>
> Cheers
>
> On 4 April 2018 at 22:46, Tutkowski, Mike <mike.tutkow...@netapp.com>
> wrote:
>
> > I may be remembering this incorrectly, but from what I recall, if a
> > resource is owned by one MS and a request related to that resource comes
> > in to another MS, the MS that received the request passes it on to the
> > other MS.
> >
> > > On Apr 4, 2018, at 2:36 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:
> > >
> > > Big +1 for this feature; I only have a few doubts.
> > >
> > > * Regarding the tasks/jobs that management servers (MSs) execute: do
> > > these tasks originate from requests that come to the MS, or is it
> > > possible for requests received by one management server to be executed
> > > by another? I mean, if I execute a request against MS1, will this
> > > request always be executed/treated by MS1, or is it possible that this
> > > request is executed by another MS (e.g. MS2)?
> > >
> > > * I would suggest that after we block traffic coming from
> > > 8080/8443/8250 (we will need to block this as well, right?), we log the
> > > execution of tasks. I mean, something saying: there are XXX tasks
> > > (enumerate tasks) still being executed, we will wait for them to finish
> > > before shutting down.
> > >
> > > * The timeout (60 minutes suggested) could be a global setting that we
> > > can load before executing the graceful-shutdown.
> > >
> > > On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:
> > >
> > >> Use case:
> > >> In any environment, from time to time, an administrator needs to
> > >> perform maintenance. The current stop sequence of the CloudStack
> > >> management server will ignore the fact that there may be long-running
> > >> async jobs and terminate the process.
> > >> This in turn can create a poor user experience and occasional
> > >> inconsistency in the CloudStack DB.
> > >>
> > >> This is especially painful in large environments where the user has
> > >> thousands of nodes and there is continuous patching around the clock
> > >> that requires migration of workload from one node to another.
> > >>
> > >> With that said, I've created a script that monitors the async job
> > >> queue for a given MS and waits for it to complete all jobs. More
> > >> details are posted below.
> > >>
> > >> I'd like to introduce "graceful-shutdown" into the systemctl/service
> > >> of the cloudstack-management service.
> > >>
> > >> The details of how it will work are below:
> > >>
> > >> Workflow for graceful shutdown:
> > >> Using iptables/firewalld, block any connection attempts on 8080/8443
> > >> (we can identify the ports dynamically).
> > >> Identify the MSID for the node; using the proper MSID, query the
> > >> async_job table for
> > >> 1) any jobs that are still running (job_status="0")
> > >> 2) job_dispatcher not like "pseudoJobDispatcher"
> > >> 3) job_init_msid=$my_ms_id
> > >>
> > >> Monitor this async_job table for 60 minutes. Once all async jobs for
> > >> the MSID are done, proceed with shutdown.
> > >> If failed for any reason or terminated, catch the exit via the trap
> > >> command and unblock 8080/8443.
> > >>
> > >> Comments are welcome
> > >>
> > >> Regards,
> > >> ilya
> > >
> > > --
> > > Rafael Weingärtner
>
> --
> Andrija Panić
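
For readers following the thread, the graceful-shutdown workflow from the original proposal (block the API ports, poll async_job for this MSID, give up after a timeout, unblock on exit) could be sketched roughly as below. This is not ilya's actual script, which was not posted in the thread. The function names, the port list, and the idea of passing the job-counting command as an argument are all assumptions made for illustration; only the three query conditions and the iptables/trap approach come from the proposal.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: function names, the port list and the way the job
# count is obtained are assumptions, not an existing CloudStack script.

API_PORTS="8080 8443"   # assumed defaults; the proposal suggests detecting them

# Print the query for jobs still tracked under the management server MSID $1.
# The three conditions mirror the ones listed in the proposal.
active_jobs_sql() {
    printf "SELECT COUNT(*) FROM async_job WHERE job_status = 0 AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' AND job_init_msid = %s;\n" "$1"
}

# Reject new connections to the API ports. Requires root.
block_api_ports() {
    local port
    for port in $API_PORTS; do
        iptables -I INPUT -p tcp --dport "$port" -j REJECT
    done
}

# Remove the rules added by block_api_ports; safe to call if they are absent.
unblock_api_ports() {
    local port
    for port in $API_PORTS; do
        iptables -D INPUT -p tcp --dport "$port" -j REJECT 2>/dev/null || true
    done
}

# Poll until no jobs remain or the timeout expires.
#   $1 = timeout in seconds, $2 = poll interval in seconds,
#   remaining arguments = a command that prints the number of active jobs.
# Returns 0 when the queue has drained, 1 on timeout, 2 if the command fails.
wait_for_async_jobs() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( SECONDS + timeout ))
    local active
    while :; do
        active=$("$@") || return 2
        [ "$active" -eq 0 ] && return 0
        [ "$SECONDS" -ge "$deadline" ] && return 1
        echo "$active async job(s) still running, waiting..." >&2
        sleep "$interval"
    done
}

# Possible wiring (not executed here; count_jobs is a hypothetical helper that
# feeds the output of active_jobs_sql to the mysql client and prints the count):
#   trap unblock_api_ports EXIT
#   block_api_ports
#   wait_for_async_jobs 3600 30 count_jobs "$MSID" && service cloudstack-management stop
```

Passing the counting command as an argument keeps the polling loop independent of how the database is reached, so the timeout could later be read from "job.cancel.threshold.minutes" as suggested in the reply, without touching the loop itself.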