Andrija, this is a tough scenario.
As an admin, the way I would have handled this situation is to advertise the upcoming outage and then take away specific API commands from the user a day before, so that he does not start any long-running async jobs. Once maintenance completes, enable the API commands for the user again. However, I don't know who your user base is and whether this would be an acceptable solution. Perhaps also investigate what can be done to speed up your long-running tasks...
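For illustration only, a rough sketch of how specific commands could be temporarily denied with the dynamic roles API (available since 4.9) and cloudmonkey - the role ID and the list of commands are made up and would need to match your setup:

    # deny the async-heavy commands for the affected role before the maintenance window
    # (note: rules are matched in order, so a deny rule must come before any matching allow rule)
    cloudmonkey create rolepermission roleid=10 rule=createSnapshot permission=deny
    cloudmonkey create rolepermission roleid=10 rule=migrateVolume permission=deny

    # once maintenance is over, list the temporary deny rules and remove them again,
    # e.g. cloudmonkey delete rolepermission id=<uuid>
    cloudmonkey list rolepermissions roleid=10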
As a side note, we will be working on a feature that would allow for graceful termination of the process/job, meaning that if the agent notices a disconnect or termination request, it will abort the command in flight. We could also consider restarting these tasks afterwards, but that would not be part of this enhancement.

Regards,
ilya

On Thu, Apr 5, 2018 at 6:47 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:

> Hi Ilya,
>
> Thanks for the feedback - but in the "real world", you need to "understand" that 60 min is a next-to-useless timeout for some jobs (if I understand this specific parameter correctly - the job is really cancelled, not only the job monitoring?).
>
> My value for "job.cancel.threshold.minutes" is 2880 minutes (2 days).
>
> I can tell you, when you have CEPH/NFS (CEPH is even the "worse" case, since reads are slower during the qemu-img convert process...) and a 500 GB volume, the snapshot job will take many hours. Should I mention 1 TB volumes (yes, we had clients like that...)?
> Then attaching a 1 TB volume that was uploaded to ACS (it lives originally on Secondary Storage and takes time to be copied over to NFS/CEPH) will take up to a few hours.
> Then migrating a 1 TB volume from NFS to CEPH, or CEPH to NFS, also takes time... etc.
>
> I'm just giving you feedback as a "user", an admin of the cloud with zero DEV skills here :), just to make sure you make practical decisions (and I admit I might be wrong with my stuff, but I'm just giving you feedback from our public cloud setup).
>
> Cheers!
>
> On 5 April 2018 at 15:16, Tutkowski, Mike <mike.tutkow...@netapp.com> wrote:
>
> > Wow, there's been a lot of good detail noted from several people on how this process works today and how we'd like it to work in the near future.
> >
> > 1) Any chance this is already documented on the Wiki?
> >
> > 2) If not, any chance someone would be willing to do so (a flow diagram would be particularly useful)?
> >
> > > On Apr 5, 2018, at 3:37 AM, Marc-Aurèle Brothier <ma...@exoscale.ch> wrote:
> > >
> > > Hi all,
> > >
> > > Good point ilya, but as stated by Sergey there are more things to consider before being able to do a proper shutdown. I augmented the script I gave you originally and changed code in CS. What we're doing for our environment is as follows:
> > >
> > > 1. The MGMT looks for a change in the file /etc/lb-agent, which contains keywords for HAProxy [2] (ready, maint), so that HAProxy can disable the mgmt on the keyword "maint" and the mgmt server stops a couple of threads [1] to stop processing async jobs in the queue.
> > > 2. Look at the async jobs and wait until there are none, to ensure you can send the reconnect commands (if jobs are running, a reconnect will result in a failed job since the result will never reach the management server - the agent waits for the current job to be done before reconnecting, and discards the result... room for improvement here!).
> > > 3. Issue a reconnectHost command to all the hosts connected to the mgmt server so that they reconnect to another one; otherwise the mgmt must stay up, since it is used to forward commands to agents.
> > > 4. When all agents are reconnected, we can shut down the management server and perform the maintenance.
> > >
> > > One issue remains for me: during the reconnect, the commands that are processed at the same time should be kept in a queue until the agents have finished any current jobs and have reconnected. Today, the little time window during which the reconnect happens can lead to failed jobs because the agent is not connected at the right moment.
> > >
> > > I could push a PR for the change that stops some processing threads based on the content of a file. It is also possible to cancel the drain of the management server by simply changing the content of the file back to "ready" instead of "maint" [2].
> > >
> > > [1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
> > > [2] HAProxy documentation on the agent check: https://cbonte.github.io/haproxy-dconv/1.6/configuration.html#5.2-agent-check
> > >
> > > Regarding your issue on the port blocking, I think it's fair to consider that if you want to shut down your server at some point, you have to stop serving (some) requests. Here the only way is to stop serving everything. If the API had a REST design, we could reject any POST/PUT/DELETE operations and allow GET ones. I don't know how hard it would be today to only allow listBaseCmd operations, to be more friendly with the users.
> > >
> > > Marco
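For illustration, that drain sequence could look roughly like the shell sketch below. The /etc/lb-agent keywords are the ones from the message above; the mshost/host/async_job queries and the cloudmonkey reconnectHost call are written from memory of the schema and API, so treat them as assumptions to adapt:

    # 1. tell HAProxy (via the agent-check file) to stop sending new work to this MS
    echo "maint" > /etc/lb-agent

    # resolve this node's management server id (assumes a single service IP
    # and DB credentials in ~/.my.cnf)
    MSID=$(mysql -N -s cloud -e \
        "SELECT msid FROM mshost WHERE service_ip='$(hostname -i)'")

    # 2. wait until this MS has no running async jobs left
    while true; do
        RUNNING=$(mysql -N -s cloud -e \
            "SELECT COUNT(*) FROM async_job WHERE job_status=0 \
               AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' \
               AND job_init_msid=${MSID}")
        [ "$RUNNING" -eq 0 ] && break
        sleep 30
    done

    # 3. ask every host still attached to this MS to reconnect elsewhere
    # (a real script would also wait for the agents to finish reconnecting)
    for host in $(mysql -N -s cloud -e \
            "SELECT uuid FROM host WHERE mgmt_server_id=${MSID} AND removed IS NULL"); do
        cloudmonkey reconnect host id=${host}
    done

    # 4. only then stop the management server and start the maintenance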
> > > On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy <serg...@hotmail.com> wrote:
> > >
> > > > Now without spellchecking :)
> > > >
> > > > This is not simple, e.g. for VMware. Each management server also acts as an agent proxy, so tasks against a particular ESX host will always be forwarded. The right answer would be to support a native "maintenance mode" for the management server. When entered into such a mode, the management server should release all agents including SSVM, block/redirect API calls and login requests, and finish all async jobs it originated.
> > > >
> > > > On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy <serg...@hotmail.com> wrote:
> > > >
> > > > This is not simple e.g. for VMware. Each management server also acts as an agent proxy so tasks against a particular ESX host will be always forwarded. That right answer will be to a native support for "maintenance mode" for management server. When entered to such mode the management server should release all agents including save, block/redirect API calls and login request and finish all a sync job it originated.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:
> > > >
> > > > Ilya, still regarding the management server that is being shut down: if other MSs, or maybe system VMs (I am not sure whether they are able to do such tasks), can direct/redirect/send new jobs to this management server (the one being shut down), the process might never end, because new tasks are always being created for the management server that we want to shut down. Is this scenario possible?
> > > >
> > > > That is why I mentioned blocking port 8250 for the "graceful shutdown".
> > > >
> > > > If this scenario is not possible, then everything is fine.
> > > >
> > > > On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:
> > > >
> > > > I'm thinking of using the value of the "job.cancel.threshold.minutes" configuration - it will be the longest:
> > > >
> > > >     "category": "Advanced",
> > > >     "description": "Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long",
> > > >     "name": "job.cancel.threshold.minutes",
> > > >     "value": "60"
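For reference, the setting being discussed here can be inspected and changed through the API, for example with cloudmonkey (admin access assumed; 2880 is just the value Andrija mentions, not a recommendation):

    # show the current threshold after which async jobs are forcibly cancelled
    cloudmonkey list configurations name=job.cancel.threshold.minutes

    # raise it for environments with very long-running snapshot/migration jobs;
    # note that some global settings only take effect after a management server restart
    cloudmonkey update configuration name=job.cancel.threshold.minutes value=2880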
> > > > On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:
> > > >
> > > > Big +1 for this feature; I only have a few doubts.
> > > >
> > > > * Regarding the tasks/jobs that management servers (MSs) execute: do these tasks originate from requests that come to the MS, or is it possible for requests received by one management server to be executed by another? I mean, if I execute a request against MS1, will this request always be executed/treated by MS1, or is it possible that this request is executed by another MS (e.g. MS2)?
> > > >
> > > > * I would suggest that after we block traffic coming in on 8080/8443/8250 (we will need to block this as well, right?), we log the execution of tasks. I mean, something saying: there are XXX tasks (enumerate tasks) still being executed, we will wait for them to finish before shutting down.
> > > >
> > > > * The timeout (60 minutes suggested) could be a global setting that we load before executing the graceful shutdown.
> > > >
> > > > On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:
> > > >
> > > > Use case:
> > > > In any environment, from time to time, an administrator needs to perform maintenance. The current stop sequence of the CloudStack management server ignores the fact that there may be long-running async jobs and terminates the process. This in turn can create a poor user experience and occasional inconsistency in the CloudStack DB.
> > > >
> > > > This is especially painful in large environments where the user has thousands of nodes and there is continuous patching happening around the clock that requires migration of workload from one node to another.
> > > >
> > > > With that said, I've created a script that monitors the async job queue for a given MS and waits for it to complete all jobs. More details are posted below.
> > > >
> > > > I'd like to introduce "graceful-shutdown" into the systemctl/service of the cloudstack-management service.
> > > >
> > > > The details of how it will work are below.
> > > >
> > > > Workflow for graceful shutdown:
> > > > Using iptables/firewalld, block any connection attempts on 8080/8443 (we can identify the ports dynamically).
> > > > Identify the MSID for the node; using the proper msid, query the async_job table for
> > > > 1) any jobs that are still running (job_status="0"),
> > > > 2) job_dispatcher not like "pseudoJobDispatcher",
> > > > 3) job_init_msid=$my_ms_id.
> > > >
> > > > Monitor this async_job table for 60 minutes, until all async jobs for the MSID are done, then proceed with the shutdown.
> > > > If it fails for any reason or is terminated, catch the exit via the trap command and unblock 8080/8443.
> > > >
> > > > Comments are welcome.
> > > >
> > > > Regards,
> > > > ilya
> > > >
> > > > --
> > > > Rafael Weingärtner
> > > >
> > > > --
> > > > Rafael Weingärtner
> >
> > --
> > Andrija Panić
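For illustration, the workflow quoted above could be sketched as a shell script along the following lines. The ports, the iptables approach and the async_job query come from the proposal itself; the mshost lookup, the DB access via ~/.my.cnf and the systemctl call are assumptions, and a real implementation would hook into the cloudstack-management service scripts instead:

    #!/bin/bash
    # sketch of the proposed graceful shutdown for a single management server

    unblock() {
        # restore API access whether we finish, fail or get terminated
        iptables -D INPUT -p tcp -m multiport --dports 8080,8443 -j REJECT 2>/dev/null || true
    }
    trap unblock EXIT INT TERM

    # stop accepting new API/UI requests
    iptables -I INPUT -p tcp -m multiport --dports 8080,8443 -j REJECT

    # resolve this node's management server id (assumes a single service IP)
    MSID=$(mysql -N -s cloud -e \
        "SELECT msid FROM mshost WHERE service_ip='$(hostname -i)'")

    # monitor the async_job table for up to 60 minutes, then proceed anyway
    for _ in $(seq 1 60); do
        RUNNING=$(mysql -N -s cloud -e \
            "SELECT COUNT(*) FROM async_job WHERE job_status=0 \
               AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' \
               AND job_init_msid=${MSID}")
        [ "$RUNNING" -eq 0 ] && break
        echo "${RUNNING} async jobs still running for msid ${MSID}, waiting..."
        sleep 60
    done

    systemctl stop cloudstack-management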