Thanks for the feedback, Ilya. Then we would only need to adapt this new feature introduced by you and ShapeBlue.
On Sat, Apr 21, 2018 at 4:03 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

Rafael,

What you are suggesting was already implemented. We've created load balancing algorithms - but we did not take the LB algorithm into account for maintenance (yet). Rohit and ShapeBlue were the developers behind the feature.

What needs to happen is a tweak to the LB algorithms to make them MS-maintenance aware - or to create new LB algorithms altogether. Essentially we need to merge your work and this feature. Please read the FS below.

Functional Spec:

The new CA framework introduced basic support for a comma-separated list of management servers for the agent, which makes an external LB unnecessary.

This extends that feature to implement LB sorting algorithms that sort the management server list before it is sent to the agents. This adds central intelligence in the management server, plus additional enhancements to the Agent class to be algorithm aware and to have a background mechanism to check/fall back to the preferred management server (assumed to be the first in the list). This supports any indirect agent such as the KVM, CPVM and SSVM agents, and would provide support for management server host migration during upgrade (when, instead of upgrading in place, new hosts are used to set up new management servers).

This FR introduces two new global settings:

- indirect.agent.lb.algorithm: the algorithm for the indirect agent LB.
- indirect.agent.lb.check.interval: the preferred host check interval for the agent's background task that checks and switches to the agent's preferred host.

indirect.agent.lb.algorithm supports the following algorithm options:

- static: use the list as provided.
- roundrobin: evenly spreads hosts across management servers based on the host's id.
- shuffle: (pseudo) randomly sorts the list (not recommended for production).

Changes to the global settings indirect.agent.lb.algorithm and host do not require restarting the management server(s) or the agents. A message-bus-based system dynamically reacts to changes in these global settings and propagates them to all connected agents.

The comma-separated management server list is propagated to agents in the following cases:

- Addition of a host (including the SSVM and CPVM system VMs).
- Connection or reconnection by an agent to a management server.
- After the admin changes the 'host' and/or 'indirect.agent.lb.algorithm' global settings.

On the agent side, the 'host' setting is saved in its properties file as: host=<comma separated addresses>@<algorithm name>.

First the agent connects to the management server and sends its current management server list, which is compared by the management server; in case of a mismatch a new/updated list is sent for the agent to persist.

From the agent's perspective, the first address in the propagated list is considered the preferred host. A new background task can be activated by configuring indirect.agent.lb.check.interval, which is a cluster-level global setting in CloudStack; admins can also override this by configuring 'host.lb.check.interval' in the agent.properties file.

Every time the agent receives the MS host list and the algorithm, the host-specific background check interval is also sent, and the agent dynamically reconfigures the background task without needing to be restarted.

Note: the 'static' and 'roundrobin' algorithms strictly check for the order they expect, whereas the 'shuffle' algorithm only checks the content, not the order, of the comma-separated MS host addresses.

Regards
ilya
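To make the sorting behaviour above concrete, here is a minimal Python sketch of the three algorithms and of the host=<addresses>@<algorithm> format the agent persists. Only the algorithm names and the setting format come from the FS; the function names, the sample addresses and the exact round-robin rotation are illustrative assumptions, not the actual CloudStack code.

    # Rough sketch of the list sorting described in the FS. Only the
    # algorithm names and the "host=<addresses>@<algorithm>" format come
    # from the FS; the function names, the sample addresses and the exact
    # round-robin rotation are illustrative assumptions.
    import random

    def sort_ms_list(ms_hosts, algorithm, host_id=0):
        """Return the management server list in the order an agent should use."""
        if algorithm == "static":
            # Use the list exactly as provided by the admin.
            return list(ms_hosts)
        if algorithm == "roundrobin":
            # Rotate the list based on the host's id so that hosts are
            # spread evenly across the management servers.
            offset = host_id % len(ms_hosts)
            return ms_hosts[offset:] + ms_hosts[:offset]
        if algorithm == "shuffle":
            # (Pseudo) random order - not recommended for production.
            shuffled = list(ms_hosts)
            random.shuffle(shuffled)
            return shuffled
        raise ValueError("unknown algorithm: " + algorithm)

    # What the agent persists in agent.properties, per the FS:
    #   host=10.1.1.1,10.1.1.2,10.1.1.3@roundrobin
    addresses, algo = "10.1.1.1,10.1.1.2,10.1.1.3@roundrobin".split("@")
    print(sort_ms_list(addresses.split(","), algo, host_id=5))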
On Fri, Apr 20, 2018 at 1:01 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Is that management server load balancing feature using static configurations? I heard about it on the mailing list, but I did not follow the implementation.

I do not see many problems with agents reconnecting. We can implement in agents (not just KVM, but also system VMs) a logic where, instead of using a static pool of management servers configured in a properties file, they dynamically request a list of available management servers via that "list management servers" API method. This would require us to configure agents with a load balancer URL that executes the balancing between multiple management servers.

I am +1 to remove the need for that VIP, which executes the load balancing for connecting agents to management servers.

On Fri, Apr 20, 2018 at 4:41 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

Rafael and Community,

All is well and good, and I think we are thinking along similar lines - the only issue I see right now with any approach is KVM agents (or direct agents) and the use of a load balancer on 8250.

Here is a scenario:

You have a two-management-server setup fronted with a VIP on 8250.
The LB algorithm is either Round Robin or Least Connections.
You initiate a maintenance mode operation on one of the MS servers (call it MS1) - assume you have a long running migration job that needs 60 minutes to complete.
We attempt to evacuate the agents by telling them to disconnect and reconnect again.
If we are using an LB on 8250 with:
1) Least Connections - all agents will continuously try to connect to the MS1 node that is attempting to go down for maintenance. Essentially, with this LB configuration, the operation will never complete.
2) Round Robin - this will take a while, but eventually you will get all nodes connected to MS2.

The current limitation is the use of an external LB on 8250. For this operation to work without issue, agents must connect to the MS server without an LB. This is a recent feature we've developed with ShapeBlue - where we maintain the list of CloudStack management servers in the agent.properties file.

Unless you can think of another solution, it appears we may be forced to bypass the 8250 VIP LB and use the new feature to maintain the list of management servers within agent.properties.

I need to run now, let me know what your thoughts are.

Regards
ilya
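For reference, a rough Python sketch of the agent-side behaviour that the FS describes and that ilya points to here: the agent keeps the comma-separated MS list, treats the first entry as the preferred host, and runs a background check that switches back to it when it becomes reachable again. The class, the TCP probe on 8250 and the interval value are assumptions; only the list/preferred-host/check-interval behaviour is taken from the FS.

    # Illustrative agent-side sketch; not the actual Agent implementation.
    import socket
    import threading

    CHECK_INTERVAL_SEC = 60   # stands in for host.lb.check.interval

    def is_reachable(host, port=8250, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    class PreferredHostChecker:
        def __init__(self, ms_list, reconnect):
            self.ms_list = ms_list       # e.g. ["10.1.1.1", "10.1.1.2"]
            self.current = ms_list[-1]   # pretend we failed over earlier
            self.reconnect = reconnect   # callback that re-homes the agent

        def check_once(self):
            preferred = self.ms_list[0]
            if self.current != preferred and is_reachable(preferred):
                self.reconnect(preferred)
                self.current = preferred

        def start(self):
            # Background task: re-run the check every CHECK_INTERVAL_SEC.
            def loop():
                self.check_once()
                threading.Timer(CHECK_INTERVAL_SEC, loop).start()
            loop()

    checker = PreferredHostChecker(["10.1.1.1", "10.1.1.2"],
                                   reconnect=lambda h: print("switching to", h))
    checker.check_once()   # start() would keep checking in the background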
On Tue, Apr 17, 2018 at 8:27 AM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Ilya and others,

We have been discussing this idea of a graceful/nice shutdown. Our feeling is that we (in the CloudStack community) might have been trying to solve this problem with too much scripting. What if we developed a more integrated (native) solution?

Let me explain our idea.

ACS has a table called "mshost", which is used to store management server information. During balancing, and when jobs are dispatched to other management servers, this table is consulted/queried. Therefore, we have been discussing the idea of creating a management API for management servers. We could have an API method that changes the state of a management server to "prepare for maintenance" and then to "maintenance" (as soon as all of the tasks/jobs it is managing finish). The idea is that during rebalancing we would remove the hosts of servers that are not in the "Up" state (and, of course, servers in the aforementioned maintenance states would not receive hosts to manage). Moreover, when we send/dispatch jobs to other management servers, we could ignore the ones that are not in the "Up" state (which is something already done).

By doing this, the graceful shutdown could be executed in a few steps:

1 - issue the maintenance method for the management server you desire
2 - wait until the MS goes into maintenance mode; while there are still running jobs, it (the management server) will be kept in "prepare for maintenance"
3 - execute the Linux shutdown command

We would then need other API methods to manage MSs: (i) an API method to list MSs, and we could even create (ii) an API to remove old/deactivated management servers, which we currently do not have (forcing users to apply changes directly in the database).

Moreover, in this model, we would not kill hanging jobs; we would wait until they expire and ACS expunges them. Of course, it is possible to develop a forceful maintenance method as well. Then, when the "prepare for maintenance" takes longer than a configured parameter, we could kill hanging jobs.

All of this would allow the MS to be kept up and receiving requests until it can be safely shut down. What do you guys think about this approach?
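A sketch of how this proposed flow could look from an operator's point of view. None of these API methods exist yet; prepareManagementServerForMaintenance and listManagementServers are hypothetical names standing in for the APIs proposed above, and call_api is a placeholder for whatever API client would be used.

    # Operator-side sketch of the proposed (not yet existing) maintenance flow.
    import time

    def call_api(command, **params):
        # Placeholder: would issue the CloudStack API call and return the result.
        raise NotImplementedError

    def drain_and_shutdown(msid, poll_seconds=30):
        # 1. Ask the management server to stop accepting new work.
        call_api("prepareManagementServerForMaintenance", id=msid)
        # 2. Wait while it finishes the jobs it manages; it stays in
        #    "PrepareForMaintenance" until they are done.
        while call_api("listManagementServers", id=msid)["state"] != "Maintenance":
            time.sleep(poll_seconds)
        # 3. Only now run the OS-level shutdown of the service/host.
        print("management server %s is in maintenance; safe to shut down" % msid)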
On Tue, Apr 10, 2018 at 6:52 PM, Yiping Zhang <yzh...@marketo.com> wrote:

As a cloud admin, I would love to have this feature.

It so happens that I just accidentally restarted my ACS management server while two instances were migrating to another Xen cluster (via storage migration, not live migration). As a result, both instances ended up with corrupted data disks which can't be reattached or migrated.

Any feature which prevents this from happening would be great. A low hanging fruit is simply checking whether there are any async jobs running, especially any kind of migration job or other known long-running type of job, and warning the operator so that he has a chance to abort the server shutdown.

Yiping

On 4/5/18, 3:13 PM, "ilya musayev" <ilya.mailing.li...@gmail.com> wrote:

Andrija,

This is a tough scenario.

As an admin, the way I would have handled this situation is to advertise the upcoming outage and then take away specific API commands from the user a day before - so he does not cause any long running async jobs. Once maintenance completes - enable the API commands for the user again. However - I don't know who your user base is and whether this would be an acceptable solution.

Perhaps also investigate what can be done to speed up your long running tasks...

As a side note, we will be working on a feature that would allow for a graceful termination of the process/job, meaning that if the agent notices a disconnect or termination request - it will abort the command in flight. We can also consider restarting these tasks again or whatnot - but that would not be part of this enhancement.

Regards
ilya

On Thu, Apr 5, 2018 at 6:47 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:

Hi Ilya,

thanks for the feedback - but in the "real world", you need to "understand" that 60 minutes is a next-to-useless timeout for some jobs (if I understand this specific parameter correctly?? - the job is really cancelled, not only the job monitoring???).

My value for "job.cancel.threshold.minutes" is 2880 minutes (2 days).

I can tell you, when you have CEPH/NFS (CEPH is even the "worse" case, since reads are slower during the qemu-img convert process...) and a 500GB volume, the snapshot job will take many hours. Should I mention 1TB volumes (yes, we had clients like that...)?
Also, attaching a 1TB volume that was uploaded to ACS (it lives originally on Secondary Storage and takes time to be copied over to NFS/CEPH) will take up to a few hours.
Then migrating a 1TB volume from NFS to CEPH, or CEPH to NFS, also takes time... etc.

I'm just giving you feedback as a "user", an admin of the cloud, with zero DEV skills here :) , just to make sure you make practical decisions (and I admit I might be wrong with my stuff, but just giving you feedback from our public cloud setup).

Cheers!

On 5 April 2018 at 15:16, Tutkowski, Mike <mike.tutkow...@netapp.com> wrote:

Wow, there's been a lot of good detail noted by several people on how this process works today and how we'd like it to work in the near future.

1) Any chance this is already documented on the Wiki?

2) If not, any chance someone would be willing to do so (a flow diagram would be particularly useful)?
On Apr 5, 2018, at 3:37 AM, Marc-Aurèle Brothier <ma...@exoscale.ch> wrote:

Hi all,

Good point ilya, but as stated by Sergey there is more to consider before being able to do a proper shutdown. I augmented the script I originally gave you and changed code in CS. What we're doing for our environment is as follows:

1. The MGMT looks for a change in the file /etc/lb-agent, which contains keywords for HAProxy [2] (ready, maint), so that HAProxy can disable the mgmt on the keyword "maint" and the mgmt server stops a couple of threads [1] to stop processing async jobs in the queue.
2. Look for the async jobs and wait until there are none, to ensure you can send the reconnect commands (if jobs are running, a reconnect will result in a failed job since the result will never reach the management server - the agent waits for the current job to be done before reconnecting, and discards the result... room for improvement here!).
3. Issue a reconnectHost command to all the hosts connected to the mgmt server so that they reconnect to another one; otherwise the mgmt must stay up since it is used to forward commands to agents.
4. When all agents are reconnected, we can shut down the management server and perform the maintenance.

One issue remains for me: during the reconnect, the commands that are processed at the same time should be kept in a queue until the agents have finished any current jobs and have reconnected. Today, the small time window during which the reconnect happens can lead to failed jobs due to the agent not being connected at the right moment.

I could push a PR for the change that stops some processing threads based on the content of a file. It is also possible to cancel the drain of the management server by simply changing the content of the file back to "ready" again, instead of "maint" [2].

[1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
[2] HAProxy documentation on agent checks: https://cbonte.github.io/haproxy-dconv/1.6/configuration.html#5.2-agent-check

Regarding your issue with the port blocking, I think it's fair to consider that if you want to shut down your server at some point, you have to stop serving (some) requests. Here the only way is to stop serving everything. If the API had a REST design, we could reject any POST/PUT/DELETE operations and allow GET ones. I don't know how hard it would be today to only allow listBaseCmd operations, to be more friendly to the users.

Marco
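A minimal sketch of the drain toggle described in step 1: the content of /etc/lb-agent ("ready" or "maint") drives both the answer to HAProxy's agent-check and the modified management server's decision to stop its job-processing threads. The path and keywords come from the message above; how the content is served to HAProxy is outside this sketch, and the helper itself is illustrative.

    # Flip the drain state used by the HAProxy agent-check and the patched
    # management server, per the workflow described above.
    LB_AGENT_FILE = "/etc/lb-agent"

    def set_lb_state(state):
        if state not in ("ready", "maint"):
            raise ValueError("state must be 'ready' or 'maint'")
        with open(LB_AGENT_FILE, "w") as f:
            f.write(state + "\n")

    # Start draining this management server:   set_lb_state("maint")
    # Cancel the drain and rejoin the LB pool: set_lb_state("ready")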
On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy <serg...@hotmail.com> wrote:

Now without spellchecking :)

This is not simple, e.g. for VMware. Each management server also acts as an agent proxy, so tasks against a particular ESX host will always be forwarded. The right answer would be to support a native "maintenance mode" for the management server. When entering such a mode, the management server should release all agents (including SSVM), block/redirect API calls and login requests, and finish all async jobs it originated.

On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Ilya, still regarding the management server that is being shut down: if other MSs, or maybe system VMs (I am not sure whether they are able to do such tasks), can direct/redirect/send new jobs to this management server (the one being shut down), the process might never end, because new tasks are always being created for the management server that we want to shut down. Is this scenario possible?

That is why I mentioned blocking port 8250 for the "graceful-shutdown".

If this scenario is not possible, then everything is fine.
On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

I'm thinking of using the configuration from "job.cancel.threshold.minutes" - it will be the longest:

    "category": "Advanced",
    "description": "Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long",
    "name": "job.cancel.threshold.minutes",
    "value": "60"

On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Big +1 for this feature; I only have a few doubts.

* Regarding the tasks/jobs that management servers (MSs) execute: do these tasks originate from requests that come to the MS, or is it possible for requests received by one management server to be executed by another? I mean, if I execute a request against MS1, will this request always be executed/treated by MS1, or is it possible that this request is executed by another MS (e.g. MS2)?

* I would suggest that after we block traffic coming to 8080/8443/8250 (we will need to block this as well, right?), we log the execution of tasks. I mean, something saying: there are XXX tasks (enumerate the tasks) still being executed, and we will wait for them to finish before shutting down.

* The timeout (60 minutes suggested) could be a global setting that we load before executing the graceful shutdown.

On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

Use case:
In any environment, from time to time, an administrator needs to perform maintenance. The current stop sequence of the CloudStack management server ignores the fact that there may be long running async jobs - and terminates the process. This in turn can create a poor user experience and occasional inconsistency in the CloudStack db.

This is especially painful in large environments where the user has thousands of nodes and there is continuous patching happening around the clock - which requires migration of workload from one node to another.

With that said - I've created a script that monitors the async job queue for a given MS and waits for it to complete all jobs. More details are posted below.

I'd like to introduce "graceful-shutdown" into the systemctl/service unit of the cloudstack-management service.

The details of how it will work are below:

Workflow for graceful shutdown:
Using iptables/firewalld, block any connection attempts on 8080/8443 (we can identify the ports dynamically).
Identify the MSID for the node; using the proper msid, query the async_job table for:
1) any jobs that are still running (job_status = "0")
2) job_dispatcher not like "pseudoJobDispatcher"
3) job_init_msid = $my_ms_id

Monitor this async_job table for 60 minutes - until all async jobs for the MSID are done - then proceed with shutdown.
If it fails for any reason or is terminated, catch the exit via the trap command and unblock 8080/8443.

Comments are welcome.

Regards,
ilya
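For illustration, a rough Python equivalent of the kind of monitoring script described above (the original is a shell script). The async_job conditions and the 60-minute window come from the proposal; the database access is left as a placeholder, and the iptables handling is one possible way to block/unblock the ports, not a prescribed one.

    # Sketch of the drain/monitor workflow described in the proposal above.
    import subprocess
    import time

    RUNNING_JOBS_SQL = (
        "SELECT COUNT(*) FROM async_job "
        "WHERE job_status = 0 "
        "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
        "AND job_init_msid = %s"
    )

    def count_running_jobs(msid):
        # Placeholder: run RUNNING_JOBS_SQL against the cloud database with
        # whichever client/driver is available and return the count.
        raise NotImplementedError

    def block_ports(ports, block=True):
        # One possible way to block/unblock the UI/API ports with iptables.
        action = "-I" if block else "-D"
        for port in ports:
            subprocess.call(["iptables", action, "INPUT", "-p", "tcp",
                             "--dport", str(port), "-j", "DROP"])

    def graceful_shutdown(msid, timeout_minutes=60):
        ports = [8080, 8443]
        block_ports(ports, block=True)
        try:
            deadline = time.time() + timeout_minutes * 60
            while time.time() < deadline:
                if count_running_jobs(msid) == 0:
                    print("no running async jobs for msid %s; safe to stop" % msid)
                    return True
                time.sleep(30)
            # Timed out: unblock the ports, as the 'trap' step does, so the
            # management server keeps serving requests.
            block_ports(ports, block=False)
            return False
        except BaseException:
            # Interrupted or failed for any reason: same unblock as above.
            block_ports(ports, block=False)
            raise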
-- 
Rafael Weingärtner