Thanks for the feedback, Ilya. Then we would only need to adapt this new feature introduced by you and ShapeBlue.
On Sat, Apr 21, 2018 at 4:03 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

Rafael,

What you are suggesting was already implemented. We've created load balancing algorithms - but we did not take the LB algorithm into account for maintenance (yet). Rohit and ShapeBlue were the developers behind the feature.

What needs to happen is a tweak to the LB algorithms to make them MS-maintenance aware - or to create new LB algorithms altogether. Essentially we need to merge your work and this feature. Please read the FS below.

Functional Spec:

The new CA framework introduced basic support for a comma-separated list of management servers for the agent, which makes an external LB unnecessary.

This extends that feature to implement LB sorting algorithms that sort the management server list before it is sent to the agents. This adds central intelligence in the management server, plus additional enhancements to the Agent class to be algorithm aware and to have a background mechanism to check/fall back to the preferred management server (assumed to be the first in the list). This supports any indirect agent such as the KVM, CPVM and SSVM agents, and would provide support for management server host migration during upgrade (when, instead of upgrading in place, new hosts are used to set up new management servers).

This FR introduces two new global settings:

- indirect.agent.lb.algorithm: the algorithm for the indirect agent LB.
- indirect.agent.lb.check.interval: the preferred host check interval for the agent's background task that checks and switches to the agent's preferred host.

indirect.agent.lb.algorithm supports the following algorithm options:

- static: use the list as provided.
- roundrobin: evenly spreads hosts across management servers based on the host's id.
- shuffle: (pseudo) randomly sorts the list (not recommended for production).

Changes to the global settings indirect.agent.lb.algorithm and host do not require restarting the management server(s) or the agents. A message-bus-based system dynamically reacts to changes in these global settings and propagates them to all connected agents.

The comma-separated management server list is propagated to agents in the following cases:

- Addition of a host (including the SSVM and CPVM system VMs).
- Connection or reconnection by an agent to a management server.
- After the admin changes the 'host' and/or 'indirect.agent.lb.algorithm' global settings.

On the agent side, the 'host' setting is saved in its properties file as: host=<comma separated addresses>@<algorithm name>.

First the agent connects to the management server and sends its current management server list, which is compared by the management server; in case of a mismatch a new/updated list is sent for the agent to persist.

From the agent's perspective, the first address in the propagated list is considered the preferred host. A new background task can be activated by configuring indirect.agent.lb.check.interval, which is a cluster-level global setting in CloudStack; admins can also override this by configuring 'host.lb.check.interval' in the agent.properties file.

Every time the agent receives the MS host list and the algorithm, the host-specific background check interval is also sent, and the agent dynamically reconfigures the background task without needing to be restarted.

Note: the 'static' and 'roundrobin' algorithms strictly check for the order they expect, whereas the 'shuffle' algorithm only checks the content, not the order, of the comma-separated MS host addresses.

Regards
ilya
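To make the sorting behaviour above concrete, here is a minimal Python sketch of the three algorithms and of the host=<addresses>@<algorithm> format the agent persists. Only the algorithm names and the setting format come from the FS; the function names, the sample addresses and the exact round-robin rotation are illustrative assumptions, not the actual CloudStack code.

    # Rough sketch of the list sorting described in the FS. Only the
    # algorithm names and the "host=<addresses>@<algorithm>" format come
    # from the FS; the function names, the sample addresses and the exact
    # round-robin rotation are illustrative assumptions.
    import random

    def sort_ms_list(ms_hosts, algorithm, host_id=0):
        """Return the management server list in the order an agent should use."""
        if algorithm == "static":
            # Use the list exactly as provided by the admin.
            return list(ms_hosts)
        if algorithm == "roundrobin":
            # Rotate the list based on the host's id so that hosts are
            # spread evenly across the management servers.
            offset = host_id % len(ms_hosts)
            return ms_hosts[offset:] + ms_hosts[:offset]
        if algorithm == "shuffle":
            # (Pseudo) random order - not recommended for production.
            shuffled = list(ms_hosts)
            random.shuffle(shuffled)
            return shuffled
        raise ValueError("unknown algorithm: " + algorithm)

    # What the agent persists in agent.properties, per the FS:
    #   host=10.1.1.1,10.1.1.2,10.1.1.3@roundrobin
    addresses, algo = "10.1.1.1,10.1.1.2,10.1.1.3@roundrobin".split("@")
    print(sort_ms_list(addresses.split(","), algo, host_id=5))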
On Fri, Apr 20, 2018 at 1:01 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Is that management server load balancing feature using static configurations? I heard about it on the mailing list, but I did not follow the implementation.

I do not see many problems with agents reconnecting. We can implement in agents (not just KVM, but also system VMs) a logic where, instead of using a static pool of management servers configured in a properties file, they dynamically request a list of available management servers via that "list management servers" API method. This would require us to configure agents with a load balancer URL that executes the balancing between multiple management servers.

I am +1 to remove the need for that VIP, which executes the load balancing for connecting agents to management servers.

On Fri, Apr 20, 2018 at 4:41 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

Rafael and Community,

All is well and good, and I think we are thinking along similar lines - the only issue I see right now with any approach is KVM agents (or direct agents) and the use of a load balancer on 8250.

Here is a scenario:

You have a two-management-server setup fronted with a VIP on 8250.
The LB algorithm is either Round Robin or Least Connections.
You initiate a maintenance mode operation on one of the MS servers (call it MS1) - assume you have a long running migration job that needs 60 minutes to complete.
We attempt to evacuate the agents by telling them to disconnect and reconnect again.
If we are using an LB on 8250 with:
1) Least Connections - all agents will continuously try to connect to the MS1 node that is attempting to go down for maintenance. Essentially, with this LB configuration, the operation will never complete.
2) Round Robin - this will take a while, but eventually you will get all nodes connected to MS2.

The current limitation is the use of an external LB on 8250. For this operation to work without issue, agents must connect to the MS server without an LB. This is a recent feature we've developed with ShapeBlue - where we maintain the list of CloudStack management servers in the agent.properties file.

Unless you can think of another solution, it appears we may be forced to bypass the 8250 VIP LB and use the new feature to maintain the list of management servers within agent.properties.

I need to run now, let me know what your thoughts are.

Regards
ilya
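For reference, a rough Python sketch of the agent-side behaviour that the FS describes and that ilya points to here: the agent keeps the comma-separated MS list, treats the first entry as the preferred host, and runs a background check that switches back to it when it becomes reachable again. The class, the TCP probe on 8250 and the interval value are assumptions; only the list/preferred-host/check-interval behaviour is taken from the FS.

    # Illustrative agent-side sketch; not the actual Agent implementation.
    import socket
    import threading

    CHECK_INTERVAL_SEC = 60   # stands in for host.lb.check.interval

    def is_reachable(host, port=8250, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    class PreferredHostChecker:
        def __init__(self, ms_list, reconnect):
            self.ms_list = ms_list       # e.g. ["10.1.1.1", "10.1.1.2"]
            self.current = ms_list[-1]   # pretend we failed over earlier
            self.reconnect = reconnect   # callback that re-homes the agent

        def check_once(self):
            preferred = self.ms_list[0]
            if self.current != preferred and is_reachable(preferred):
                self.reconnect(preferred)
                self.current = preferred

        def start(self):
            # Background task: re-run the check every CHECK_INTERVAL_SEC.
            def loop():
                self.check_once()
                threading.Timer(CHECK_INTERVAL_SEC, loop).start()
            loop()

    checker = PreferredHostChecker(["10.1.1.1", "10.1.1.2"],
                                   reconnect=lambda h: print("switching to", h))
    checker.check_once()   # start() would keep checking in the background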
On Tue, Apr 17, 2018 at 8:27 AM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Ilya and others,

We have been discussing this idea of a graceful/nice shutdown. Our feeling is that we (in the CloudStack community) might have been trying to solve this problem with too much scripting. What if we developed a more integrated (native) solution?

Let me explain our idea.

ACS has a table called "mshost", which is used to store management server information. During balancing, and when jobs are dispatched to other management servers, this table is consulted/queried. Therefore, we have been discussing the idea of creating a management API for management servers. We could have an API method that changes the state of a management server to "prepare for maintenance" and then to "maintenance" (as soon as all of the tasks/jobs it is managing finish). The idea is that during rebalancing we would remove the hosts of servers that are not in the "Up" state (and, of course, servers in the aforementioned maintenance states would not receive hosts to manage). Moreover, when we send/dispatch jobs to other management servers, we could ignore the ones that are not in the "Up" state (which is something already done).

By doing this, the graceful shutdown could be executed in a few steps:

1 - issue the maintenance method for the management server you desire
2 - wait until the MS goes into maintenance mode; while there are still running jobs, it (the management server) will be kept in "prepare for maintenance"
3 - execute the Linux shutdown command

We would then need other API methods to manage MSs: (i) an API method to list MSs, and we could even create (ii) an API to remove old/deactivated management servers, which we currently do not have (forcing users to apply changes directly in the database).

Moreover, in this model, we would not kill hanging jobs; we would wait until they expire and ACS expunges them. Of course, it is possible to develop a forceful maintenance method as well. Then, when the "prepare for maintenance" takes longer than a configured parameter, we could kill hanging jobs.

All of this would allow the MS to be kept up and receiving requests until it can be safely shut down. What do you guys think about this approach?
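A sketch of how this proposed flow could look from an operator's point of view. None of these API methods exist yet; prepareManagementServerForMaintenance and listManagementServers are hypothetical names standing in for the APIs proposed above, and call_api is a placeholder for whatever API client would be used.

    # Operator-side sketch of the proposed (not yet existing) maintenance flow.
    import time

    def call_api(command, **params):
        # Placeholder: would issue the CloudStack API call and return the result.
        raise NotImplementedError

    def drain_and_shutdown(msid, poll_seconds=30):
        # 1. Ask the management server to stop accepting new work.
        call_api("prepareManagementServerForMaintenance", id=msid)
        # 2. Wait while it finishes the jobs it manages; it stays in
        #    "PrepareForMaintenance" until they are done.
        while call_api("listManagementServers", id=msid)["state"] != "Maintenance":
            time.sleep(poll_seconds)
        # 3. Only now run the OS-level shutdown of the service/host.
        print("management server %s is in maintenance; safe to shut down" % msid)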
On Tue, Apr 10, 2018 at 6:52 PM, Yiping Zhang <yzh...@marketo.com> wrote:

As a cloud admin, I would love to have this feature.

It so happens that I just accidentally restarted my ACS management server while two instances were migrating to another Xen cluster (via storage migration, not live migration). As a result, both instances ended up with corrupted data disks which can't be reattached or migrated.

Any feature which prevents this from happening would be great. A low hanging fruit is simply checking whether there are any async jobs running, especially any kind of migration job or other known long-running type of job, and warning the operator so that he has a chance to abort the server shutdown.

Yiping

On 4/5/18, 3:13 PM, "ilya musayev" <ilya.mailing.li...@gmail.com> wrote:

Andrija,

This is a tough scenario.

As an admin, the way I would have handled this situation is to advertise the upcoming outage and then take away specific API commands from the user a day before - so he does not cause any long running async jobs. Once maintenance completes - enable the API commands for the user again. However - I don't know who your user base is and whether this would be an acceptable solution.

Perhaps also investigate what can be done to speed up your long running tasks...

As a side note, we will be working on a feature that would allow for a graceful termination of the process/job, meaning that if the agent notices a disconnect or termination request - it will abort the command in flight. We can also consider restarting these tasks again or whatnot - but that would not be part of this enhancement.

Regards
ilya

On Thu, Apr 5, 2018 at 6:47 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:

Hi Ilya,

thanks for the feedback - but in the "real world", you need to "understand" that 60 minutes is a next-to-useless timeout for some jobs (if I understand this specific parameter correctly?? - the job is really cancelled, not only the job monitoring???).

My value for "job.cancel.threshold.minutes" is 2880 minutes (2 days).

I can tell you, when you have CEPH/NFS (CEPH is even the "worse" case, since reads are slower during the qemu-img convert process...) and a 500GB volume, the snapshot job will take many hours. Should I mention 1TB volumes (yes, we had clients like that...)?
Also, attaching a 1TB volume that was uploaded to ACS (it lives originally on Secondary Storage and takes time to be copied over to NFS/CEPH) will take up to a few hours.
Then migrating a 1TB volume from NFS to CEPH, or CEPH to NFS, also takes time... etc.

I'm just giving you feedback as a "user", an admin of the cloud, with zero DEV skills here :) , just to make sure you make practical decisions (and I admit I might be wrong with my stuff, but just giving you feedback from our public cloud setup).

Cheers!

On 5 April 2018 at 15:16, Tutkowski, Mike <mike.tutkow...@netapp.com> wrote:

Wow, there's been a lot of good detail noted by several people on how this process works today and how we'd like it to work in the near future.

1) Any chance this is already documented on the Wiki?

2) If not, any chance someone would be willing to do so (a flow diagram would be particularly useful)?
On Apr 5, 2018, at 3:37 AM, Marc-Aurèle Brothier <ma...@exoscale.ch> wrote:

Hi all,

Good point ilya, but as stated by Sergey there is more to consider before being able to do a proper shutdown. I augmented the script I originally gave you and changed code in CS. What we're doing for our environment is as follows:

1. The MGMT looks for a change in the file /etc/lb-agent, which contains keywords for HAProxy [2] (ready, maint), so that HAProxy can disable the mgmt on the keyword "maint" and the mgmt server stops a couple of threads [1] to stop processing async jobs in the queue.
2. Look for the async jobs and wait until there are none, to ensure you can send the reconnect commands (if jobs are running, a reconnect will result in a failed job since the result will never reach the management server - the agent waits for the current job to be done before reconnecting, and discards the result... room for improvement here!).
3. Issue a reconnectHost command to all the hosts connected to the mgmt server so that they reconnect to another one; otherwise the mgmt must stay up since it is used to forward commands to agents.
4. When all agents are reconnected, we can shut down the management server and perform the maintenance.

One issue remains for me: during the reconnect, the commands that are processed at the same time should be kept in a queue until the agents have finished any current jobs and have reconnected. Today, the small time window during which the reconnect happens can lead to failed jobs due to the agent not being connected at the right moment.

I could push a PR for the change that stops some processing threads based on the content of a file. It is also possible to cancel the drain of the management server by simply changing the content of the file back to "ready" again, instead of "maint" [2].

[1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
[2] HAProxy documentation on agent checks: https://cbonte.github.io/haproxy-dconv/1.6/configuration.html#5.2-agent-check

Regarding your issue with the port blocking, I think it's fair to consider that if you want to shut down your server at some point, you have to stop serving (some) requests. Here the only way is to stop serving everything. If the API had a REST design, we could reject any POST/PUT/DELETE operations and allow GET ones. I don't know how hard it would be today to only allow listBaseCmd operations, to be more friendly to the users.

Marco
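A minimal sketch of the drain toggle described in step 1: the content of /etc/lb-agent ("ready" or "maint") drives both the answer to HAProxy's agent-check and the modified management server's decision to stop its job-processing threads. The path and keywords come from the message above; how the content is served to HAProxy is outside this sketch, and the helper itself is illustrative.

    # Flip the drain state used by the HAProxy agent-check and the patched
    # management server, per the workflow described above.
    LB_AGENT_FILE = "/etc/lb-agent"

    def set_lb_state(state):
        if state not in ("ready", "maint"):
            raise ValueError("state must be 'ready' or 'maint'")
        with open(LB_AGENT_FILE, "w") as f:
            f.write(state + "\n")

    # Start draining this management server:   set_lb_state("maint")
    # Cancel the drain and rejoin the LB pool: set_lb_state("ready")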
On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy <serg...@hotmail.com> wrote:

Now without spellchecking :)

This is not simple, e.g. for VMware. Each management server also acts as an agent proxy, so tasks against a particular ESX host will always be forwarded. The right answer would be to support a native "maintenance mode" for the management server. When entering such a mode, the management server should release all agents (including SSVM), block/redirect API calls and login requests, and finish all async jobs it originated.

On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Ilya, still regarding the management server that is being shut down: if other MSs, or maybe system VMs (I am not sure whether they are able to do such tasks), can direct/redirect/send new jobs to this management server (the one being shut down), the process might never end, because new tasks are always being created for the management server that we want to shut down. Is this scenario possible?

That is why I mentioned blocking port 8250 for the "graceful-shutdown".

If this scenario is not possible, then everything is fine.
On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

I'm thinking of using the configuration from "job.cancel.threshold.minutes" - it will be the longest:

    "category": "Advanced",
    "description": "Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long",
    "name": "job.cancel.threshold.minutes",
    "value": "60"

On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

Big +1 for this feature; I only have a few doubts.

* Regarding the tasks/jobs that management servers (MSs) execute: do these tasks originate from requests that come to the MS, or is it possible for requests received by one management server to be executed by another? I mean, if I execute a request against MS1, will this request always be executed/treated by MS1, or is it possible that this request is executed by another MS (e.g. MS2)?

* I would suggest that after we block traffic coming to 8080/8443/8250 (we will need to block this as well, right?), we log the execution of tasks. I mean, something saying: there are XXX tasks (enumerate the tasks) still being executed, and we will wait for them to finish before shutting down.

* The timeout (60 minutes suggested) could be a global setting that we load before executing the graceful shutdown.

On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.li...@gmail.com> wrote:

Use case:
In any environment, from time to time, an administrator needs to perform maintenance. The current stop sequence of the CloudStack management server ignores the fact that there may be long running async jobs - and terminates the process. This in turn can create a poor user experience and occasional inconsistency in the CloudStack db.

This is especially painful in large environments where the user has thousands of nodes and there is continuous patching happening around the clock - which requires migration of workload from one node to another.

With that said - I've created a script that monitors the async job queue for a given MS and waits for it to complete all jobs. More details are posted below.

I'd like to introduce "graceful-shutdown" into the systemctl/service unit of the cloudstack-management service.

The details of how it will work are below:

Workflow for graceful shutdown:
Using iptables/firewalld, block any connection attempts on 8080/8443 (we can identify the ports dynamically).
Identify the MSID for the node; using the proper msid, query the async_job table for:
1) any jobs that are still running (job_status = "0")
2) job_dispatcher not like "pseudoJobDispatcher"
3) job_init_msid = $my_ms_id

Monitor this async_job table for 60 minutes - until all async jobs for the MSID are done - then proceed with shutdown.
If it fails for any reason or is terminated, catch the exit via the trap command and unblock 8080/8443.

Comments are welcome.

Regards,
ilya
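For illustration, a rough Python equivalent of the kind of monitoring script described above (the original is a shell script). The async_job conditions and the 60-minute window come from the proposal; the database access is left as a placeholder, and the iptables handling is one possible way to block/unblock the ports, not a prescribed one.

    # Sketch of the drain/monitor workflow described in the proposal above.
    import subprocess
    import time

    RUNNING_JOBS_SQL = (
        "SELECT COUNT(*) FROM async_job "
        "WHERE job_status = 0 "
        "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
        "AND job_init_msid = %s"
    )

    def count_running_jobs(msid):
        # Placeholder: run RUNNING_JOBS_SQL against the cloud database with
        # whichever client/driver is available and return the count.
        raise NotImplementedError

    def block_ports(ports, block=True):
        # One possible way to block/unblock the UI/API ports with iptables.
        action = "-I" if block else "-D"
        for port in ports:
            subprocess.call(["iptables", action, "INPUT", "-p", "tcp",
                             "--dport", str(port), "-j", "DROP"])

    def graceful_shutdown(msid, timeout_minutes=60):
        ports = [8080, 8443]
        block_ports(ports, block=True)
        try:
            deadline = time.time() + timeout_minutes * 60
            while time.time() < deadline:
                if count_running_jobs(msid) == 0:
                    print("no running async jobs for msid %s; safe to stop" % msid)
                    return True
                time.sleep(30)
            # Timed out: unblock the ports, as the 'trap' step does, so the
            # management server keeps serving requests.
            block_ports(ports, block=False)
            return False
        except BaseException:
            # Interrupted or failed for any reason: same unblock as above.
            block_ports(ports, block=False)
            raise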
-- 
Rafael Weingärtner