Yes! I totally agree, this area is rich with opportunities for innovation. I particularly like the idea of getting progress updates on long-running tasks. Long-running work with no feedback often makes users/admins feel that something is wrong, and they may intervene by mistake.
On Wed, Oct 13, 2021 at 1:16 AM Rohit Yadav <rohit.ya...@shapeblue.com> wrote:

> Thanks for starting this discussion thread.
>
> I think you've hit on a fundamental RPC programming model issue in
> CloudStack wherein the communication from the management server (control
> plane) to agents (agents) is uni-directional and management server is not
> aware how to process a response outside of the immediate thread of context.
>
> This limitation is clearly visible and causes side-effects for
> long-running Commands where an Answer is not sent back but during which the
> control plane may become unavailable/restarts; since management server
> doesn't have the thread of context, any Answers sent back are ignored.
> Furthermore, when this happens agents get disconnected but continue to
> process all commands before reconnecting back to management server. This is
> a more serious problem for connected agents (such as KVM agents, ssvm/cpvm
> agents) than direct agents (those for VMware/XenServer etc), as direct
> agents are killed/stopped with the management server. The general
> side-effects include resources that were created but later ignored
> (requires manual cleanup for snapshots for ex. etc).
>
> In the past I've had discussions with colleagues (both at work and in the
> community) and my recollection is this can be solved with: (brain dump of
> ideas and thoughts, some from old conversations, and some new)
>
> * Refactor long-running Commands: introduce new child/abstract class
>   or interface that separates normal Commands vs long-running Commands - that
>   way we know which commands are long-running and should have special
>   handlers.
>   (top of my head Commands that do any storage work such as taking
>   a snapshot are long-running)
> * Rolling-ownership: safely delegate ownership to another management
>   server with the passing context of handling an Answer for a set of
>   long-running Commands (usually a Java method/class which is the handler,
>   perhaps using DB + reflections)
> * Bi-directional communication, message-bus based handlers: just like
>   we've the Command-Answer patterns, we perhaps need a new RPC mechanism that
>   is directional and secured (with CA framework), where agents can announce
>   both streaming progress of some task (say template downloaded etc) and also
>   support long-running tasks/answers that aren't ignored when control plane
>   is unavailable.
> * I had some thoughts around having a plugin-framework based
>   embedded locking service within CloudStack (so turnkey and doesn't require
>   separate infra, brokers etc.) that implements both (a) a lock server (so
>   replace MySQL DB based GLOBAL_LOCK() too) and (b) a distributed message bus
>   which can be used to store/update/delete/announce/queue tasks. This sort of
>   locking/message bus framework can be implemented via pluggable plugins that
>   say are implemented using mysql/db, embedded zookeeper, or hazelcast. We
>   had done some poc as part of an internal hackathon in the past (
>   https://github.com/shapeblue/cloudstack/tree/locking-service).
> * Maybe a more modern approach would be to look at how other
>   projects are solving this problem, maybe explore other RPC frameworks such
>   as gRPC.
>
> Regards.
>
> ________________________________
> From: Marcus <shadow...@gmail.com>
> Sent: Wednesday, October 6, 2021 22:36
> To: dev@cloudstack.apache.org <dev@cloudstack.apache.org>
> Subject: Re: KVM Agent disconnect hooks
>
> Thanks, Daan, I appreciate the feedback and I hope it is useful. I wish I
> could comment more on the other hypervisors and how something similar might
> work for them.
>
> On Wed, Oct 6, 2021 at 2:49 AM Daan Hoogland <daan.hoogl...@gmail.com> wrote:
>
> > thanks Marcus,
> >
> > On Tue, Oct 5, 2021 at 7:32 PM Marcus <shadow...@gmail.com> wrote:
> >
> > > Hi everyone! It's been awhile. I've got a feature I'd love to get some
> > > feedback on and contribute to the community, if it's acceptable. I need to
> > > brush up on the proper process (did read CONTRIBUTING.md).
> >
> > Not a lot has changed. The technical discussions have tended to move to
> > github issues but there is of course a push back on that as it doesn't
> > comply with apache bylaws.
> >
> > > I should have
> > > discussed this *before* implementation, for sure, but since this is
> > > something I've already got I figured I'd use it to go through the process
> > > and refresh myself on the latest.
> > >
> > > https://github.com/apache/cloudstack/pull/5552
> >
> > I like the concept you describe.
> >
> > > If you're familiar with the KVM agent, I needed to provide a way for
> > > long-running jobs to be able to react and clean up their work when the
> > > agent or management server is stopped, or they just lose connectivity with
> > > each other. Currently, if the management server is restarted while the
> > > agent is working on something, the agent-side work could continue on and
> > > complete, but the management server would fail the job. This is ok in many
> > > circumstances, but sometimes this can lead to cruft like copied files that
> > > are never used.
> >
> > as stated on the PR, it is not only a problem for KVM (but one thing at a
> > time)
> >
> > > I'm not entirely happy with this as there's potential for race,
> > > particularly in de-registration of the hook, but it seems like a reasonable
> > > start.
> > > It just requires the coder of the hook to understand that a rollback
> > > could be attempted even if the bulk of their task has completed, or not
> > > started yet, and account for that, whether they pass a lock, or do a "try",
> > > or something else.
> >
> > reconnect could include a 'status of work' dialog, I imagine this is what
> > you mean by "requires the coder .. to understand". If the MS forgot about a
> > job, the agent will probably not and can send the information back.
> >
> > let's see your first work though first (as in +1)
> >
> > --
> > Daan
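
To make the rollback caveat quoted above a bit more concrete, here is a minimal Java sketch of a task whose disconnect hook is safe to call before the work starts, while it runs, or after it completes. This is only an illustration under my own assumptions: the class name RollbackSafeTask, the methods begin/finish/onDisconnect, and the boolean "artifact" standing in for a copied file are all hypothetical and are not taken from the PR or from the CloudStack agent code.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: none of these names come from CloudStack or the PR.
final class RollbackSafeTask {
    enum State { NOT_STARTED, RUNNING, COMPLETED, ROLLED_BACK }

    private final AtomicReference<State> state =
            new AtomicReference<>(State.NOT_STARTED);
    // Stands in for a side effect such as a copied file on the host.
    private final AtomicBoolean artifact = new AtomicBoolean(false);

    // Starts the work unless a rollback already won the race.
    boolean begin() {
        if (!state.compareAndSet(State.NOT_STARTED, State.RUNNING)) {
            return false;
        }
        artifact.set(true); // simulate producing the artifact
        return true;
    }

    // Commits the work, or cleans up if a rollback arrived while running.
    boolean finish() {
        if (state.compareAndSet(State.RUNNING, State.COMPLETED)) {
            return true;
        }
        artifact.set(false); // rolled back mid-run: remove the cruft
        return false;
    }

    boolean run() {
        return begin() && finish();
    }

    // Disconnect hook: returns true only if this call cancelled the task.
    // A completed task is left alone; keeping or undoing finished work is
    // a policy choice, and this sketch chooses to keep it.
    boolean onDisconnect() {
        if (state.compareAndSet(State.NOT_STARTED, State.ROLLED_BACK)) {
            return true; // nothing was created, nothing to clean
        }
        // If still running, the worker cleans up when finish() fails.
        return state.compareAndSet(State.RUNNING, State.ROLLED_BACK);
    }

    boolean artifactExists() { return artifact.get(); }
    State state() { return state.get(); }

    public static void main(String[] args) {
        RollbackSafeTask task = new RollbackSafeTask();
        task.begin();
        boolean cancelled = task.onDisconnect(); // disconnect arrives mid-run
        boolean committed = task.finish();
        System.out.println("cancelled=" + cancelled
                + " committed=" + committed
                + " artifact=" + task.artifactExists());
    }
}
```

The point of the compare-and-set transitions is that the hook and the worker never both believe they own the outcome, which is one way to do the "try" that Marcus mentions without passing a lock around. A late hook call after completion simply returns false, so the de-registration race becomes harmless in this sketch rather than something the caller has to prevent.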