Thanks for starting this discussion thread. I think you've hit on a fundamental issue in CloudStack's RPC programming model: communication from the management server (the control plane) to the agents is uni-directional, and the management server does not know how to process a response outside of the immediate thread of context.
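To make the "thread of context" point concrete, here is a minimal Java sketch under stated assumptions. `AnswerRoutingSketch`, `ManagementServer`, `send` and `onAnswer` are hypothetical names used only for illustration; they are not CloudStack's actual classes or methods. The in-memory map stands in for the waiting context, and a restart is simulated by creating a fresh instance.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are CloudStack classes.
final class AnswerRoutingSketch {

    /** Stand-in for a management server that tracks pending commands only in memory. */
    static final class ManagementServer {
        // Sequence number -> the waiter that knows what to do with the Answer.
        private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

        /** Records, in memory only, that someone is waiting for the answer to this command. */
        CompletableFuture<String> send(long seq) {
            CompletableFuture<String> waiter = new CompletableFuture<>();
            pending.put(seq, waiter);
            return waiter;
        }

        /** Returns true if the answer reached a waiter, false if it was dropped. */
        boolean onAnswer(long seq, String answer) {
            CompletableFuture<String> waiter = pending.remove(seq);
            if (waiter == null) {
                return false; // no context left for this sequence: the answer is ignored
            }
            waiter.complete(answer);
            return true;
        }
    }

    public static void main(String[] args) {
        ManagementServer before = new ManagementServer();
        CompletableFuture<String> waiter = before.send(1L);
        before.send(2L); // a long-running command that is still in flight

        // Normal case: the answer arrives while the context still exists.
        System.out.println(before.onAnswer(1L, "snapshot done")); // true
        System.out.println(waiter.join());                        // snapshot done

        // Simulated restart: a fresh instance starts with an empty pending map,
        // so the agent's late answer for command 2 has nowhere to go.
        ManagementServer after = new ManagementServer();
        System.out.println(after.onAnswer(2L, "snapshot done"));  // false
    }
}
```

The only point of the sketch is that the mapping from a command to its handler lives in process memory, so anything that outlives the process (an agent still working on command 2) has no way to hand its result back.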
This limitation is clearly visible, and causes side-effects, with long-running Commands: if the control plane becomes unavailable or restarts before an Answer has been sent back, the management server no longer has the thread of context, so any Answer sent back is ignored. Furthermore, when this happens agents get disconnected but continue to process all commands before reconnecting to the management server. This is a more serious problem for connected agents (such as KVM agents, ssvm/cpvm agents) than for direct agents (those for VMware/XenServer etc.), as direct agents are killed/stopped with the management server. The general side-effects include resources that were created but later ignored (for example, snapshots that then require manual cleanup).

In the past I've had discussions with colleagues (both at work and in the community), and my recollection is that this can be solved with the following (a brain dump of ideas and thoughts, some from old conversations and some new):

* Refactor long-running Commands: introduce a new child/abstract class or interface that separates normal Commands from long-running Commands. That way we know which commands are long-running and should have special handlers. (Off the top of my head, Commands that do any storage work, such as taking a snapshot, are long-running.)
* Rolling-ownership: safely delegate ownership to another management server, passing along the context for handling an Answer for a set of long-running Commands (usually a Java method/class which is the handler, perhaps using DB + reflection).
* Bi-directional communication, message-bus based handlers: just as we have the Command-Answer pattern, we perhaps need a new RPC mechanism that is bi-directional and secured (with the CA framework), where agents can announce streaming progress of some task (say, template downloaded etc.) and also support long-running tasks/answers that aren't ignored when the control plane is unavailable.
* I had some thoughts around having a plugin-framework based embedded locking service within CloudStack (so turnkey and doesn't require separate infra, brokers etc.) that implements both (a) a lock server (so replace MySQL DB based GLOBAL_LOCK() too) and (b) a distributed message bus which can be used to store/update/delete/announce/queue tasks. This sort of locking/message bus framework can be implemented via pluggable plugins that, say, are implemented using mysql/db, embedded zookeeper, or hazelcast. We had done some PoC as part of an internal hackathon in the past (https://github.com/shapeblue/cloudstack/tree/locking-service).
* Maybe a more modern approach would be to look at how other projects are solving this problem, maybe explore other RPC frameworks such as gRPC.

Regards.

________________________________
From: Marcus <shadow...@gmail.com>
Sent: Wednesday, October 6, 2021 22:36
To: dev@cloudstack.apache.org <dev@cloudstack.apache.org>
Subject: Re: KVM Agent disconnect hooks

Thanks, Daan, I appreciate the feedback and I hope it is useful. I wish I could comment more on the other hypervisors and how something similar might work for them.

On Wed, Oct 6, 2021 at 2:49 AM Daan Hoogland <daan.hoogl...@gmail.com> wrote:

> thanks Marcus,
>
> On Tue, Oct 5, 2021 at 7:32 PM Marcus <shadow...@gmail.com> wrote:
>
> > Hi everyone! It's been awhile. I've got a feature I'd love to get some feedback on and contribute to the community, if it's acceptable. I need to brush up on the proper process (did read CONTRIBUTING.md).
>
> Not a lot has changed. The technical discussions have tended to move to github issues but there is of course a push back on that as it doesn't comply with apache bylaws.
>
> > I should have discussed this *before* implementation, for sure, but since this is something I've already got I figured I'd use it to go through the process and refresh myself on the latest.
> >
> > https://github.com/apache/cloudstack/pull/5552
>
> I like the concept you describe.
>
> > If you're familiar with the KVM agent, I needed to provide a way for long-running jobs to be able to react and clean up their work when the agent or management server is stopped, or they just lose connectivity with each other. Currently, if the management server is restarted while the agent is working on something, the agent-side work could continue on and complete, but the management server would fail the job. This is ok in many circumstances, but sometimes this can lead to cruft like copied files that are never used.
>
> as stated on the PR, it is not only a problem for KVM (but one thing at a time)
>
> > I'm not entirely happy with this as there's potential for race, particularly in de-registration of the hook, but it seems like a reasonable start. It just requires the coder of the hook to understand that a rollback could be attempted even if the bulk of their task has completed, or not started yet, and account for that, whether they pass a lock, or do a "try", or something else.
>
> reconnect could include a 'status of work' dialog, I imagine this is what you mean by "requires the coder .. to understand". If the MS forgot about a job, the agent will probably not and can send the information back.
>
> let's see your first work though first (as in +1)
>
> --
> Daan