Yes! I totally agree, this area is rich with opportunities for innovation. I
particularly like the idea of being able to get progress updates on
long-running tasks. Long-running work with no feedback often makes
users/admins feel that something is wrong, and they may mistakenly take
measures to intervene.
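
To make the progress-update idea concrete, here is a rough sketch of what an
agent-side reporter could look like. Every name in it (ProgressUpdate,
ProgressReporter, the sink) is hypothetical and not taken from the CloudStack
code base; how updates would actually travel back to the management server
is left open.

```java
import java.util.function.Consumer;

// Hypothetical types for illustration only; not part of CloudStack.
final class ProgressUpdate {
    final String jobId;
    final int percent;
    final String message;

    ProgressUpdate(String jobId, int percent, String message) {
        this.jobId = jobId;
        this.percent = percent;
        this.message = message;
    }
}

final class ProgressReporter {
    private final String jobId;
    private final Consumer<ProgressUpdate> sink; // assumed transport back to the control plane
    private int lastPercent = -1;

    ProgressReporter(String jobId, Consumer<ProgressUpdate> sink) {
        this.jobId = jobId;
        this.sink = sink;
    }

    // Emits an update only when the whole-number percentage advances,
    // so a tight copy loop does not flood the sink.
    synchronized void report(long done, long total, String message) {
        int percent = total <= 0 ? 0 : (int) Math.min(100, (done * 100) / total);
        if (percent > lastPercent) {
            lastPercent = percent;
            sink.accept(new ProgressUpdate(jobId, percent, message));
        }
    }
}
```

A long-running task would call report() from its work loop, and whatever
sits behind the sink decides how to surface the update to users/admins.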

On Wed, Oct 13, 2021 at 1:16 AM Rohit Yadav <rohit.ya...@shapeblue.com>
wrote:

> Thanks for starting this discussion thread.
>
> I think you've hit on a fundamental RPC programming model issue in
> CloudStack wherein the communication from the management server (control
> plane) to agents is uni-directional and the management server does not
> know how to process a response outside of the immediate thread of context.
>
> This limitation is clearly visible and causes side effects for
> long-running Commands: if the control plane becomes unavailable or restarts
> before the Answer is sent back, the management server no longer has the
> thread of context, so any Answers sent back are ignored.
> Furthermore, when this happens agents get disconnected but continue to
> process all commands before reconnecting to the management server. This is
> a more serious problem for connected agents (such as KVM agents, ssvm/cpvm
> agents) than direct agents (those for VMware/XenServer etc.), as direct
> agents are killed/stopped with the management server. The general
> side effects include resources that were created but later ignored
> (for example, snapshots that then require manual cleanup).
>
> In the past I've had discussions with colleagues (both at work and in the
> community) and my recollection is this can be solved with: (brain dump of
> ideas and thoughts, some from old conversations, and some new)
>
>   *   Refactor long-running Commands: introduce a new child/abstract class
> or interface that separates normal Commands from long-running Commands, so
> that we know which commands are long-running and should have special
> handlers. (Off the top of my head, Commands that do any storage work, such
> as taking a snapshot, are long-running.)
>   *   Rolling-ownership: safely delegate ownership to another management
> server, passing along the context for handling an Answer for a set of
> long-running Commands (usually a Java method/class which is the handler,
> perhaps using DB + reflection)
>   *   Bi-directional communication, message-bus based handlers: just like
> we have the Command-Answer pattern, we perhaps need a new RPC mechanism that
> is bi-directional and secured (with the CA framework), where agents can
> announce streaming progress of some task (say, a template download) and
> also support long-running tasks/answers that aren't ignored when the
> control plane is unavailable.
>      *   I had some thoughts around having a plugin-framework based
> embedded locking service within CloudStack (so turnkey and doesn't require
> separate infra, brokers etc.) that implements both (a) a lock server
> (replacing the MySQL DB based GLOBAL_LOCK() too) and (b) a distributed
> message bus which can be used to store/update/delete/announce/queue tasks.
> This sort of locking/message bus framework can be implemented via plugins
> that are backed by, say, mysql/db, embedded zookeeper, or hazelcast. We
> had done a PoC as part of an internal hackathon in the past (
> https://github.com/shapeblue/cloudstack/tree/locking-service).
>      *   Maybe a more modern approach would be to look at how other
> projects are solving this problem, maybe explore other RPC frameworks such
> as gRPC.
>
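
A rough sketch of the first two bullets, to check my understanding: a marker
interface identifies long-running Commands, and a registry remembers which
handler should process the eventual Answer. All names here are hypothetical.
Command is a simplified stand-in rather than the real CloudStack class, and
the in-memory map stands in for a DB table that another management server
could read after taking over ownership.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for illustration; not the real CloudStack Command class.
abstract class Command { }

// Hypothetical marker for commands whose Answer may arrive much later.
interface LongRunning {
    String correlationId();
}

final class QuickCommand extends Command { }

final class SnapshotLikeCommand extends Command implements LongRunning {
    private final String correlationId;

    SnapshotLikeCommand(String correlationId) {
        this.correlationId = correlationId;
    }

    @Override
    public String correlationId() {
        return correlationId;
    }
}

// In-memory stand-in for a persistent table of pending Answers (assumption).
final class PendingAnswerRegistry {
    private final Map<String, String> handlerById = new ConcurrentHashMap<>();

    // Records the handler name only for long-running commands.
    boolean track(Command cmd, String handlerName) {
        if (cmd instanceof LongRunning) {
            handlerById.put(((LongRunning) cmd).correlationId(), handlerName);
            return true;
        }
        return false;
    }

    // Looks up and removes the handler, so a late Answer is processed at most once.
    Optional<String> claim(String correlationId) {
        return Optional.ofNullable(handlerById.remove(correlationId));
    }
}
```

With the handler name stored outside the sending thread, whichever
management server receives the late Answer could claim() it and resolve the
handler, for example by reflection as suggested above.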
> Regards.
>
> ________________________________
> From: Marcus <shadow...@gmail.com>
> Sent: Wednesday, October 6, 2021 22:36
> To: dev@cloudstack.apache.org <dev@cloudstack.apache.org>
> Subject: Re: KVM Agent disconnect hooks
>
> Thanks, Daan, I appreciate the feedback and I hope it is useful.  I wish I
> could comment more on the other hypervisors and how something similar might
> work for them.
>
> On Wed, Oct 6, 2021 at 2:49 AM Daan Hoogland <daan.hoogl...@gmail.com>
> wrote:
>
> > thanks Marcus,
> >
>
>
>
> > On Tue, Oct 5, 2021 at 7:32 PM Marcus <shadow...@gmail.com> wrote:
> >
> > > > Hi everyone! It's been awhile.  I've got a feature I'd love to get some
> > > > feedback on and contribute to the community, if it's acceptable.  I
> > > > need to brush up on the proper process (did read CONTRIBUTING.md).
> >
> > > Not a lot has changed. The technical discussions have tended to move to
> > > GitHub issues, but there is of course pushback on that as it doesn't
> > > comply with Apache bylaws.
> >
> > > > I should have
> > > > discussed this *before* implementation, for sure, but since this is
> > > > something I've already got I figured I'd use it to go through the
> > > > process and refresh myself on the latest.
> > >
> > > https://github.com/apache/cloudstack/pull/5552
> >
> > I like the concept you describe.
> >
> >
> > >
> > >
> > > > If you're familiar with the KVM agent, I needed to provide a way for
> > > > long-running jobs to be able to react and clean up their work when the
> > > > agent or management server is stopped, or they just lose connectivity
> > > > with each other.  Currently, if the management server is restarted
> > > > while the agent is working on something, the agent-side work could
> > > > continue on and complete, but the management server would fail the
> > > > job.  This is ok in many circumstances, but sometimes this can lead to
> > > > cruft like copied files that are never used.
> > >
> > as stated on the PR, it is not only a problem for KVM (but one thing at a
> > time)
> >
> >
> > >
> > > > I'm not entirely happy with this as there's potential for race,
> > > > particularly in de-registration of the hook, but it seems like a
> > > > reasonable start. It just requires the coder of the hook to understand
> > > > that a rollback could be attempted even if the bulk of their task has
> > > > completed, or not started yet, and account for that, whether they pass
> > > > a lock, or do a "try", or something else.
> > >
> > > reconnect could include a 'status of work' dialog, I imagine this is what
> > > you mean by "requires the coder .. to understand". If the MS forgot
> > > about a job, the agent will probably not and can send the information
> > > back.
> >
> > > let's see your work first, though (as in +1)
> >
> >
> > --
> > Daan
> >
>
