Thanks for starting this discussion thread.

I think you've hit on a fundamental RPC programming model issue in CloudStack:
communication from the management server (control plane) to agents is
uni-directional, and the management server does not know how to process a
response outside of the immediate thread of context.

This limitation is most visible with long-running Commands. If the control
plane becomes unavailable or restarts before an Answer has been sent back, the
management server no longer has the thread of context, so any Answer sent back
is ignored. Furthermore, when this happens agents get disconnected but continue
to process all commands before reconnecting to the management server. This is
a more serious problem for connected agents (such as KVM agents and ssvm/cpvm
agents) than for direct agents (those for VMware, XenServer, etc.), as direct
agents are killed/stopped along with the management server. The general
side-effect is resources that were created but later ignored (snapshots, for
example, which then require manual cleanup).

In the past I've had discussions with colleagues (both at work and in the
community), and my recollection is that this could be solved with the following
(a brain dump of ideas and thoughts, some from old conversations and some new):

  *   Refactor long-running Commands: introduce a new child/abstract class or
interface that separates normal Commands from long-running Commands, so that we
know which commands are long-running and should have special handlers. (Off the
top of my head, Commands that do any storage work, such as taking a snapshot,
are long-running.)
  *   Rolling-ownership: safely delegate ownership to another management server,
passing along the context for handling an Answer for a set of long-running
Commands (usually a Java method/class which is the handler, perhaps using DB +
reflection).
  *   Bi-directional communication, message-bus based handlers: just as we have
the Command-Answer pattern, we perhaps need a new RPC mechanism that is
bi-directional and secured (with the CA framework), where agents can announce
streaming progress of some task (say, a template download) and that also
supports long-running tasks/answers that aren't ignored when the control plane
is unavailable.
     *   I had some thoughts around having a plugin-framework based embedded
locking service within CloudStack (so it is turnkey and doesn't require
separate infra, brokers etc.) that implements both (a) a lock server (so it
could replace the MySQL DB based GLOBAL_LOCK() too) and (b) a distributed
message bus which can be used to store/update/delete/announce/queue tasks. This
sort of locking/message bus framework can be implemented via pluggable plugins
that are backed by, say, MySQL/DB, embedded ZooKeeper, or Hazelcast. We did a
PoC as part of an internal hackathon in the past
(https://github.com/shapeblue/cloudstack/tree/locking-service).
     *   Maybe a more modern approach would be to look at how other projects
are solving this problem, and perhaps explore other RPC frameworks such as gRPC.
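
To make the first two ideas a bit more concrete, here is a rough Java sketch of
what I have in mind. It is only an illustration under assumptions: every type
and name below (LongRunningCommand, AnswerDispatcher, handlerName, the
"snapshot" handler) is hypothetical and not an existing CloudStack class, and
the shared map merely stands in for a DB table that all management servers can
read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch; none of these types are CloudStack's real classes.
class LongRunningSketch {

    /** Simplified stand-in for an agent Command. */
    static class Command { }

    /** Simplified stand-in for an Answer, tagged with the job it belongs to. */
    static class Answer {
        final String jobId;
        final boolean success;
        Answer(String jobId, boolean success) {
            this.jobId = jobId;
            this.success = success;
        }
    }

    /** Marker interface: commands whose Answer may arrive after a restart. */
    interface LongRunningCommand {
        String jobId();
        String handlerName(); // persisted so any management server can find the handler
    }

    static class TakeSnapshotCommand extends Command implements LongRunningCommand {
        private final String jobId;
        TakeSnapshotCommand(String jobId) { this.jobId = jobId; }
        @Override public String jobId() { return jobId; }
        @Override public String handlerName() { return "snapshot"; }
    }

    /** One instance per management server; pendingJobs is the shared store. */
    static class AnswerDispatcher {
        private final Map<String, String> pendingJobs; // jobId -> handlerName
        private final Map<String, Consumer<Answer>> handlers = new ConcurrentHashMap<>();

        AnswerDispatcher(Map<String, String> pendingJobs) { this.pendingJobs = pendingJobs; }

        void registerHandler(String name, Consumer<Answer> handler) {
            handlers.put(name, handler);
        }

        /** Records the handler context only for long-running commands. */
        void send(Command cmd) {
            if (cmd instanceof LongRunningCommand) {
                LongRunningCommand lrc = (LongRunningCommand) cmd;
                pendingJobs.put(lrc.jobId(), lrc.handlerName());
            }
            // Actual transport to the agent is omitted in this sketch.
        }

        /** Returns true if the Answer was matched to a pending job and handled. */
        boolean onAnswer(Answer answer) {
            String name = pendingJobs.get(answer.jobId);
            Consumer<Answer> handler = (name == null) ? null : handlers.get(name);
            if (handler == null) {
                return false; // unknown job, or no handler registered on this server
            }
            handler.accept(answer);
            pendingJobs.remove(answer.jobId);
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> sharedStore = new ConcurrentHashMap<>();

        // Management server 1 sends the command, then goes away.
        AnswerDispatcher ms1 = new AnswerDispatcher(sharedStore);
        ms1.send(new TakeSnapshotCommand("job-1"));

        // Management server 2 takes over ownership and receives the late Answer.
        AnswerDispatcher ms2 = new AnswerDispatcher(sharedStore);
        ms2.registerHandler("snapshot",
                a -> System.out.println("snapshot job " + a.jobId + " success=" + a.success));
        System.out.println("handled by ms2: " + ms2.onAnswer(new Answer("job-1", true)));
    }
}
```

The point of storing a handler name rather than a callback is that the context
survives the process that sent the Command, which is what a thread-bound
callback cannot do. A real design would still need to deal with ownership
races between management servers and with Answers that arrive twice.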

Regards.

________________________________
From: Marcus <shadow...@gmail.com>
Sent: Wednesday, October 6, 2021 22:36
To: dev@cloudstack.apache.org <dev@cloudstack.apache.org>
Subject: Re: KVM Agent disconnect hooks

Thanks, Daan, I appreciate the feedback and I hope it is useful.  I wish I
could comment more on the other hypervisors and how something similar might
work for them.

On Wed, Oct 6, 2021 at 2:49 AM Daan Hoogland <daan.hoogl...@gmail.com>
wrote:

> thanks Marcus,
>

 

> On Tue, Oct 5, 2021 at 7:32 PM Marcus <shadow...@gmail.com> wrote:
>
> > Hi everyone! It's been awhile.  I've got a feature I'd love to get some
> > feedback on and contribute to the community, if it's acceptable.  I need to
> > brush up on the proper process (did read CONTRIBUTING.md).
>
> Not a lot has changed. The technical discussions have tended to move to
> github issues but there is of course a push back on that as it doesn't
> comply with apache bylaws.
>
> > I should have
> > discussed this *before* implementation, for sure, but since this is
> > something I've already got I figured I'd use it to go through the process
> > and refresh myself on the latest.
> >
> > https://github.com/apache/cloudstack/pull/5552
>
> I like the concept you describe.
>
>
> >
> >
> > If you're familiar with the KVM agent, I needed to provide a way for
> > long-running jobs to be able to react and clean up their work when the
> > agent or management server is stopped, or they just lose connectivity with
> > each other.  Currently, if the management server is restarted while the
> > agent is working on something, the agent-side work could continue on and
> > complete, but the management server would fail the job.  This is ok in many
> > circumstances, but sometimes this can lead to cruft like copied files that
> > are never used.
> >
> as stated on the PR, it is not only a problem for KVM (but one thing at a
> time)
>
>
> >
> > I'm not entirely happy with this as there's potential for race,
> > particularly in de-registration of the hook, but it seems like a reasonable
> > start. It just requires the coder of the hook to understand that a rollback
> > could be attempted even if the bulk of their task has completed, or not
> > started yet, and account for that, whether they pass a lock, or do a "try",
> > or something else.
> >
> reconnect could include a 'status of work' dialog, I imagine this is what
> you mean by "requires the coder .. to understand". If the MS forgot about a
> job, the agent will probably not and can send the information back.
>
> let's see your first work though first (as in +1)
>
>
> --
> Daan
>
