Thanks Marc and Rafael for replying.

In my experiments, when the agent disconnects it waits for the pending 
jobs/tasks to complete; on completion it creates an Answer instance and tries 
to send it using a `link` that no longer exists, and fails. This is the current 
behaviour. On the mgmt server side the resource/task is left hanging and 
may not be marked failed right away (perhaps only after the configured 
timeout). My best guess is that applying the change should not 
have any side effects other than the exceptions/faults we already observe.
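
To make that sequence concrete, here is a minimal Python sketch of it (not the actual agent code, which is Java). `Link`, `Answer`, `LinkClosedError` and `run_pending_task` are hypothetical names used only for illustration: the task finishes after the link has been closed, so sending the Answer fails and the mgmt server never learns the result.

```python
from dataclasses import dataclass


class LinkClosedError(Exception):
    """Raised when sending on a link whose connection is gone."""


@dataclass
class Answer:
    command: str
    success: bool


class Link:
    """Hypothetical stand-in for the agent's connection to the mgmt server."""

    def __init__(self):
        self.open = True
        self.delivered = []

    def close(self):
        self.open = False

    def send(self, answer):
        if not self.open:
            raise LinkClosedError("link to mgmt server no longer exists")
        self.delivered.append(answer)


def run_pending_task(link, command):
    """Finish the task, then try to report the result over the given link."""
    answer = Answer(command=command, success=True)  # the task itself succeeds
    try:
        link.send(answer)
        return "delivered"
    except LinkClosedError:
        # The work is done on the host, but the mgmt server is never told.
        return "lost"


link = Link()
link.close()  # mgmt server restarts while the task is still running
result = run_pending_task(link, "snapshot")
print(result)  # prints "lost"
```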


In my test, the failed async job did not get retried and I hit the famous 
'concurrency limit 1' issue. At this point, I had to manually clean up the 
snapshot row and the rows from sync_queue, sync_queue_item and async_job. On 
the agent side, the mgmt server sends a cmd and the agent returns an answer 
after processing it. We don't have the equivalent on the mgmt server side, 
where an agent would send a cmd's answer and the mgmt server would process it 
irrespective of the context. Therefore, unless the mgmt server receiving the 
answer is in the right thread/context/state, those answers are dropped.
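
As a rough illustration of why such answers end up dropped, here is a Python sketch based only on the behaviour described above (it is not CloudStack's actual code). `AnswerDispatcher`, `expect` and `on_answer` are hypothetical names: an answer is processed only if some caller on this mgmt server is still waiting for that sequence number.

```python
class AnswerDispatcher:
    """Hypothetical: routes agent answers to callers waiting on this server."""

    def __init__(self):
        self.waiting = {}   # sequence number -> callback of the waiting caller
        self.dropped = []   # answers nobody on this server was waiting for

    def expect(self, seq, callback):
        self.waiting[seq] = callback

    def on_answer(self, seq, answer):
        callback = self.waiting.pop(seq, None)
        if callback is None:
            # No thread/context is waiting, so the answer is discarded.
            self.dropped.append((seq, answer))
            return False
        callback(answer)
        return True


results = []
before_restart = AnswerDispatcher()
before_restart.expect(42, results.append)

# The mgmt server restarts: the in-memory wait state is gone.
after_restart = AnswerDispatcher()
handled = after_restart.on_answer(42, "snapshot done")
print(handled, after_restart.dropped)  # prints: False [(42, 'snapshot done')]
```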


I think we need to solve for (1) claim and ownership management of a resource 
(how to manage it when the owner mgmt server shuts down or dies), (2) task 
handover: passing executing (in-flight) tasks to another mgmt server when a 
mgmt server is shut down, (3) a central locking service for this and other 
uses. The bigger change ties in with the other things we've seen in the 
discussion around mgmt server restart/shutdown. Until we get to solving the 
bigger issue, perhaps we can provide some API/UI ways to show the root admin 
the async jobs in flight for a management server, or alert them; perhaps also 
an API to do a cleaner mgmt server shutdown that waits for all pending async 
jobs on a mgmt server to complete and does not take any new async job API 
requests (say, like Jenkins does with jobs)?
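
A minimal Python sketch of what such a drain-style shutdown could look like, purely as an illustration of the idea (nothing like this exists in CloudStack as far as this thread says). `DrainingJobManager`, `pending` and `prepare_for_shutdown` are hypothetical names: once shutdown is requested, new jobs are rejected and the call blocks until the in-flight ones finish.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DrainingJobManager:
    """Hypothetical async job manager with a drain-style shutdown."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._lock = threading.Lock()
        self._accepting = True
        self._in_flight = []

    def submit(self, fn, *args):
        with self._lock:
            if not self._accepting:
                raise RuntimeError("shutting down: not accepting new async jobs")
            future = self._pool.submit(fn, *args)
            self._in_flight.append(future)
            return future

    def pending(self):
        """Jobs that have not finished yet (what an admin API could list)."""
        with self._lock:
            return [f for f in self._in_flight if not f.done()]

    def prepare_for_shutdown(self):
        """Stop taking new jobs, then block until in-flight ones complete."""
        with self._lock:
            self._accepting = False
        self._pool.shutdown(wait=True)


manager = DrainingJobManager()
job = manager.submit(lambda x: x * 2, 21)
manager.prepare_for_shutdown()
print(job.result(), len(manager.pending()))  # prints: 42 0
```

The `pending()` list is the kind of data an API/UI view of in-flight jobs per management server could expose.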


Marc - weren't you working on a ZooKeeper-based rolling shutdown/restart? Did 
that handle some of the failure cases?


- Rohit

<https://cloudstack.apache.org>



________________________________
From: Marc-Aurèle Brothier <ma...@exoscale.ch>
Sent: Monday, May 14, 2018 4:06:56 PM
To: dev@cloudstack.apache.org
Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt 
server) disconnection?

Hi,

I'm also for a bigger change but this PR already moves forward to a better
agent <-> management connection handling.

@rhtyd did you test your PR manually by, for example, requesting a long
snapshot operation and disconnecting the agent?

I have one concern here: when an async job is taken from the DB by a
management server (in a cluster configuration), the mgmt ID is put in the
row to tell which mgmt is managing the job. On disconnection from an agent,
the event is propagated and the job is marked as failed in the database, and
an error is returned in the API for that command. Here we are only making
the agent reconnect quickly, but I'm unsure of what will
happen in the mgmt when the job response is received by a mgmt (which might
be another one than the one registered in the job db row). I know this is
where it becomes complicated, because one async job might be only one part of
a bigger scenario for a command (like a live migration). I just want to
ensure it won't propagate further inconsistency.

Marco

On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <
rafaelweingart...@gmail.com> wrote:

> Would prefer “A bigger design fix would be to make management server
> asynchronous of agent side answer/response handling”. However, I understand
> the volume of changes that requires.
>
> I looked at the PR, and I think that everything is ok there. Of course, I
> think we might need some more time to review and think about the possible
> outcomes of such changes.
>
> On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.ya...@shapeblue.com>
> wrote:
>
> > All,
> >
> >
> > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from the
> > management server (say due to mgmt server restart etc), the reconnection
> > logic waits for any pending tasks/commands to complete before reconnection
> > attempts are made. I tried to search git history but could not find a
> > reason, can anyone share why we may need this?
> >
> >
> > Based on the reported issue:
> >
> > https://github.com/apache/cloudstack/issues/2633
> >
> >
> > I've a working patch which removes this limitation:
> >
> > https://github.com/apache/cloudstack/pull/2638
> >
> >
> > From testing with various combinations of tasks, I found that when that
> > happens even if the pending task succeeds it fails to send an Answer to the
> > mgmt server, therefore from the control plane's perspective that task is
> > still pending/on-going.
> >
> >
> > When the mgmt server comes back online, and the agent finally reconnects
> > (depending on how long the pending task took) the executed operation is
> > still pending in mgmt server's view and may sometimes require manual
> > cleanups in database. By removing the limitation in above PR, at least the
> > agent reconnects faster while the failure/fault behaviours remain the same.
> > A bigger design fix would be to make management server asynchronous of
> > agent side answer/response handling.
> >
> >
> > - Rohit
> >
> > <https://cloudstack.apache.org>
> >
> >
> >
> > rohit.ya...@shapeblue.com
> > www.shapeblue.com<http://www.shapeblue.com>
> > 53 Chandos Place, Covent Garden, London  WC2N 4HS UK
> > @shapeblue
> >
> >
> >
> >
>
>
> --
> Rafael Weingärtner
>
