Hi Rohit,

When the management server and the agent are both up and running and there is a network failure, I think it is better to wait for some time for the pending tasks to complete, instead of failing them and trying to reconnect. If the network delay is minimal, there can still be a valid thread/context in the management server to handle the answers.
It would be great if the changes in this PR have no major side-effects.

Thanks,
Suresh

On Wed, May 16, 2018 at 3:40 PM, Rohit Yadav <rohit.ya...@shapeblue.com> wrote:

> All,
>
> Based on testing against KVM, XenServer and VMware and this discussion, I've merged the PR based on code reviews and tests. I investigated both code-wise and against a live environment for possible side-effects of letting the agent connect without being blocked on pending tasks, and I found no new fault behaviour.
>
> If there are any objections or bugs, please share, in which case we'll revert the change to continue the legacy/historic behaviour. Thanks.
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
> ________________________________
> From: Rohit Yadav <rohit.ya...@shapeblue.com>
> Sent: Tuesday, May 15, 2018 2:37:58 PM
> To: dev@cloudstack.apache.org
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
>
> Hi Suresh,
>
> I've replied to your comment on the PR. In addition, when (i) the management server is restarted, any pending operation on the KVM/SSVM agent side will fail to be communicated back in the correct thread/context, and it depends on the specific feature whether it supports a sync or cleanup mechanism; in most cases, the async/job timeout may kick in or cause the queue/concurrent failure seen in logs. When (ii) the agent is reconnected, it reconnects only after any pending job finishes; therefore such jobs finish and fail to be communicated back to the mgmt server (the Answer instance fails to be sent on the link, as the link is no longer valid, which causes an exception).
>
> - Rohit
>
> ________________________________
> From: Suresh Kumar Anaparti <sureshkumar.anapa...@gmail.com>
> Sent: Tuesday, May 15, 2018 12:06:14 AM
> To: dev@cloudstack.apache.org
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
>
> Hi,
>
> @rhtyd, I checked the PR changes. Good that the agent is not waiting for the pending jobs and retries the connection to the management server. This might have an impact on SSVM and KVM agent tasks, not much on CPVM. Is there any sync or cleanup mechanism for volumes/VMs to address the failed/pending agent jobs after (i) management server restart and (ii) agent reconnection?
>
> -Suresh
>
> On Mon, May 14, 2018 at 8:05 PM, Marc-Aurèle Brothier <ma...@exoscale.ch> wrote:
>
> > Correct about the thread context, so if the answer is coming into a management server that doesn't have the context and drops it, it should be fine then. The PR is then already a good improvement to let the agent reconnect even when it's processing a long-running request, so it can keep on completing other jobs too.
> >
> > Regarding the restart/shutdown operation, yes, I have to push the changes now to be able to stop some processing tasks (mainly fetching new async jobs) on a management server to ensure a cleaner shutdown. My solution, as said, is based on the content of a file that is compatible with HAProxy, thus not the LB mechanism added recently in CS. It could be changed to an API call to put a management server into, or move it out of, maintenance. The listManagementServers API call has been merged and it was a requirement for that.
> >
> > About Zookeeper, it's not on the rolling shutdown/restart for now. We are using it as an efficient and true lock mechanism between multiple management servers. We are slowly moving the locks code towards ZK and added one during the allocation phase to ensure no host would be over-allocated. I will take this discussion to another email thread since I have a few questions regarding ZK and also wish to talk about the connection between the agent & management servers.
> >
> > On Mon, May 14, 2018 at 2:39 PM, Rohit Yadav <rohit.ya...@shapeblue.com> wrote:
> >
> > > Thanks Marc and Rafael for replying.
> > >
> > > In my experimentation, when the agent disconnects it will wait for the pending jobs/tasks to complete, and on completion it creates an Answer instance and tries to send it using a `link` which no longer exists, and fails. This is the current behaviour; on the mgmt server side the resource/task will be left hanging and may not be automatically marked failed right away (maybe after the configured timeout). My best guess is that applying the change should not have any side-effects, other than the exceptions/faults we already observe.
> > >
> > > In my test, the failed async job did not get retried and I hit the famous 'concurrency limit 1' issue. At this point, I had to manually clean up the snapshot row and the rows from sync_queue, sync_queue_item and async_job. In the current implementation on the agent side, the mgmt server sends a cmd and the agent returns an answer after processing it -- we don't have the same for the mgmt server, where an agent sends a cmd's answer and the mgmt server processes it irrespective of the context. Therefore, unless the mgmt server receiving the answer is in the right thread/context/state, those answers are dropped.
> > >
> > > I think we need to solve for (1) claim and ownership management of a resource (how to manage when the owner/mgmt server shuts down or dies), (2) task handover - handing executing (in-flight) tasks over to another mgmt server when a mgmt server is shut down, (3) a central locking service for this and other uses. The bigger change ties in with the other things we've seen in the discussion around mgmt server restart/shutdown.
> > > Until we get to solving the bigger issue, perhaps we can provide some API/visual/UI ways to show the root admin the async jobs in flight for a management server, or alert him; perhaps an API to do a cleaner mgmt server shutdown that waits for all pending async jobs on a mgmt server to complete and does not take any new async/job API requests (say, like Jenkins does with jobs)?
> > >
> > > Marc - weren't you working on a Zookeeper-based rolling shutdown/restart? Did that handle some of the failure cases?
> > >
> > > - Rohit
> > >
> > > ________________________________
> > > From: Marc-Aurèle Brothier <ma...@exoscale.ch>
> > > Sent: Monday, May 14, 2018 4:06:56 PM
> > > To: dev@cloudstack.apache.org
> > > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
> > >
> > > Hi,
> > >
> > > I'm also for a bigger change, but this PR already moves forward to better agent <-> management connection handling.
> > >
> > > @rhtyd did you test your PR manually by, for example, requesting a long snapshot operation and disconnecting the agent?
> > >
> > > I have one concern here: when an async job is taken from the DB by a management server (in a cluster configuration), the mgmt ID is put in the row to tell which mgmt is managing the job. On disconnection from an agent, the event is propagated and the job is marked as failed in the database, and an error is returned in the API for that command. Here we are only resolving the fact to let the agent reconnect quickly, but I'm unsure of what will happen in the mgmt when the job response is received by a mgmt (which might be another one than the one registered in the job db row).
> > > I know this is where it becomes complicated, because one async job might be only one part of a bigger scenario for a command (like a live migration). I just want to ensure it won't propagate further inconsistency.
> > >
> > > Marco
> > >
> > > On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:
> > >
> > > > Would prefer “A bigger design fix would be to make management server asynchronous of agent side answer/response handling”. However, I understand the volume of changes that requires.
> > > >
> > > > I looked at the PR, and I think that everything is ok there. Of course, I think we might need some more time to review and think about the possible outcomes of such changes.
> > > >
> > > > On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.ya...@shapeblue.com> wrote:
> > > >
> > > > > All,
> > > > >
> > > > > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from the management server (say due to mgmt server restart etc.), the reconnection logic waits for any pending tasks/commands to complete before reconnection attempts are made. I tried to search the git history but could not find a reason; can anyone share why we may need this?
> > > > >
> > > > > Based on the reported issue:
> > > > >
> > > > > https://github.com/apache/cloudstack/issues/2633
> > > > >
> > > > > I've a working patch which removes this limitation:
> > > > >
> > > > > https://github.com/apache/cloudstack/pull/2638
> > > > >
> > > > > From testing with various combinations of tasks, I found that when that happens, even if the pending task succeeds it fails to send an Answer to the mgmt server; therefore, from the control plane's perspective, that task is still pending/on-going.
> > > > >
> > > > > When the mgmt server comes back online and the agent finally reconnects (depending on how long the pending task took), the executed operation is still pending in the mgmt server's view and may sometimes require manual cleanups in the database. By removing the limitation in the above PR, at least the agent reconnects faster, while the failure/fault behaviours remain the same. A bigger design fix would be to make management server asynchronous of agent side answer/response handling.
> > > > >
> > > > > - Rohit
> > > > >
> > > > > rohit.ya...@shapeblue.com
> > > > > www.shapeblue.com<http://www.shapeblue.com>
> > > > > 53 Chandos Place, Covent Garden, London WC2N 4HS UK
> > > > > @shapeblue
> > > >
> > > > --
> > > > Rafael Weingärtner
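
The cleaner-shutdown idea raised in the thread (stop taking new async jobs, then wait for the in-flight ones, as Jenkins does) could be sketched roughly as follows. This is only an illustration in plain Java under stated assumptions: `DrainGate` and its methods are hypothetical names, not CloudStack classes or APIs, and the simple polling wait is a choice made to keep the sketch short.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a CloudStack class): admits jobs until a drain is requested. */
final class DrainGate {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicBoolean draining = new AtomicBoolean(false);

    /** Call before starting a job; returns false once draining has begun. */
    boolean tryBegin() {
        if (draining.get()) {
            return false;
        }
        inFlight.incrementAndGet();
        // Re-check: a drain may have started between the check and the increment.
        if (draining.get()) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    /** Call when a job admitted by tryBegin() finishes (successfully or not). */
    void end() {
        inFlight.decrementAndGet();
    }

    /**
     * Stops admitting new jobs and waits up to timeoutMillis for in-flight
     * jobs to finish. Returns true if nothing is left in flight.
     */
    boolean drain(long timeoutMillis) {
        draining.set(true);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (inFlight.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out with jobs still running
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Number of jobs currently in flight. */
    int inFlightCount() {
        return inFlight.get();
    }
}
```

In such a design, a job dispatcher would call `tryBegin()` before picking up a job and `end()` in a `finally` block, and a shutdown hook would call `drain(...)` with a timeout; a `false` result would mean some jobs are still running and need handover or cleanup, which is the larger problem the thread discusses.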