Hi, I'm also in favour of a bigger change, but this PR already moves us toward better agent <-> management connection handling.
@rhtyd, did you test your PR manually, for example by requesting a long snapshot operation and then disconnecting the agent?

I have one concern here: when an async job is taken from the DB by a management server (in a cluster configuration), the mgmt ID is put in the row to tell which mgmt server is managing the job. On disconnection from an agent, the event is propagated, the job is marked as failed in the database, and an error is returned in the API for that command. Here we are only making the agent reconnect quickly, but I'm unsure what will happen when the job response is received by a mgmt server (which might be a different one from the one registered in the job's DB row).

I know this is where it becomes complicated, because one async job might be only one part of a bigger scenario for a command (like a live migration). I just want to ensure it won't propagate further inconsistency.

Marco

On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

> Would prefer “A bigger design fix would be to make management server
> asynchronous of agent side answer/response handling”. However, I understand
> the volume of changes that requires.
>
> I looked at the PR, and I think that everything is ok there. Of course, I
> think we might need some more time to review and think about the possible
> outcomes of such changes.
>
> On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.ya...@shapeblue.com>
> wrote:
>
> > All,
> >
> > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from the
> > management server (say due to mgmt server restart etc), the reconnection
> > logic waits for any pending tasks/commands to complete before reconnection
> > attempts are made. I tried to search git history but could not find a
> > reason, can anyone share why we may need this?
> >
> > Based on the reported issue:
> >
> > https://github.com/apache/cloudstack/issues/2633
> >
> > I've a working patch which removes this limitation:
> >
> > https://github.com/apache/cloudstack/pull/2638
> >
> > From testing with various combinations of tasks, I found that when that
> > happens even if the pending task succeeds it fails to send an Answer to the
> > mgmt server, therefore from the control plane's perspective that task is
> > still pending/on-going.
> >
> > When the mgmt server comes back online, and the agent finally reconnects
> > (pending on how long the pending task took) the executed operation is still
> > pending in mgmt server's view and may sometimes require manual cleanups in
> > database. By removing the limitation in above PR, at least the agent
> > reconnects faster while of the failure/fault behaviours remain the same. A
> > bigger design fix would be to make management server asynchronous of agent
> > side answer/response handling.
> >
> > - Rohit
> >
> > <https://cloudstack.apache.org>
> >
> > rohit.ya...@shapeblue.com
> > www.shapeblue.com
> > 53 Chandos Place, Covent Garden, London WC2N 4HSUK
> > @shapeblue
>
> --
> Rafael Weingärtner
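
To make the concern about late job responses concrete, here is a minimal sketch of the scenario from the top of this mail. It is not CloudStack code: `JobTable`, `AsyncJob`, `owner_msid`, `on_agent_disconnect` and `on_answer` are hypothetical names, and the policy of ignoring an answer for a job that is already final (or that arrives at a non-owning mgmt server) is an assumption made only to illustrate one possible behaviour.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class JobStatus(Enum):
    IN_PROGRESS = "in_progress"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class AsyncJob:
    job_id: int
    owner_msid: int  # hypothetical: ID of the mgmt server that took the job
    agent_id: int    # hypothetical: agent executing the command
    status: JobStatus = JobStatus.IN_PROGRESS


class JobTable:
    """Stand-in for the job table shared by clustered mgmt servers."""

    def __init__(self) -> None:
        self.rows: Dict[int, AsyncJob] = {}

    def take(self, job_id: int, msid: int, agent_id: int) -> AsyncJob:
        # A mgmt server picks up the job and records itself as the owner.
        job = AsyncJob(job_id, msid, agent_id)
        self.rows[job_id] = job
        return job

    def on_agent_disconnect(self, agent_id: int) -> None:
        # The disconnect event marks every in-progress job of that agent as failed.
        for job in self.rows.values():
            if job.agent_id == agent_id and job.status is JobStatus.IN_PROGRESS:
                job.status = JobStatus.FAILED

    def on_answer(self, job_id: int, receiving_msid: int, success: bool) -> str:
        # Assumed policy: never overwrite a final state, and only the owner applies.
        job = self.rows[job_id]
        if job.status is not JobStatus.IN_PROGRESS:
            return "ignored: job already final"
        if receiving_msid != job.owner_msid:
            return "ignored: not the owning mgmt server"
        job.status = JobStatus.SUCCEEDED if success else JobStatus.FAILED
        return "applied"


# Scenario: mgmt 100 takes a snapshot job on agent 7, the agent disconnects,
# reconnects to mgmt 200, and then reports that the snapshot succeeded.
table = JobTable()
table.take(job_id=1, msid=100, agent_id=7)
table.on_agent_disconnect(agent_id=7)
outcome = table.on_answer(job_id=1, receiving_msid=200, success=True)
```

Under these assumptions the late answer is dropped and the row stays failed, even though the operation actually completed on the host. That is the kind of inconsistency worth checking for when testing the PR by hand.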