Thanks Marcus,

On Tue, Oct 5, 2021 at 7:32 PM Marcus <shadow...@gmail.com> wrote:
> Hi everyone! It's been awhile. I've got a feature I'd love to get some
> feedback on and contribute to the community, if it's acceptable. I need to
> brush up on the proper process (did read CONTRIBUTING.md).

Not a lot has changed. The technical discussions have tended to move to GitHub issues, but there is of course pushback on that, as it doesn't comply with the Apache bylaws.

> I should have
> discussed this *before* implementation, for sure, but since this is
> something I've already got I figured I'd use it to go through the process
> and refresh myself on the latest.
>
> https://github.com/apache/cloudstack/pull/5552

I like the concept you describe.

> If you're familiar with the KVM agent, I needed to provide a way for
> long-running jobs to be able to react and clean up their work when the
> agent or management server is stopped, or they just lose connectivity with
> each other. Currently, if the management server is restarted while the
> agent is working on something, the agent-side work could continue on and
> complete, but the management server would fail the job. This is ok in many
> circumstances, but sometimes this can lead to cruft like copied files that
> are never used.

As stated on the PR, this is not only a problem for KVM (but one thing at a time).

> I'm not entirely happy with this as there's potential for race,
> particularly in de-registration of the hook, but it seems like a reasonable
> start. It just requires the coder of the hook to understand that a rollback
> could be attempted even if the bulk of their task has completed, or not
> started yet, and account for that, whether they pass a lock, or do a "try",
> or something else.

A reconnect could include a 'status of work' dialog; I imagine this is what you mean by "requires the coder .. to understand". If the MS has forgotten about a job, the agent probably has not, and can send the information back.

Let's see your first work first, though (as in +1).

--
Daan
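
PS: to make the "account for that" part concrete, here is a minimal sketch of a hook that tolerates a rollback arriving before the task has started or after it has completed. It is not taken from the PR and does not use any CloudStack API. Every name in it (DisconnectHookSketch, HookRegistry, CopyTask, State) is hypothetical, and it assumes the "try" approach: a compare-and-set on the task state, so whichever side loses the race simply does nothing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: none of these names come from CloudStack or the PR.
class DisconnectHookSketch {

    enum State { NOT_STARTED, RUNNING, COMPLETED, ROLLED_BACK }

    /** A long-running job whose rollback hook is safe to call at any point. */
    static class CopyTask implements Runnable {
        final AtomicReference<State> state = new AtomicReference<>(State.NOT_STARTED);
        volatile boolean cleanedUp = false;

        /** Returns false if a rollback already cancelled the task. */
        boolean start() {
            return state.compareAndSet(State.NOT_STARTED, State.RUNNING);
        }

        /** Returns false if a rollback won the race against completion. */
        boolean complete() {
            return state.compareAndSet(State.RUNNING, State.COMPLETED);
        }

        /** The rollback hook body, invoked when the connection is lost. */
        @Override
        public void run() {
            // Not started yet: nothing to clean up, but block a later start().
            if (state.compareAndSet(State.NOT_STARTED, State.ROLLED_BACK)) {
                return;
            }
            // Only a task that is still RUNNING has partial work to remove.
            if (state.compareAndSet(State.RUNNING, State.ROLLED_BACK)) {
                cleanedUp = true; // placeholder for deleting partially copied files
            }
            // COMPLETED or already ROLLED_BACK: deliberately do nothing.
        }
    }

    /** Holds one hook per job id; each hook runs at most once. */
    static class HookRegistry {
        private final Map<String, Runnable> hooks = new ConcurrentHashMap<>();

        void register(String jobId, Runnable hook) {
            hooks.put(jobId, hook);
        }

        /** Returns true if the hook was still registered and is now removed. */
        boolean deregister(String jobId) {
            return hooks.remove(jobId) != null;
        }

        /** Runs every hook still registered and returns how many ran. */
        int onDisconnect() {
            int ran = 0;
            for (String jobId : hooks.keySet()) {
                // remove() is atomic, so a concurrent deregister() and this
                // loop can never both claim the same hook.
                Runnable hook = hooks.remove(jobId);
                if (hook != null) {
                    hook.run();
                    ran++;
                }
            }
            return ran;
        }
    }
}
```

The registry only guarantees that a hook is claimed once. It cannot know whether the job had already finished when the disconnect arrived, which is why the state check lives in the hook itself. Passing a lock instead would work too; the compare-and-set just keeps the sketch short.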