Thanks, Daan, I appreciate the feedback and I hope it is useful. I wish I could comment more on the other hypervisors and how something similar might work for them.

On Wed, Oct 6, 2021 at 2:49 AM Daan Hoogland <daan.hoogl...@gmail.com> wrote:

> thanks Marcus,
>
> On Tue, Oct 5, 2021 at 7:32 PM Marcus <shadow...@gmail.com> wrote:
>
> > Hi everyone! It's been awhile. I've got a feature I'd love to get some
> > feedback on and contribute to the community, if it's acceptable. I need
> > to brush up on the proper process (did read CONTRIBUTING.md).
>
> Not a lot has changed. The technical discussions have tended to move to
> github issues but there is of course a push back on that as it doesn't
> comply with apache bylaws.
>
> > I should have discussed this *before* implementation, for sure, but
> > since this is something I've already got I figured I'd use it to go
> > through the process and refresh myself on the latest.
> >
> > https://github.com/apache/cloudstack/pull/5552
>
> I like the concept you describe.
>
> > If you're familiar with the KVM agent, I needed to provide a way for
> > long-running jobs to be able to react and clean up their work when the
> > agent or management server is stopped, or they just lose connectivity
> > with each other. Currently, if the management server is restarted while
> > the agent is working on something, the agent-side work could continue on
> > and complete, but the management server would fail the job. This is ok
> > in many circumstances, but sometimes this can lead to cruft like copied
> > files that are never used.
>
> as stated on the PR, it is not only a problem for KVM (but one thing at a
> time)
>
> > I'm not entirely happy with this as there's potential for race,
> > particularly in de-registration of the hook, but it seems like a
> > reasonable start. It just requires the coder of the hook to understand
> > that a rollback could be attempted even if the bulk of their task has
> > completed, or not started yet, and account for that, whether they pass a
> > lock, or do a "try", or something else.
>
> reconnect could include a 'status of work' dialog, I imagine this is what
> you mean by "requires the coder .. to understand". If the MS forgot about a
> job, the agent will probably not and can send the information back.
>
> let's see your first work though first (as in +1)
>
> --
> Daan
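
For anyone following the thread without the PR open, here is a rough sketch of the idea discussed above: a long-running agent task registers a rollback hook, and a disconnect handler runs whatever hooks are still pending. This is not the code from the PR. Every name here (DisconnectHookSketch, RollbackHook, HookRegistry, markCompleted, runIfPending, onDisconnect) is hypothetical and chosen only for illustration. The atomic flag is one possible way for the hook author to handle the race mentioned above, where a rollback is attempted just as the task finishes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch only; none of these names come from the CloudStack code base. */
public class DisconnectHookSketch {

    /** Wraps a rollback so that either the task commits or the rollback runs, never both. */
    static final class RollbackHook {
        private final AtomicBoolean settled = new AtomicBoolean(false);
        private final Runnable rollback;

        RollbackHook(Runnable rollback) {
            this.rollback = rollback;
        }

        /** Called by the task once its work is committed; false means the rollback got there first. */
        boolean markCompleted() {
            return settled.compareAndSet(false, true);
        }

        /** Called on disconnect; runs the rollback only if the task has not committed. */
        boolean runIfPending() {
            if (settled.compareAndSet(false, true)) {
                rollback.run();
                return true;
            }
            return false;
        }
    }

    /** Holds the hooks of tasks that are currently in flight. */
    static final class HookRegistry {
        private final Map<Long, RollbackHook> hooks = new ConcurrentHashMap<>();
        private final AtomicLong nextId = new AtomicLong();

        long register(RollbackHook hook) {
            long id = nextId.incrementAndGet();
            hooks.put(id, hook);
            return id;
        }

        void deregister(long id) {
            hooks.remove(id);
        }

        /** Invoked when the agent stops or loses its management server; returns how many rollbacks ran. */
        int onDisconnect() {
            int ran = 0;
            for (Long id : hooks.keySet()) {
                // remove() hands each hook to exactly one caller, so a hook is never run twice.
                RollbackHook hook = hooks.remove(id);
                if (hook != null && hook.runIfPending()) {
                    ran++;
                }
            }
            return ran;
        }
    }

    public static void main(String[] args) {
        HookRegistry registry = new HookRegistry();
        AtomicInteger cleanups = new AtomicInteger();

        // Task A commits before the disconnect, so its rollback must be skipped.
        RollbackHook taskA = new RollbackHook(cleanups::incrementAndGet);
        registry.register(taskA);
        taskA.markCompleted();

        // Task B is still copying files when the connection drops.
        RollbackHook taskB = new RollbackHook(cleanups::incrementAndGet);
        registry.register(taskB);

        int ran = registry.onDisconnect();
        System.out.println("rollbacks run: " + ran + ", cleanups: " + cleanups.get());
    }
}
```

In this sketch a task that loses the race sees markCompleted() return false and knows its output has already been cleaned up. A rollback that fires before the task has started still has to be written to tolerate missing files, which is the "do a try" option from the thread.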