Hi everyone! It's been a while. I've got a feature I'd love to get some feedback on and contribute to the community, if it's acceptable. I need to brush up on the proper process (I did read CONTRIBUTING.md). I should have discussed this *before* implementation, for sure, but since this is something I've already got, I figured I'd use it to go through the process and refresh myself on the latest.
https://github.com/apache/cloudstack/pull/5552

If you're familiar with the KVM agent: I needed a way for long-running jobs to react and clean up their work when the agent or management server is stopped, or when the two simply lose connectivity with each other. Currently, if the management server is restarted while the agent is working on something, the agent-side work can continue and complete, but the management server will fail the job. This is ok in many circumstances, but sometimes it leaves cruft behind, like copied files that are never used.

I'm not entirely happy with this as there's potential for a race, particularly in de-registration of the hook, but it seems like a reasonable start. It just requires the coder of the hook to understand that a rollback could be attempted even if the bulk of their task has already completed, or has not started yet, and to account for that, whether they pass a lock, or do a "try", or something else.
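
To make the "rollback may fire before the task starts or after it completes" point concrete, here is a rough sketch of one way a hook author could guard against it. To be clear, this is NOT the code or the API from the PR: DisconnectHook, HookRegistry, CopyTask and all the method names are made up for illustration, and I'm assuming a simple compare-and-set state machine in place of a lock or a "try".

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class DisconnectHookSketch {

    /** Hypothetical hook interface (an assumption, not the PR's API). */
    interface DisconnectHook {
        void onDisconnect();
    }

    /** Hypothetical registry the agent would fire on stop or lost connection. */
    static class HookRegistry {
        private final Set<DisconnectHook> hooks = ConcurrentHashMap.newKeySet();

        void register(DisconnectHook hook) { hooks.add(hook); }

        void deregister(DisconnectHook hook) { hooks.remove(hook); }

        void fireAll() {
            // Iteration over a concurrent set tolerates concurrent deregistration.
            for (DisconnectHook hook : hooks) {
                hook.onDisconnect();
            }
        }
    }

    enum State { NOT_STARTED, RUNNING, COMPLETED, ROLLED_BACK }

    /** Hypothetical long-running task whose rollback is safe in any state. */
    static class CopyTask implements DisconnectHook {
        final AtomicReference<State> state = new AtomicReference<>(State.NOT_STARTED);
        final AtomicInteger cleanups = new AtomicInteger();

        /** Returns false if a rollback already won the race. */
        boolean start() {
            return state.compareAndSet(State.NOT_STARTED, State.RUNNING);
        }

        /** Returns false if the task was rolled back while running. */
        boolean complete() {
            return state.compareAndSet(State.RUNNING, State.COMPLETED);
        }

        @Override
        public void onDisconnect() {
            if (state.compareAndSet(State.RUNNING, State.ROLLED_BACK)) {
                // Work was in flight: this is where partial files would be removed.
                cleanups.incrementAndGet();
            } else {
                // Not started yet: mark it so start() refuses to run later.
                // If the task already completed, this is a no-op.
                state.compareAndSet(State.NOT_STARTED, State.ROLLED_BACK);
            }
        }
    }

    public static void main(String[] args) {
        HookRegistry registry = new HookRegistry();
        CopyTask task = new CopyTask();
        registry.register(task);
        task.start();
        registry.fireAll();   // simulated disconnect while the task is running
        registry.fireAll();   // a second firing must not clean up twice
        registry.deregister(task);
        System.out.println(task.state.get() + " cleanups=" + task.cleanups.get());
    }
}
```

With this shape, a hook that fires after complete() but before deregister() does nothing, and a hook that fires before start() just prevents the task from starting. Whether that is the right trade-off for a given command is up to whoever writes the hook.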