Hi; downstream we've run into an issue where, in VMs under heavy load
with many concurrent block jobs, a job might occasionally flicker into
the STANDBY state, during which time it will be unable to receive
JOB COMPLETE commands. I assume this flicker is due to
child_job_drained_begin().
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1945635
It's safe to just retry the operation, but at the application level it
may be difficult to understand WHY the job is paused, since the flush
event may be asynchronous and unpredictable.
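
For reference, the retry workaround amounts to something like the
following on the client side. This is only a sketch in Python:
JobNotReady, FlickeringJob and complete_with_retry are made-up names
for illustration, and no real QMP library is involved.

```python
import time


class JobNotReady(Exception):
    """Hypothetical error: the job rejected COMPLETE (e.g. it was in STANDBY)."""


def complete_with_retry(send_complete, attempts=5, delay=0.0):
    """Call send_complete() until it succeeds or the attempts run out.

    Returns the number of attempts used. If every attempt is rejected,
    the last JobNotReady is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            send_complete()
            return attempt
        except JobNotReady:
            if attempt == attempts:
                raise
            time.sleep(delay)  # wait before trying again


class FlickeringJob:
    """Toy job that rejects COMPLETE for its first `standby_calls` requests."""

    def __init__(self, standby_calls):
        self.standby_calls = standby_calls
        self.completed = False

    def complete(self):
        if self.standby_calls > 0:
            self.standby_calls -= 1
            raise JobNotReady("job is in STANDBY")
        self.completed = True
```

Note that the caller only ever learns "not ready, try again"; it gets
no hint about why the job was paused or for how long, which is the
part that worries me.
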
We could define a transition to allow COMPLETE to be applied to STANDBY
jobs, but is there any risk or drawback to doing so? On QMP's side, we
do know the difference between a temporary pause and a user pause/error
pause (both of the latter set the user_paused flag).
I imagine it's safe to continue rejecting COMPLETE commands if
user_paused is set ("No, go fix this first!") and we could define a
pathway only for jobs that entered STANDBY implicitly. However, in this
case, we don't really know how long STANDBY will last. Do we have the
ability to
easily accept an async "intent" to complete a job without tying up the
monitor?
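
To make the idea concrete, here is a toy Python model of the policy,
not QEMU code. ToyJob, pause_count, complete_pending and the method
names are all invented for illustration; only user_paused and the
READY/STANDBY states come from the discussion above. The intent is
recorded while the job is implicitly paused and acted on only once the
last pause is lifted, so nothing runs in the middle of a drain.

```python
class ToyJob:
    """Toy model of a job that can defer a COMPLETE request."""

    def __init__(self):
        self.status = "READY"
        self.user_paused = False       # set by a user pause or an error pause
        self.pause_count = 0           # outstanding pauses of any kind
        self.complete_pending = False  # recorded "intent" to complete
        self.completing = False        # completion has actually started

    def _pause(self):
        self.pause_count += 1
        self.status = "STANDBY"

    def _resume(self):
        self.pause_count -= 1
        if self.pause_count == 0:
            self.status = "READY"
            if self.complete_pending:
                # Apply the deferred intent now that no pause remains.
                self.complete_pending = False
                self.completing = True

    def drain_begin(self):
        self._pause()  # implicit, temporary pause

    def drain_end(self):
        self._resume()

    def user_pause(self):
        self.user_paused = True
        self._pause()

    def user_resume(self):
        self.user_paused = False
        self._resume()

    def request_complete(self):
        if self.user_paused:
            # "No, go fix this first!"
            raise ValueError("job is paused by the user or by an error")
        if self.status == "STANDBY":
            self.complete_pending = True  # return at once, act later
            return "deferred"
        self.completing = True
        return "started"
```

The request returns immediately in every case, so the monitor is never
tied up; whether real completion code can be deferred this way is
exactly the open question.
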
ATM I think only mirror uses .complete, but it looks like it tries to
actually set up the pivot a good deal before delegating to the bottom
half, so I worry it's not safe to try to run this when we are in the
middle of a drain.
Any thoughts?
--js