15.03.2021 09:06, Roman Kagan wrote:
> The reconnection logic doesn't need to stop while in a drained section.
> Moreover it has to be active during the drained section, as the requests
> that were caught in-flight with the connection to the server broken can
> only usefully get drained if the connection is restored. Otherwise such
> requests can only either stall resulting in a deadlock (before
> 8c517de24a), or be aborted defeating the purpose of the reconnection
> machinery (after 8c517de24a).
>
> Since the pieces of the reconnection logic are now properly migrated
> from one aio_context to another, it appears safe to just stop messing
> with the drained section in the reconnection code.
>
> Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")

I don't think this "fixes" it. The behavior changes, but 5ad81b4946 didn't
introduce any bugs.

> Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd
> reconnect-delay")

And here:

1. There is an existing problem in QEMU (unrelated to nbd): a long I/O request
that we have to wait for at drained_begin may trigger a deadlock
(https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html).
2. So, when we have nbd reconnect (and therefore long I/O requests), we simply
trigger this deadlock. That's why I decided to cancel the requests (assuming
they will most probably fail anyway).

I agree that the nbd driver is the wrong place to fix the problem described in
(https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html), but if
you just revert 8c517de24a, you'll see the deadlock again.
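
To make the trade-off concrete, here is a toy model of the three behaviours
discussed in this thread. It is only a sketch, not QEMU code: the function
name, the policy names, and the poll limit are all hypothetical, and a drain
that does not finish within max_polls merely stands in for a deadlock.

```python
# Toy model only; not QEMU code. All names here are hypothetical.

def drain(policy, reconnect_at, max_polls=10):
    """Poll until the single in-flight request settles.

    policy:
      "stall"     - reconnect is paused while drained, the request just waits
      "abort"     - the request is failed as soon as the drain begins
      "reconnect" - reconnect keeps running during the drained section
    reconnect_at: poll iteration at which the server becomes reachable again.

    Returns (result, polls_used); result is None if the drain did not
    finish within max_polls.
    """
    if policy == "abort":
        # The request is cancelled immediately, so the drain completes,
        # but the guest sees an I/O error.
        return ("EIO", 0)
    for poll in range(1, max_polls + 1):
        if policy == "reconnect" and poll >= reconnect_at:
            # Connection restored while draining: the request completes.
            return ("ok", poll)
        # "stall": nothing restores the connection, so nothing changes.
    return (None, max_polls)


if __name__ == "__main__":
    for policy, when in [("stall", 3), ("abort", 3),
                         ("reconnect", 3), ("reconnect", 50)]:
        print(policy, when, drain(policy, when))
```

The last case is the one I am worried about: even with reconnect active in
the drained section, the drain still waits for as long as the server stays
unreachable.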
--
Best regards,
Vladimir