On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall, resulting in a deadlock (before
> > 8c517de24a), or be aborted, defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> >
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> >
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit,
> > I'd appreciate extra scrutiny in this area.
> >
> > Roman Kagan (7):
> >   block/nbd: avoid touching freed connect_thread
> >   block/nbd: use uniformly nbd_client_connecting_wait
> >   block/nbd: assert attach/detach runs in the proper context
> >   block/nbd: transfer reconnection stuff across aio_context switch
> >   block/nbd: better document a case in nbd_co_establish_connection
> >   block/nbd: decouple reconnect from drain
> >   block/nbd: stop manipulating in_flight counter
> >
> >  block/nbd.c  | 191 +++++++++++++++++++++++----------------------
> >  nbd/client.c |   2 -
> >  2 files changed, 86 insertions(+), 107 deletions(-)
> >
>
> Hmm.  The huge source of problems for this series is the weird logic
> around drain and aio context switch in the NBD driver.
>
> Why do we have all this overly complicated logic with abuse of the
> in_flight counter in NBD?  The answer is connection_co.  NBD differs
> from other drivers in that it has a coroutine independent of the
> request coroutines.  And we have to move this coroutine carefully to
> the new aio context.  We can't just enter it from the new context, we
> want to be sure that connection_co is in one of the yield points that
> support reentering.
>
> I have an idea of how to avoid this thing: drop connection_co
> altogether.
>
> 1. nbd negotiation goes to the connection thread and becomes
> independent of any aio context.
>
> 2. waiting for the server reply goes to the request code.  So, instead
> of always reading the reply from the socket in connection_co, we read
> it in the request coroutine, after sending the request.  We'll need a
> CoMutex for it (as only one request coroutine should read from the
> socket), and be prepared that the incoming reply is not for _this_
> request (in this case we should wake the other request and continue
> reading from the socket).
>
> but this may be too much for soft freeze.
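
If I understand item 2 correctly, the receive path would end up looking
roughly like the sketch below.  This is untested and only illustrative:
the structs are cut down to the fields the sketch needs, and
nbd_wait_for_reply is a hypothetical helper, not anything that exists
in block/nbd.c today.

/* Cut-down versions of the structs, only what the sketch needs. */
typedef struct NBDClientRequest {
    Coroutine *coroutine;   /* request coroutine owning this slot */
    bool receiving;         /* parked in the yield below */
} NBDClientRequest;

typedef struct BDRVNBDState {
    BlockDriverState *bs;
    QIOChannel *ioc;
    CoMutex receive_mutex;  /* only one coroutine reads the socket at a time */
    NBDReply reply;         /* parked reply header; .handle == 0 means none */
    NBDClientRequest requests[MAX_NBD_REQUESTS];
} BDRVNBDState;

/*
 * Wait until s->reply holds the header of our reply.  On success,
 * returns with receive_mutex held; the caller reads the payload,
 * clears s->reply.handle and wakes one parked coroutine before
 * unlocking.
 */
static coroutine_fn int nbd_wait_for_reply(BDRVNBDState *s, uint64_t handle)
{
    int i = HANDLE_TO_INDEX(s, handle);
    int ret;

    qemu_co_mutex_lock(&s->receive_mutex);

    for (;;) {
        if (s->reply.handle == handle) {
            return 0;       /* our header is parked, go read the payload */
        }

        if (s->reply.handle != 0) {
            /*
             * A header for another request is parked and its owner has
             * already been woken; park here until that reply is consumed.
             */
            s->requests[i].receiving = true;
            qemu_co_mutex_unlock(&s->receive_mutex);
            qemu_coroutine_yield();
            qemu_co_mutex_lock(&s->receive_mutex);
            s->requests[i].receiving = false;
            continue;
        }

        /* Nothing parked: we are the one to read from the socket. */
        ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL);
        if (ret < 0) {
            /* (waking the other parked coroutines on error is omitted) */
            qemu_co_mutex_unlock(&s->receive_mutex);
            return ret;
        }

        if (s->reply.handle != handle) {
            /* Not ours: wake the owner, it will find the parked header. */
            int j = HANDLE_TO_INDEX(s, s->reply.handle);
            if (s->requests[j].receiving) {
                aio_co_wake(s->requests[j].coroutine);
            }
        }
    }
}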
This approach does look appealing to me, and I gave it a quick shot,
but the amount of changes it involves does indeed exceed what is
acceptable at rc time.

> Another idea:
>
> You want all the requests to be completed on drain_begin(), not
> cancelled.  Actually, you don't need reconnect running during the
> drained section for that.  It should be enough to just wait for all
> current requests before disabling the reconnect in the drain_begin
> handler.

So effectively you suggest doing NBD's own drain within the
bdrv_co_drain_begin callback.  I'm not totally sure there are no
assumptions this may break, but I'll try to look into this possibility.

Thanks,
Roman.
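
P.S.  To be explicit about what I mean by that, the kind of drain_begin
handler I have in mind is roughly the following.  Again an untested
sketch: it reuses the existing in_flight/free_sema/send_mutex fields of
BDRVNBDState, and the reconnect-disabling part is only hinted at.

static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
{
    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

    /*
     * Keep the reconnect machinery running until every in-flight
     * request has completed (possibly after a successful reconnect),
     * so that they get drained rather than aborted.  free_sema is
     * kicked each time a request slot is released.
     */
    qemu_co_mutex_lock(&s->send_mutex);
    while (s->in_flight > 0) {
        qemu_co_queue_wait(&s->free_sema, &s->send_mutex);
    }
    qemu_co_mutex_unlock(&s->send_mutex);

    /*
     * Only now disable reconnect for the duration of the drained
     * section (wake the reconnect sleep, etc., much as the current
     * handler already does).
     */
    s->drained = true;
}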