Kevin, could you please take a look at my last comments?
Thanks!

Denis

On 15.01.2019 10:22, Denis Plotnikov wrote:
> ping ping ping ping!!!!
>
> On 09.01.2019 11:18, Denis Plotnikov wrote:
>> ping ping!!!
>>
>> On 18.12.2018 11:53, Denis Plotnikov wrote:
>>> ping ping
>>>
>>> On 14.12.2018 14:54, Denis Plotnikov wrote:
>>>> On 13.12.2018 15:20, Kevin Wolf wrote:
>>>>> On 13.12.2018 at 12:07, Denis Plotnikov wrote:
>>>>>> On 12.12.2018 15:24, Kevin Wolf wrote:
>>>>>>> On 11.12.2018 at 17:55, Denis Plotnikov wrote:
>>>>>>>>> Why involve the AioContext at all? This could all be kept at the
>>>>>>>>> BlockBackend level without extending the layering violation that
>>>>>>>>> aio_disable_external() is.
>>>>>>>>>
>>>>>>>>> BlockBackends get notified when their root node is drained, so
>>>>>>>>> hooking things up there should be as easy, if not even easier
>>>>>>>>> than in AioContext.
>>>>>>>>
>>>>>>>> Just want to make sure that I understood correctly what you meant
>>>>>>>> by "BlockBackends get notified". Did you mean that bdrv_drain_end
>>>>>>>> calls the child's role callback blk_root_drained_end by calling
>>>>>>>> bdrv_parent_drained_end?
>>>>>>>
>>>>>>> Yes, blk_root_drained_begin/end calls are all you need.
>>>>>>> Specifically, their adjustments to blk->quiesce_counter that are
>>>>>>> already there, and in the 'if (--blk->quiesce_counter == 0)' block
>>>>>>> of blk_root_drained_end() we can resume the queued requests.
>>>>>>
>>>>>> It sounds like it should work that way, but it doesn't, and here is
>>>>>> why: when doing mirror we may resume the postponed coroutines too
>>>>>> early, while the underlying bs is still protected from writing, and
>>>>>> so we hit the assertion in bdrv_co_write_req_prepare when a write
>>>>>> request is executed by one of the resumed coroutines.
>>>>>>
>>>>>> The thing is that the bs is protected from writing before the
>>>>>> execution of bdrv_replace_node at mirror_exit_common, and
>>>>>> bdrv_replace_node calls bdrv_replace_child_noperm which, in turn,
>>>>>> calls child->role->drained_end, where one of the callbacks is
>>>>>> blk_root_drained_end, which checks if (--blk->quiesce_counter == 0)
>>>>>> and runs the postponed requests (coroutines) if the condition is
>>>>>> true.
>>>>>
>>>>> Hm, so something is messed up with the drain sections in the mirror
>>>>> driver. We have:
>>>>>
>>>>>     bdrv_drained_begin(target_bs);
>>>>>     bdrv_replace_node(to_replace, target_bs, &local_err);
>>>>>     bdrv_drained_end(target_bs);
>>>>>
>>>>> Obviously, the intention was to keep the BlockBackend drained during
>>>>> bdrv_replace_node(). So how could blk->quiesce_counter ever get to 0
>>>>> inside bdrv_replace_node() when target_bs is drained?
>>>>>
>>>>> Looking at bdrv_replace_child_noperm(), it seems that the function
>>>>> has a bug: even if old_bs and new_bs are both drained, the
>>>>> quiesce_counter for the parent reaches 0 for a moment, because we
>>>>> call .drained_end for the old child first and .drained_begin for the
>>>>> new one later.
>>>>>
>>>>> So it seems the fix would be to reverse the order and first call
>>>>> .drained_begin for the new child and then .drained_end for the old
>>>>> child. Sounds like a good new testcase for tests/test-bdrv-drain.c,
>>>>> too.
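A rough sketch of the reordering suggested above, assuming a simplified
bdrv_replace_child_noperm() (the field and callback names here are
approximations for illustration, not the exact QEMU code):

    static void bdrv_replace_child_noperm(BdrvChild *child,
                                          BlockDriverState *new_bs)
    {
        BlockDriverState *old_bs = child->bs;

        /* Begin draining for the new child first, so the parent's
         * quiesce counter never momentarily drops to 0 while both the
         * old and the new node are drained. */
        if (new_bs && new_bs->quiesce_counter && child->role->drained_begin) {
            child->role->drained_begin(child);
        }

        /* ... detach the child from old_bs and attach it to new_bs ... */
        child->bs = new_bs;

        /* Only now end the drained section inherited from the old child. */
        if (old_bs && old_bs->quiesce_counter && child->role->drained_end) {
            child->role->drained_end(child);
        }
    }

With this ordering the parent stays quiesced across the whole replacement,
so blk_root_drained_end() cannot resume postponed requests in the middle of
bdrv_replace_node().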
>>>> Yes, it's true, but it's not enough...
>>>> In mirror_exit_common() we actively manipulate the block driver
>>>> states. When we replace a node in the snippet you showed, we can't
>>>> allow the postponed coroutines to run, because the block tree isn't
>>>> ready to receive the requests yet.
>>>> To become ready, the proper block driver state has to be inserted
>>>> into the block backend, which is done here:
>>>>
>>>>     blk_remove_bs(bjob->blk);
>>>>     blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
>>>>     blk_insert_bs(bjob->blk, mirror_top_bs, &error_abort);  << << << <<
>>>>
>>>>     bs_opaque->job = NULL;
>>>>
>>>>     bdrv_drained_end(src);
>>>>
>>>> If the tree isn't ready and we resume the coroutines, we'll end up
>>>> with the requests landing in the wrong block driver state.
>>>>
>>>> So we should explicitly stop all activity on all the involved block
>>>> driver states and their parents, and only allow it again when
>>>> everything is ready to go.
>>>>
>>>> Why explicitly? Because the block driver states may belong to
>>>> different block backends at the moment the manipulation begins.
>>>>
>>>> So it seems we need to disable all their contexts until the
>>>> manipulation ends.
>>>>
>>>> Please correct me if I'm wrong.
>>>>
>>>>>
>>>>>> It seems that if external requests are disabled on the context, we
>>>>>> can't rely on anything, or we would have to check whether the
>>>>>> underlying bs and its underlying nodes are ready to receive
>>>>>> requests, which sounds quite complicated.
>>>>>> Please correct me if I still don't understand something in that
>>>>>> routine.
>>>>>
>>>>> I think the reason why relying on aio_disable_external() works is
>>>>> simply because src is also drained, which keeps external events in
>>>>> the AioContext disabled despite the bug in draining the target node.
>>>>>
>>>>> The bug would become apparent even with aio_disable_external() if we
>>>>> didn't drain src, or even if we just supported src and target being
>>>>> in different AioContexts.
>>>>
>>>> Why don't we disable all the involved contexts until the end of the
>>>> block device tree reconstruction?
>>>>
>>>> Thanks!
>>>>
>>>> Denis
>>>>>
>>>>> Kevin

--
Best, Denis
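For reference, a minimal sketch of the BlockBackend-level request queueing
discussed at the top of the thread: postponed requests would be resumed
from blk_root_drained_end() once the quiesce counter drops back to zero.
The blk->queued_requests field and the resume loop are illustrative
assumptions, not code that exists in QEMU at this point:

    static void blk_root_drained_end(BdrvChild *child)
    {
        BlockBackend *blk = child->opaque;
        assert(blk->quiesce_counter);

        if (--blk->quiesce_counter == 0) {
            /* The drained section is over: wake the coroutines that were
             * queued while the BlockBackend was quiesced.
             * blk->queued_requests (a CoQueue) is a hypothetical field
             * used here only for illustration. */
            while (qemu_co_enter_next(&blk->queued_requests, NULL)) {
                /* resume one queued request per iteration */
            }
        }
    }

Together with the reordering in bdrv_replace_child_noperm() sketched
earlier, this resume point would only be reached once the drained section
around the node replacement has really ended.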