On 13.03.2019 19:04, Kevin Wolf wrote:
> Am 14.12.2018 um 12:54 hat Denis Plotnikov geschrieben:
>> On 13.12.2018 15:20, Kevin Wolf wrote:
>>> Am 13.12.2018 um 12:07 hat Denis Plotnikov geschrieben:
>>>> It sounds like it should be so, but it doesn't work that way, and
>>>> here is why: when doing mirror, we may resume the postponed
>>>> coroutines too early, while the underlying bs is still protected
>>>> from writing, and thus we hit the assertion in
>>>> bdrv_co_write_req_prepare when the resumed coroutines execute
>>>> their write requests.
>>>>
>>>> The thing is that the bs is protected for writing before the
>>>> execution of bdrv_replace_node in mirror_exit_common, and
>>>> bdrv_replace_node calls bdrv_replace_child_noperm which, in turn,
>>>> calls child->role->drained_end, where one of the callbacks is
>>>> blk_root_drained_end, which checks if (--blk->quiesce_counter == 0)
>>>> and runs the postponed requests (coroutines) if the condition is
>>>> true.
>>>
>>> Hm, so something is messed up with the drain sections in the mirror
>>> driver. We have:
>>>
>>>     bdrv_drained_begin(target_bs);
>>>     bdrv_replace_node(to_replace, target_bs, &local_err);
>>>     bdrv_drained_end(target_bs);
>>>
>>> Obviously, the intention was to keep the BlockBackend drained during
>>> bdrv_replace_node(). So how could blk->quiesce_counter ever get to 0
>>> inside bdrv_replace_node() when target_bs is drained?
>>>
>>> Looking at bdrv_replace_child_noperm(), it seems that the function
>>> has a bug: even if old_bs and new_bs are both drained, the
>>> quiesce_counter for the parent reaches 0 for a moment because we
>>> call .drained_end for the old child first and .drained_begin for
>>> the new one later.
>>>
>>> So it seems the fix would be to reverse the order and first call
>>> .drained_begin for the new child and then .drained_end for the old
>>> child. Sounds like a good new testcase for tests/test-bdrv-drain.c,
>>> too.
>>
>> Yes, it's true, but it's not enough...
>
> Did you ever implement the changes suggested so far, so that we could
> continue from there? Or should I try and come up with something
> myself?

Sorry for the late reply... Yes, I did ...
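To spell out how I read the suggested reordering (a rough sketch only,
assuming a single drain level; the real bdrv_replace_child_noperm()
also has to handle permissions, the parents list manipulation and
nested quiesce counters, and the body below is paraphrased rather than
taken from an actual patch):

    static void bdrv_replace_child_noperm(BdrvChild *child,
                                          BlockDriverState *new_bs)
    {
        BlockDriverState *old_bs = child->bs;

        /* Quiesce the parent for the new child first ... */
        if (new_bs && new_bs->quiesce_counter &&
            child->role->drained_begin) {
            child->role->drained_begin(child);
        }

        /* ... and only then end the drain section that came from the
         * old child. In this order the parent's quiesce_counter never
         * drops to 0 in between when both old_bs and new_bs are
         * drained, so blk_root_drained_end() cannot resume postponed
         * requests in the middle of the graph change. */
        if (old_bs && old_bs->quiesce_counter &&
            child->role->drained_end) {
            child->role->drained_end(child);
        }

        /* (then detach child from old_bs->parents, set
         * child->bs = new_bs, attach it to new_bs->parents, etc.) */
    }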
>
>> In mirror_exit_common() we actively manipulate block driver states.
>> When we have replaced a node in the snippet you showed, we can't
>> allow the postponed coroutines to run yet, because the block tree
>> isn't ready to receive the requests.
>> To be ready, we need to insert the proper block driver state into
>> the block backend, which is done here:
>>
>>     blk_remove_bs(bjob->blk);
>>     blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
>>     blk_insert_bs(bjob->blk, mirror_top_bs, &error_abort); << << << <<
>>
>>     bs_opaque->job = NULL;
>>
>>     bdrv_drained_end(src);
>
> Did you actually encounter a bug here or is this just theory?
> bjob->blk is the BlockBackend of the job and isn't in use at this
> point any more. We only insert the old node in it again because
> block_job_free() must set bs->job = NULL, and it gets bs with
> blk_bs(bjob->blk).
>
> So if there is an actual bug here, I don't understand it yet.

And I did encounter the bug that I described above. When a postponed
coroutine resumes, it fails on the assertion:

    bdrv_co_write_req_prepare: Assertion `child->perm & BLK_PERM_WRITE'
    failed

This is why it happens: we have the mirror filter bds in the blk root,
which receives all the requests. On mirror completion we call
mirror_exit_common() to finish mirroring. To finish mirroring, we need
to remove the mirror filter from the graph and make the mirror file
the blk root. We call block_job_complete(). Assume an IDE request
arrives after the completion call and gets postponed because
blk->quiesce_counter is not 0. block_job_complete() runs
mirror_exit_common(), which drops the permissions:

    /* We don't access the source any more. Dropping any WRITE/RESIZE is
     * required before it could become a backing file of target_bs. */
    bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
                            &error_abort);

Then it replaces the source with the target:

    /* The mirror job has no requests in flight any more, but we need to
     * drain potential other users of the BDS before changing the graph. */

    // here, target_bs has no parents and doesn't begin to drain
    bdrv_drained_begin(target_bs);

    // after execution of the function below,
    // target_bs has mirror_top_bs->backing as a parent
    bdrv_replace_node(to_replace, target_bs, &local_err);

    // now target_bs has the source's blk as a parent;
    // the following call sets blk->quiesce_counter to 0
    // and executes the postponed coroutine on blk with the
    // mirror filter still set, which eventually does a write
    // on the mirror_top_bs->backing child, which has no write
    // (and read) permissions any more
    bdrv_drained_end(target_bs);

Does that make things clearer?

Denis

>
>> If the tree isn't ready and we resume the coroutines, we'll end up
>> with the requests landing in the wrong block driver state.
>>
>> So we should explicitly stop all activity on all the driver states
>> and their parents, and allow it again only when everything is ready
>> to go.
>>
>> Why explicitly? Because the block driver states may belong to
>> different block backends at the moment the manipulation begins.
>>
>> So it seems we need to disable all their contexts until the
>> manipulation ends.
>
> If there actually is a bug, it is certainly not solved by calling
> aio_disable_external() (it is bad enough that this even exists), but
> by keeping the node drained.
>
> Kevin

-- 
Best,
Denis
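P.S. To make the failure sequence above easier to follow, here is a
sketch of the parent callback where the postponed requests get
resumed, as I understand it with the postponing applied (paraphrased;
the postponed_requests queue name and the restart loop are
illustrative, not the literal code of the series):

    static void blk_root_drained_end(BdrvChild *child)
    {
        BlockBackend *blk = child->opaque;
        assert(blk->quiesce_counter);

        if (--blk->quiesce_counter == 0) {
            /* This is the moment the walkthrough above is about: as
             * soon as the counter hits 0 (which happens inside
             * bdrv_drained_end(target_bs)), the postponed request
             * coroutines are restarted, although mirror_top_bs is
             * still in the blk root and its ->backing child has
             * already lost BLK_PERM_WRITE, so the write request
             * asserts in bdrv_co_write_req_prepare(). */
            while (qemu_co_enter_next(&blk->postponed_requests, NULL)) {
                /* restart one postponed request coroutine */
            }
        }
    }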