ping ping ping ping!!!!

On 09.01.2019 11:18, Denis Plotnikov wrote:
> ping ping!!!
>
> On 18.12.2018 11:53, Denis Plotnikov wrote:
>> ping ping
>>
>> On 14.12.2018 14:54, Denis Plotnikov wrote:
>>>
>>>
>>> On 13.12.2018 15:20, Kevin Wolf wrote:
>>>> Am 13.12.2018 um 12:07 hat Denis Plotnikov geschrieben:
>>>>> On 12.12.2018 15:24, Kevin Wolf wrote:
>>>>>> Am 11.12.2018 um 17:55 hat Denis Plotnikov geschrieben:
>>>>>>>> Why involve the AioContext at all? This could all be kept at the
>>>>>>>> BlockBackend level without extending the layering violation that
>>>>>>>> aio_disable_external() is.
>>>>>>>>
>>>>>>>> BlockBackends get notified when their root node is drained, so
>>>>>>>> hooking things up there should be as easy, if not even easier
>>>>>>>> than in AioContext.
>>>>>>>
>>>>>>> Just want to make sure that I understood correctly what you meant
>>>>>>> by "BlockBackends get notified". Did you mean that bdrv_drain_end
>>>>>>> calls the child's role callback blk_root_drained_end by calling
>>>>>>> bdrv_parent_drained_end?
>>>>>>
>>>>>> Yes, blk_root_drained_begin/end calls are all you need.
>>>>>> Specifically, their adjustments to blk->quiesce_counter that are
>>>>>> already there, and in the 'if (--blk->quiesce_counter == 0)' block
>>>>>> of blk_root_drained_end() we can resume the queued requests.
>>>>> It sounds like it should work that way, but it doesn't, and here is
>>>>> why: when doing mirror we may resume the postponed coroutines too
>>>>> early, while the underlying bs is still protected from writing, and
>>>>> thus we hit the assertion in bdrv_co_write_req_prepare() when a
>>>>> resumed coroutine executes its write request.
>>>>>
>>>>> The thing is that the bs is write-protected before the execution of
>>>>> bdrv_replace_node() in mirror_exit_common(), and bdrv_replace_node()
>>>>> calls bdrv_replace_child_noperm(), which in turn calls
>>>>> child->role->drained_end. One of those callbacks is
>>>>> blk_root_drained_end(), which checks
>>>>> 'if (--blk->quiesce_counter == 0)' and runs the postponed requests
>>>>> (coroutines) if the condition is true.
>>>>
>>>> Hm, so something is messed up with the drain sections in the mirror
>>>> driver. We have:
>>>>
>>>> bdrv_drained_begin(target_bs);
>>>> bdrv_replace_node(to_replace, target_bs, &local_err);
>>>> bdrv_drained_end(target_bs);
>>>>
>>>> Obviously, the intention was to keep the BlockBackend drained during
>>>> bdrv_replace_node(). So how could blk->quiesce_counter ever get to 0
>>>> inside bdrv_replace_node() when target_bs is drained?
>>>>
>>>> Looking at bdrv_replace_child_noperm(), it seems that the function
>>>> has a bug: even if old_bs and new_bs are both drained, the
>>>> quiesce_counter of the parent reaches 0 for a moment because we call
>>>> .drained_end for the old child first and .drained_begin for the new
>>>> one only later.
>>>>
>>>> So it seems the fix would be to reverse the order and first call
>>>> .drained_begin for the new child and then .drained_end for the old
>>>> child. Sounds like a good new testcase for tests/test-bdrv-drain.c,
>>>> too.
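A minimal sketch of the reordering Kevin suggests above, for illustration
only; the field and callback names (quiesce_counter,
child->role->drained_begin/end) follow what is quoted in this thread and
may not match the actual block.c exactly:

    static void bdrv_replace_child_noperm(BdrvChild *child,
                                          BlockDriverState *new_bs)
    {
        BlockDriverState *old_bs = child->bs;

        /* Begin draining through the new child first ... */
        if (new_bs && new_bs->quiesce_counter && child->role->drained_begin) {
            child->role->drained_begin(child);
        }

        /* ... and only then end draining through the old child, so a
         * parent quiesced via old_bs never sees its quiesce_counter drop
         * to 0 while the graph is being rewired. */
        if (old_bs && old_bs->quiesce_counter && child->role->drained_end) {
            child->role->drained_end(child);
        }

        /* detach from old_bs and attach to new_bs as in the existing code */
    }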
>>> Yes, it's true, but it's not enough...
>>> In mirror_exit_common() we actively manipulate block driver states.
>>> When we replace the node in the snippet you showed, we can't allow
>>> the postponed coroutines to run, because the block tree isn't ready
>>> to receive requests yet.
>>> To be ready, we need to insert the proper block driver state into the
>>> block backend, which is done here:
>>>
>>> blk_remove_bs(bjob->blk);
>>> blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
>>> blk_insert_bs(bjob->blk, mirror_top_bs, &error_abort); << << << <<
>>>
>>> bs_opaque->job = NULL;
>>>
>>> bdrv_drained_end(src);
>>>
>>> If the tree isn't ready and we resume the coroutines, we'll end up
>>> with the requests landing in the wrong block driver state.
>>>
>>> So we should explicitly stop all activity on all the block driver
>>> states and their parents, and allow it again only when everything is
>>> ready to go.
>>>
>>> Why explicitly? Because the block driver states may belong to
>>> different block backends at the moment the manipulation begins.
>>>
>>> So it seems we need to disable all their contexts until the
>>> manipulation ends.
>>>
>>> Please correct me if I'm wrong.
>>>
>>>>
>>>>> It seems that if external requests are disabled on the context we
>>>>> can't rely on anything, or we would have to check whether the
>>>>> underlying bs and its underlying nodes are ready to receive
>>>>> requests, which sounds quite complicated.
>>>>> Please correct me if I still don't understand something in that
>>>>> routine.
>>>>
>>>> I think the reason why relying on aio_disable_external() works is
>>>> simply because src is also drained, which keeps external events in
>>>> the AioContext disabled despite the bug in draining the target node.
>>>>
>>>> The bug would become apparent even with aio_disable_external() if we
>>>> didn't drain src, or even if we just supported src and target being
>>>> in different AioContexts.
>>>
>>> Why don't we disable all the contexts involved until the end of the
>>> block device tree reconstruction?
>>>
>>> Thanks!
>>>
>>> Denis
>>>>
>>>> Kevin
>>>>
>>>
>>
>
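A minimal sketch of the resume point the discussion keeps coming back to:
postponed requests are woken only when the BlockBackend's last drained
section ends. Only the quiesce_counter handling is taken from the quoted
mails; blk_resume_postponed_requests() is a hypothetical name standing in
for whatever mechanism the series actually uses:

    static void blk_root_drained_end(BdrvChild *child)
    {
        BlockBackend *blk = child->opaque;

        assert(blk->quiesce_counter);
        if (--blk->quiesce_counter == 0) {
            /* The last drained section covering this BlockBackend has
             * ended, i.e. the graph manipulation (bdrv_replace_node(),
             * blk_insert_bs(), ...) is complete, so only now is it safe
             * to wake the postponed request coroutines. */
            blk_resume_postponed_requests(blk); /* hypothetical helper */
        }
    }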
--
Best,
Denis