Am 19.10.2023 um 09:28 hat lv.mengz...@zte.com.cn geschrieben:
> On Tue, Oct 17, 2023 at 10:04PM +0800, stefa...@redhat.com wrote:
> > > From: hujian <hu.j...@zte.com.cn> 
> > >  
> > > During the stop of dataplane for virtio-blk,
> > > virtio_bus_cleanup_host_notifier() is called at the end to clean up the
> > > notifier; if the ioeventfd was polled, virtio_blk_handle_output() is
> > > used to handle the io request. But because s->dataplane_disabled is
> > > false, it returns directly, which drops the io request.
> > > Backtrace:
> > > ->virtio_blk_data_plane_stop
> > >   ->virtio_bus_cleanup_host_notifier
> > >     ->virtio_queue_host_notifier_read
> > >       ->virtio_queue_notify_vq
> > >         ->vq->handle_output
> > >           ->virtio_blk_handle_output
> > >             ->if (s->dataplane  && !s->dataplane_stoped)
> > >               ->if (!s->dataplane_disabled)
> > >                 ->return *
> > >             ->virtio_blk_handle_output_do
> > > The above problem can occur when using the "virsh reset" command line
> > > to reset the guest while the guest is doing io.
> > > To fix this problem, don't try to start dataplane if s->stopping is
> > > true, so that io is handled by virtio_blk_handle_vq().
> > >  
> > > Signed-off-by: hujian <hu.j...@zte.com.cn> 
> > > ---
> > >  hw/block/virtio-blk.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > I have dropped this patch again after Fiona pointed out it does not
> > compile and Kevin noticed that handling requests from the main loop
> > thread while the I/O is still being processed in the IOThread is going
> > to cause thread-safety issues.
> > 
> > Can you explain the problem you are seeing in more detail? You run
> > "virsh reset" while the guest is doing I/O. Then what happens?
> > 
> > Stefan
> 
> 1 Compilation issues
> I'm sorry, I was in such a hurry to submit the patch that I forgot to
> compile it locally. Compilable patches are at the bottom.

The more worrying part is that without building it, you also can't have
tested it. How much testing did the new, compilable patch undergo?

> 2 Troubleshooting
> We have done a lifecycle test for the VM (QEMU version: 4.1.0, host
> kernel version: 4.18), which executes in a loop: virsh create -> virsh
> suspend -> virsh resume -> virsh reset -> virsh shutdown, with an io
> stress test running inside the virtual machine. After the loop has been
> executed about 200 times, once "virsh reset" is executed, the virtual
> machine goes into emergency mode and fails to start normally.
> Theoretically, "virsh reset" may cause data loss in the virtual machine,
> but not enough to put it into emergency mode.

QEMU 4.1.0 is quite old. Do you happen to know if the problem is still
reproducible on current git master (or 8.1.0)?

Though from your description of the bug and what the code looks like
today, I would expect so.

> Coincidentally, I happen to be fixing another, different fault with the
> same phenomenon, which is caused by one of our private patches. This
> patch calls virtio_blk_data_plane_[stop|start] to operate the dataplane;
> if QEMU is processing io requests at the same time, it may cause the
> loss of io requests.
> 
> Analyzing virtio_blk_data_plane_stop(): virtio_bus_cleanup_host_notifier()
> is used to clean up the notifiers, and my understanding is that it would
> handle the remaining IO requests.

Well, as you found out, it doesn't actually do that in practice. I'm
not sure what the idea behind it was. Handling the remaining requests in
our case would be a problem because we're not in the right state to
handle requests.
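
For reference, the cleanup path really only fires whatever handler is
currently installed one last time. Roughly (a simplified sketch, not a
verbatim quote of the code):

void virtio_queue_host_notifier_read(EventNotifier *n)
{
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    if (event_notifier_test_and_clear(n)) {
        /* ends up in vq->handle_output, i.e. virtio_blk_handle_output() */
        virtio_queue_notify_vq(vq);
    }
}

So whether any remaining requests are actually processed depends entirely
on what the installed handler decides to do at that point.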

> The stack is as follows. I added a print at the starred line and found
> that virtio_blk_handle_output() returned directly instead of continuing
> to call virtio_blk_handle_vq() to handle io. So I modified the code here
> so that it doesn't return during the stop of dataplane, and our internal
> private patch works normally.
> 
> Backtrace:
> ->virtio_blk_data_plane_stop
>   ->virtio_bus_cleanup_host_notifier
>     ->virtio_queue_host_notifier_read
>       ->virtio_queue_notify_vq
>         ->vq->handle_output
>           ->virtio_blk_handle_output
>             ->if (s->dataplane  && !s->dataplane_stoped)
>               ->if (!s->dataplane_disabled)
>                 ->return *
>             ->virtio_blk_handle_vq
> 
> Back to the problem caused by virsh reset: libvirt sends the
> "system_reset" QEMU monitor command to QEMU, QEMU calls
> qmp_system_reset(), and that eventually calls
> virtio_blk_data_plane_[stop|start] to reset the devices. I suspect that
> io loss will also occur if QEMU still has io to process during the stop
> of dataplane.

Hm... But why can the guest expect that requests are processed during a
hard reset? Intuitively, dropping any remaining requests once you have
reached the point where you do a reset doesn't sound like a problem.

> 3 Thread-safety issues
> virtio_blk_data_plane_stop() calls blk_set_aio_context() to switch bs
> back to the QEMU main loop after virtio_bus_cleanup_host_notifier(),
> so the remaining IO requests are still handled by the iothread (if
> configured). I'm a little confused as to why there are thread-safety
> issues.

blk_set_aio_context() only tells the BlockBackend which iothread to
expect requests from. The device needs to take care to actually send
requests from the same thread that it promised to the BlockBackend. In
your stack trace, we still promise that requests come from the iothread,
but we're actually trying to process them in the main thread:
virtio_blk_data_plane_stop() runs in the main thread and indirectly
calls virtio_blk_handle_output(), so with your fix, they would be
handled in the main thread before blk_set_aio_context() was called. This
isn't right.
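
To make the "promise" part more concrete, the contract is roughly this
(a sketch of the idea, not the actual virtio-blk code; iothread_ctx just
stands for the iothread's AioContext here):

/* Sketch of the contract only, not the actual virtio-blk code */

/* dataplane start: requests will come from the iothread's AioContext */
blk_set_aio_context(blk, iothread_ctx, &error_abort);

/* ... blk_aio_preadv()/blk_aio_pwritev() must now be called from the
 *     iothread ... */

/* dataplane stop: requests will come from the main loop again */
blk_set_aio_context(blk, qemu_get_aio_context(), &error_abort);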

We would have to delay processing requests until after we have
completely switched back to the non-dataplane mode. And after thinking a
bit about this, this seems to be exactly what the current code is doing:
in fact, returning from virtio_blk_handle_output() doesn't drop the
request. It just means that we're not processing it right now; it still
remains queued (there is no virtqueue_pop() involved when we return
early).
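
In code, the path from your backtrace is roughly this (paraphrased from
memory, so not necessarily the exact current source):

static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtIOBlock *s = (VirtIOBlock *)vdev;

    if (s->dataplane && !s->dataplane_started) {
        /* try to start dataplane; the request will be picked up there */
        virtio_device_start_ioeventfd(vdev);
        if (!s->dataplane_disabled) {
            /* nothing has been popped from the vq yet, so the request
             * just stays queued */
            return;
        }
    }
    /* virtqueue_pop() only happens in here */
    virtio_blk_handle_vq(s, vq);
}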

So I think returning there early is the right thing to do.

I'm not entirely sure where your problem happens. What would make most
sense is that virtio_blk_reset() throws away the requests. But throwing
away unprocessed requests is exactly what I would expect from a reset,
so I don't really see a problem in that.
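
To illustrate what I mean by throwing them away, the reset path would do
something like this (a rough sketch of the idea only; I haven't
re-checked the actual virtio_blk_reset() for this mail):

/* Rough sketch, not verified against the actual code: drain in-flight
 * I/O, then detach and free whatever the device still has queued
 * internally, i.e. throw those requests away. */
blk_drain(s->blk);

while (s->rq) {
    VirtIOBlockReq *req = s->rq;

    s->rq = req->next;
    virtqueue_detach_element(req->vq, &req->elem, 0);
    virtio_blk_free_request(req);
}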

We probably need to dig a bit deeper to figure out why your guest has a
problem with this. Can you give some details about the guest? Like which
OS, filesystem, other storage related configuration etc.? At this point
it sounds a bit like a guest side bug.

> Lastly, please CC cv11...@126.com; this is my private email, so I can
> contact you in my free time. Thanks.

Ok. What I usually do in such cases is that I already CC the alternative
address myself, to make sure that nobody forgets to add it while
replying.

Kevin

