On Tue, Oct 17, 2023 at 10:04PM +0800, stefa...@redhat.com wrote:
> > From: hujian <hu.j...@zte.com.cn> 
> >  
> > During the stop of dataplane for virtio-blk, 
> > virtio_bus_cleanup_host_notifier() is be
> > called to clean up notifier at the end, if polled ioeventfd, 
> > virtio_blk_handle_output()
> > is used to handle io request. But due to s->dataplane_disabled is false, it 
> > will be
> > returned directly, which drops io request.
> > Backtrace:
> > ->virtio_blk_data_plane_stop
> >   ->virtio_bus_cleanup_host_notifier
> >     ->virtio_queue_host_notifier_read
> >       ->virtio_queue_notify_vq
> >         ->vq->handle_output
> >           ->virtio_blk_handle_output
> >             ->if (s->dataplane  && !s->dataplane_stoped)
> >               ->if (!s->dataplane_disabled)
> >                 ->return *
> >             ->virtio_blk_handle_output_do
> > The above problem can occur when using "virsh reset" cmdline to reset 
> > guest, while
> > guest does io.
> > To fix this problem, don't try to start dataplane if s->stopping is true, 
> > and io would
> > be handled by virtio_blk_handle_vq().
> >  
> > Signed-off-by: hujian <hu.j...@zte.com.cn> 
> > ---
> >  hw/block/virtio-blk.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> I have dropped this patch again after Fiona pointed out it does not
> compile and Kevin noticed that handling requests from the main loop
> thread while the I/O is still being processed in the IOThread is going
> to cause thread-safety issues.
> 
> Can you explain the problem you are seeing in more detail? You run
> "virsh reset" while the guest is doing I/O. Then what happens?
> 
> Stefan

1 Compilation issues
I'm sorry to be in such a hurry to submit the patch that I forgot to compile it 
locally.
Compilable patches are at the bottom.

2 Troubleshooting
We have done a lifecycle test for the VM (QEMU version: 4.1.0, Host kernel 
version: 4.18),
which is loop execution: virsh create -> virsh suspend -> virsh resume -> virsh 
reset ->
virsh shutdown, and io stress test inside the virtual machine. After the loop 
is executed
about 200 times, after "virsh reset" is executed, the virtual machine goes into 
emergency
mode and fails to start normally. Theoretically, "virsh reset" may causes data 
loss of
virtual machine, but not enough to put it into emergency mode.

Coincidentally, I happen to be fixing another different fault with the same 
phenomenon,
which is caused by a our private patch, this patch calls virtio_blk_data_plane_ 
[stop|start]
to operate the dataplane, if QEMU is processing io requests at same time, it 
may cause the
loss of io requests.

Analyzing virtio_blk_data_plane_stop(), virtio_bus_cleanup_host_notifier() is 
used to
clean up notifiers, and my understanding is that it would handle the remaining 
IO requests.
The stack is as follows, I add the print at the star line and find that 
virtio_blk_handle_output()
returned directly instead of continuing to call virtio_blk_handle_vq() to 
handle io. So I modify
the code here to make it don't return during the stop of dataplane, and our 
internal private patch
works normally.

Backtrace:
->virtio_blk_data_plane_stop
  ->virtio_bus_cleanup_host_notifier
    ->virtio_queue_host_notifier_read
      ->virtio_queue_notify_vq
        ->vq->handle_output
          ->virtio_blk_handle_output
            ->if (s->dataplane  && !s->dataplane_stoped)
              ->if (!s->dataplane_disabled)
                ->return *
            ->virtio_blk_handle_vq

Back to the problem caused by the virsh reset, libvirt sends the "system_reset" 
QEMU
monitor command to QEMU, and QEMU calls qmp_system_reset(), and eventually calls
virtio_blk_data_plane_[stop|start] to reset devices. I suspect that io loss will
also occur if QEMU still has io to process during the stop of dataplane. 

3 Thread-safety issues
virtio_blk_data_plane_stop() calls blk_set_aio_context() to switch bs back to 
the QEMU
main loop after virtio_bus_cleanup_host_notifier(), so the remaining IO requests
are still handling by iothread(if configured). I'm a little confused as to why 
there
is thread-safety issues.

Lastly, please CC to cv11...@126.com, this is my private email, so I can 
contact with
you in my free time, Thanks.

4 Compilable patches
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 39e7f23..d3a719c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1165,8 +1165,9 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBlock *s = (VirtIOBlock *)vdev;
+    VirtIOBlockDataPlane *dp = s->dataplane;

-    if (s->dataplane && !s->dataplane_started) {
+    if (s->dataplane && !s->dataplane_started && !dp->stopping) {
         /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
          * dataplane here instead of waiting for .set_status().
          */

Reply via email to