At Tue, 3 Jan 2012 08:13:51 +0000,
Stefan Hajnoczi wrote:
>
> On Mon, Jan 02, 2012 at 10:38:11PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Jan 2, 2012 at 3:39 PM, Christoph Hellwig <h...@lst.de> wrote:
> > > On Fri, Dec 30, 2011 at 10:35:01AM +0000, Stefan Hajnoczi wrote:
> > >> If you can reproduce this bug and suspect coroutines are involved then I
> > >
> > > It's entirely reproducible.
> > >
> > > I've played around a bit and switched from the ucontext to the gthreads
> > > coroutine implementation.  The result seems odd, but starts to make sense.
> > >
> > > Running the workload I now get the following message from qemu:
> > >
> > > Co-routine re-entered recursively
> > >
> > > and the gdb backtrace looks like:
> > >
> > > (gdb) bt
> > > #0  0x00007f2fff36f405 in *__GI_raise (sig=<optimized out>)
> > >     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> > > #1  0x00007f2fff372680 in *__GI_abort () at abort.c:92
> > > #2  0x00007f30019a6616 in qemu_coroutine_enter (co=0x7f3004d4d7b0, opaque=0x0)
> > >     at qemu-coroutine.c:53
> > > #3  0x00007f30019a5e82 in qemu_co_queue_next_bh (opaque=<optimized out>)
> > >     at qemu-coroutine-lock.c:43
> > > #4  0x00007f30018d5a72 in qemu_bh_poll () at async.c:71
> > > #5  0x00007f3001982990 in main_loop_wait (nonblocking=<optimized out>)
> > >     at main-loop.c:472
> > > #6  0x00007f30018cf714 in main_loop () at /home/hch/work/qemu/vl.c:1481
> > > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> > >     at /home/hch/work/qemu/vl.c:3479
> > >
> > > Adding some printks suggests this happens when calling add_aio_request from
> > > aio_read_response when either delaying creates or updating metadata,
> > > although not every time one of these cases happens.
> > >
> > > I've tried to understand how the recursive calling happens, but
> > > unfortunately the whole coroutine code lacks any sort of documentation of
> > > how it should behave or what it asserts about the callers.
> > >
> > >> I don't have a sheepdog setup here but if there's an easy way to
> > >> reproduce please let me know and I'll take a look.
> > >
> > > With the small patch below applied to the sheepdog source I can reproduce
> > > the issue on my laptop using the following setup:
> > >
> > > for port in 7000 7001 7002; do
> > >     mkdir -p /mnt/sheepdog/$port
> > >     /usr/sbin/sheep -p $port -c local /mnt/sheepdog/$port
> > >     sleep 2
> > > done
> > >
> > > collie cluster format
> > > collie vdi create test 20G
> > >
> > > then start a qemu instance that uses the sheepdog volume with the
> > > following drive and device lines:
> > >
> > >     -drive if=none,file=sheepdog:test,cache=none,id=test \
> > >     -device virtio-blk-pci,drive=test,id=testdev \
> > >
> > > finally, in the guest run:
> > >
> > >     dd if=/dev/zero of=/dev/vdX bs=67108864 count=128 oflag=direct
> >
> > Thanks for these instructions.  I can reproduce the issue here.
> >
> > It seems suspicious the way that BDRVSheepdogState->co_recv and
> > ->co_send work.  The code adds select(2) read/write callback functions
> > on the sheepdog socket file descriptor.  When the socket becomes
> > writeable or readable the co_send or co_recv coroutines are entered.
> > So far, so good; this is how a coroutine is integrated into the main
> > loop of QEMU.
> >
> > The problem is that this patch is mixing things.  The co_recv
> > coroutine runs aio_read_response(), which invokes send_pending_req().
> > send_pending_req() invokes add_aio_request().  That function isn't
> > suitable for co_recv's context because it actually sends data and hits
> > a few blocking (yield) points.  It takes a coroutine mutex - but the
> > select(2) read callback is still in place.  We're now still in the
> > aio_read_response() call chain except we're actually not reading at
> > all, we're trying to write!  And we'll get spurious wakeups if there
> > is any data readable on the socket.
> >
> > So the co_recv coroutine has two things in the system that will try to
> > enter it:
> >
> > 1. The select(2) read callback on the sheepdog socket.
> > 2. The add_aio_request() blocking operations, including a coroutine mutex.
> >
> > This is bad; a yielded coroutine should only have one thing that will
> > enter it.  It's rare that it makes sense to have multiple things
> > entering a coroutine.
> >
> > It's late here but I hope this gives Kazutaka some thoughts on what is
> > causing the issue with this patch.
>
> Poked around some more this morning.  Here is a patch that isolates the
> bug:
>
> ---8<---
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index 26ad76b..0d7a03f 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -59,6 +59,16 @@ void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
>      QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
>      qemu_coroutine_yield();
>      assert(qemu_in_coroutine());
> +    {
> +        Coroutine *co;
> +
> +        QTAILQ_FOREACH(co, &queue->entries, co_queue_next) {
> +            assert(co != self);
> +        }
> +        QTAILQ_FOREACH(co, &unlock_bh_queue, co_queue_next) {
> +            assert(co != self);
> +        }
> +    }
>  }
>
>  void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue)
> ---8<---
>
> When you run this with ucontext or GThread coroutines you hit this
> assertion failure:
>
> qemu-system-x86_64: qemu-coroutine-lock.c:66: qemu_co_queue_wait: Assertion
> `co != self' failed.
>
> Here is an explanation of what the asserts are checking and how we
> violate the constraint:
>
> qemu_co_queue_next() and qemu_co_queue_next_bh() take a waiting
> Coroutine off the wait queue and put it onto the unlock bh queue.  When
> the BH executes, the coroutines on the unlock bh queue are popped off
> that queue and entered.  This is how coroutine wait queues work, and
> coroutine mutexes are built on coroutine wait queues.
>
> The problem is that, due to the spurious wakeups introduced by the
> sheepdog patch, we acquire the mutex earlier than normal, with a BH
> unlock still pending.  Our coroutine can then terminate by returning
> from sheepdog.c:aio_read_response() while the BH is still pending.
> When we get around to executing the BH we operate on a freed
> coroutine - BOOM!
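For reference, the bottom half that frame #3 of the backtrace points at
(qemu_co_queue_next_bh() in qemu-coroutine-lock.c) works roughly as follows;
this is a paraphrase of the QEMU code of that time, not a verbatim copy:

    /* Waiters released by qemu_co_queue_next() are moved onto
     * unlock_bh_queue and a bottom half is scheduled.  The BH below then
     * re-enters each queued coroutine from the main loop. */
    static void qemu_co_queue_next_bh(void *opaque)
    {
        Coroutine *next;

        while ((next = QTAILQ_FIRST(&unlock_bh_queue))) {
            QTAILQ_REMOVE(&unlock_bh_queue, next, co_queue_next);
            /* If a spurious fd wakeup has already entered 'next' and it has
             * run to completion, this enters a freed coroutine - the crash
             * described above. */
            qemu_coroutine_enter(next, NULL);
        }
    }

A coroutine sitting on unlock_bh_queue therefore must not be entered, or
allowed to finish, by anything else before this BH runs - exactly the
constraint the spurious read wakeups violate.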
Thanks for your detailed explanation.  I confirmed that the following
hack fixes the problem:

---8<---
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 17a79be..b27c770 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -626,6 +626,9 @@ static void coroutine_fn aio_read_response(void *opaque)
 
     switch (acb->aiocb_type) {
     case AIOCB_WRITE_UDATA:
+        /* this coroutine context is no longer suitable for co_recv
+         * because we may send data to update vdi objects */
+        s->co_recv = NULL;
         if (!is_data_obj(aio_req->oid)) {
             break;
         }
---8<---

> The reason we get different failure behavior with ucontext and GThread
> is that they have different implementations and reuse coroutines
> differently.  We've still done something illegal, but the undefined
> behavior happens to be different - both ucontext and GThread are
> working correctly; the problem is in the sheepdog patch.
>
> I suggest implementing coroutine socket I/O functions for both send
> *and* receive.  Then sheepdog request processing becomes simple - we'll
> have less callback and coroutine trickery because it's possible to
> write synchronous coroutine code.
>
> Right now the code is a mix of callbacks and some coroutine code (which
> has the flaw I explained above).  Things will be much easier if the
> code is converted fully to coroutines.

Yes, the original patch aimed to make Sheepdog I/O asynchronous with a
small change, so the current sheepdog code is not clean.  It looks like
a good time to make the sheepdog driver fully coroutine-based.

Thanks,

Kazutaka
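As a rough illustration of the coroutine socket I/O idea suggested above, a
send helper could be sketched as follows.  This is only a sketch, not the
code that was later merged: sd_co_send() and restart_co_req() are
illustrative names, and the qemu_aio_set_fd_handler(fd, read_handler,
write_handler, flush_handler, opaque) prototype is assumed from the QEMU
version of that period.  The key property is that the write handler is
installed only while the coroutine is actually waiting for the socket, and
removed before returning, so the yielded coroutine has exactly one wakeup
source:

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    /* Coroutine, coroutine_fn, qemu_coroutine_self(), qemu_coroutine_yield()
     * and qemu_aio_set_fd_handler() come from the QEMU-internal headers of
     * that era (qemu-coroutine.h, qemu-aio.h). */

    /* Write-readiness callback: its only job is to re-enter the coroutine
     * that is waiting in sd_co_send(). */
    static void restart_co_req(void *opaque)
    {
        Coroutine *co = opaque;

        qemu_coroutine_enter(co, NULL);
    }

    /* Send 'count' bytes, yielding whenever the socket would block. */
    static coroutine_fn ssize_t sd_co_send(int sockfd, const void *buf,
                                           size_t count)
    {
        Coroutine *self = qemu_coroutine_self();
        ssize_t done = 0;

        /* While yielded below, the write handler is the only thing that
         * may enter this coroutine. */
        qemu_aio_set_fd_handler(sockfd, NULL, restart_co_req, NULL, self);

        while (done < (ssize_t)count) {
            ssize_t ret = send(sockfd, (const char *)buf + done,
                               count - done, 0);
            if (ret > 0) {
                done += ret;
            } else if (ret == -1 && errno == EINTR) {
                continue;                   /* interrupted, retry */
            } else if (ret == -1 && errno == EAGAIN) {
                qemu_coroutine_yield();     /* re-entered by restart_co_req() */
            } else {
                done = -1;                  /* hard error (or unexpected 0) */
                break;
            }
        }

        /* Remove the handler before returning so no stale wakeup source is
         * left behind. */
        qemu_aio_set_fd_handler(sockfd, NULL, NULL, NULL, NULL);
        return done;
    }

A matching receive helper would be symmetric, installing a read handler
instead; with both in place, aio_read_response() and add_aio_request() can be
written as straight-line coroutine code rather than a mix of callbacks and
partial coroutine usage.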