At Tue, 3 Jan 2012 08:13:51 +0000,
Stefan Hajnoczi wrote:
>
> On Mon, Jan 02, 2012 at 10:38:11PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Jan 2, 2012 at 3:39 PM, Christoph Hellwig <h...@lst.de> wrote:
> > > On Fri, Dec 30, 2011 at 10:35:01AM +0000, Stefan Hajnoczi wrote:
> > >> If you can reproduce this bug and suspect coroutines are involved then I
> > >
> > > It's entirely reproducible.
> > >
> > > I've played around a bit and switched from the ucontext to the gthreads
> > > coroutine implementation.  The result seems odd, but starts to make sense.
> > >
> > > Running the workload I now get the following message from qemu:
> > >
> > > Co-routine re-entered recursively
> > >
> > > and the gdb backtrace looks like:
> > >
> > > (gdb) bt
> > > #0  0x00007f2fff36f405 in *__GI_raise (sig=<optimized out>)
> > >     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> > > #1  0x00007f2fff372680 in *__GI_abort () at abort.c:92
> > > #2  0x00007f30019a6616 in qemu_coroutine_enter (co=0x7f3004d4d7b0, opaque=0x0)
> > >     at qemu-coroutine.c:53
> > > #3  0x00007f30019a5e82 in qemu_co_queue_next_bh (opaque=<optimized out>)
> > >     at qemu-coroutine-lock.c:43
> > > #4  0x00007f30018d5a72 in qemu_bh_poll () at async.c:71
> > > #5  0x00007f3001982990 in main_loop_wait (nonblocking=<optimized out>)
> > >     at main-loop.c:472
> > > #6  0x00007f30018cf714 in main_loop () at /home/hch/work/qemu/vl.c:1481
> > > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> > >     at /home/hch/work/qemu/vl.c:3479
> > >
> > > Adding some printks suggests this happens when calling add_aio_request from
> > > aio_read_response when either delaying creates or updating metadata,
> > > although not every time one of these cases happens.
> > >
> > > I've tried to understand how the recursive calling happens, but
> > > unfortunately the whole coroutine code lacks any sort of documentation of
> > > how it should behave or what it asserts about the callers.
> > >
> > >> I don't have a sheepdog setup here but if there's an easy way to
> > >> reproduce please let me know and I'll take a look.
> > >
> > > With the small patch below applied to the sheepdog source I can reproduce
> > > the issue on my laptop using the following setup:
> > >
> > > for port in 7000 7001 7002; do
> > >     mkdir -p /mnt/sheepdog/$port
> > >     /usr/sbin/sheep -p $port -c local /mnt/sheepdog/$port
> > >     sleep 2
> > > done
> > >
> > > collie cluster format
> > > collie vdi create test 20G
> > >
> > > then start a qemu instance that uses the sheepdog volume with the
> > > following drive and device lines:
> > >
> > >     -drive if=none,file=sheepdog:test,cache=none,id=test \
> > >     -device virtio-blk-pci,drive=test,id=testdev \
> > >
> > > finally, in the guest run:
> > >
> > >     dd if=/dev/zero of=/dev/vdX bs=67108864 count=128 oflag=direct
> >
> > Thanks for these instructions.  I can reproduce the issue here.
> >
> > It seems suspicious the way that BDRVSheepdogState->co_recv and
> > ->co_send work.  The code adds select(2) read/write callback functions
> > on the sheepdog socket file descriptor.  When the socket becomes
> > writeable or readable the co_send or co_recv coroutines are entered.
> > So far, so good; this is how a coroutine is integrated into the main
> > loop of QEMU.
> >
> > The problem is that this patch is mixing things.  The co_recv
> > coroutine runs aio_read_response(), which invokes send_pending_req().
> > send_pending_req() invokes add_aio_request().  That function isn't
> > suitable for co_recv's context because it actually sends data and hits
> > a few blocking (yield) points.  It takes a coroutine mutex - but the
> > select(2) read callback is still in place.  We're now still in the
> > aio_read_response() call chain except we're actually not reading at
> > all, we're trying to write!  And we'll get spurious wakeups if there
> > is any data readable on the socket.
> >
> > So the co_recv coroutine has two things in the system that will try to
> > enter it:
> >
> > 1. The select(2) read callback on the sheepdog socket.
> > 2. The add_aio_request() blocking operations, including a coroutine mutex.
> >
> > This is bad; a yielded coroutine should only have one thing that will
> > enter it.  It's rare that it makes sense to have multiple things
> > entering a coroutine.
> >
> > It's late here but I hope this gives Kazutaka some thoughts on what is
> > causing the issue with this patch.
>
> Poked around some more this morning.  Here is a patch that isolates the
> bug:
>
> ---8<---
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index 26ad76b..0d7a03f 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -59,6 +59,16 @@ void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
>      QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
>      qemu_coroutine_yield();
>      assert(qemu_in_coroutine());
> +    {
> +        Coroutine *co;
> +
> +        QTAILQ_FOREACH(co, &queue->entries, co_queue_next) {
> +            assert(co != self);
> +        }
> +        QTAILQ_FOREACH(co, &unlock_bh_queue, co_queue_next) {
> +            assert(co != self);
> +        }
> +    }
>  }
>
>  void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue)
> ---8<---
>
> When you run this with ucontext or GThread coroutines you hit this
> assertion failure:
>
> qemu-system-x86_64: qemu-coroutine-lock.c:66: qemu_co_queue_wait: Assertion
> `co != self' failed.
>
> Here is an explanation of what the asserts are checking and how we
> violate the constraint:
>
> qemu_co_queue_next() and qemu_co_queue_next_bh() take a waiting
> Coroutine off the wait queue and put it onto the unlock bh queue.  When
> the BH executes, the coroutines on the unlock bh queue are popped off
> that queue and entered.  This is how coroutine wait queues work, and
> coroutine mutexes are built on coroutine wait queues.
>
> The problem is that, due to the spurious wakeups introduced by the
> sheepdog patch, we acquire the mutex earlier than normal, with a BH
> unlock still pending.  Our coroutine can then terminate by returning
> from sheepdog.c:aio_read_response() while the BH is still pending.
> When we get around to executing the BH we operate on a freed
> coroutine - BOOM!
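For reference, the bottom half that frame #3 of the backtrace points at
(qemu_co_queue_next_bh() in qemu-coroutine-lock.c) works roughly as follows;
this is a paraphrase of the QEMU code of that time, not a verbatim copy:

    /* Waiters released by qemu_co_queue_next() are moved onto
     * unlock_bh_queue and a bottom half is scheduled.  The BH below then
     * re-enters each queued coroutine from the main loop. */
    static void qemu_co_queue_next_bh(void *opaque)
    {
        Coroutine *next;

        while ((next = QTAILQ_FIRST(&unlock_bh_queue))) {
            QTAILQ_REMOVE(&unlock_bh_queue, next, co_queue_next);
            /* If a spurious fd wakeup has already entered 'next' and it has
             * run to completion, this enters a freed coroutine - the crash
             * described above. */
            qemu_coroutine_enter(next, NULL);
        }
    }

A coroutine sitting on unlock_bh_queue therefore must not be entered, or
allowed to finish, by anything else before this BH runs - exactly the
constraint the spurious read wakeups violate.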
Thanks for your detailed explanation.  I confirmed that the following
hack fixes the problem:

---8<---
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 17a79be..b27c770 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -626,6 +626,9 @@ static void coroutine_fn aio_read_response(void *opaque)
 
     switch (acb->aiocb_type) {
     case AIOCB_WRITE_UDATA:
+        /* this coroutine context is no longer suitable for co_recv
+         * because we may send data to update vdi objects */
+        s->co_recv = NULL;
         if (!is_data_obj(aio_req->oid)) {
             break;
         }
---8<---

> The reason we get different failure behavior with ucontext and GThread
> is that they have different implementations and reuse coroutines
> differently.  We've still done something illegal, but the undefined
> behavior happens to be different - both ucontext and GThread are
> working correctly; the problem is in the sheepdog patch.
>
> I suggest implementing coroutine socket I/O functions for both send
> *and* receive.  Then sheepdog request processing becomes simple - we'll
> have less callback and coroutine trickery because it's possible to
> write synchronous coroutine code.
>
> Right now the code is a mix of callbacks and some coroutine code (which
> has the flaw I explained above).  Things will be much easier if the
> code is converted fully to coroutines.

Yes, the original patch aimed to make Sheepdog I/O asynchronous with a
small change, so the current sheepdog code is not clean.  It looks like
a good time to make the sheepdog driver fully coroutine-based.

Thanks,

Kazutaka
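As a rough illustration of the coroutine socket I/O idea suggested above, a
send helper could be sketched as follows.  This is only a sketch, not the
code that was later merged: sd_co_send() and restart_co_req() are
illustrative names, and the qemu_aio_set_fd_handler(fd, read_handler,
write_handler, flush_handler, opaque) prototype is assumed from the QEMU
version of that period.  The key property is that the write handler is
installed only while the coroutine is actually waiting for the socket, and
removed before returning, so the yielded coroutine has exactly one wakeup
source:

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    /* Coroutine, coroutine_fn, qemu_coroutine_self(), qemu_coroutine_yield()
     * and qemu_aio_set_fd_handler() come from the QEMU-internal headers of
     * that era (qemu-coroutine.h, qemu-aio.h). */

    /* Write-readiness callback: its only job is to re-enter the coroutine
     * that is waiting in sd_co_send(). */
    static void restart_co_req(void *opaque)
    {
        Coroutine *co = opaque;

        qemu_coroutine_enter(co, NULL);
    }

    /* Send 'count' bytes, yielding whenever the socket would block. */
    static coroutine_fn ssize_t sd_co_send(int sockfd, const void *buf,
                                           size_t count)
    {
        Coroutine *self = qemu_coroutine_self();
        ssize_t done = 0;

        /* While yielded below, the write handler is the only thing that
         * may enter this coroutine. */
        qemu_aio_set_fd_handler(sockfd, NULL, restart_co_req, NULL, self);

        while (done < (ssize_t)count) {
            ssize_t ret = send(sockfd, (const char *)buf + done,
                               count - done, 0);
            if (ret > 0) {
                done += ret;
            } else if (ret == -1 && errno == EINTR) {
                continue;                   /* interrupted, retry */
            } else if (ret == -1 && errno == EAGAIN) {
                qemu_coroutine_yield();     /* re-entered by restart_co_req() */
            } else {
                done = -1;                  /* hard error (or unexpected 0) */
                break;
            }
        }

        /* Remove the handler before returning so no stale wakeup source is
         * left behind. */
        qemu_aio_set_fd_handler(sockfd, NULL, NULL, NULL, NULL);
        return done;
    }

A matching receive helper would be symmetric, installing a read handler
instead; with both in place, aio_read_response() and add_aio_request() can be
written as straight-line coroutine code rather than a mix of callbacks and
partial coroutine usage.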