On 20.04.2016 02:03, Matthew Schumacher wrote:
> Max,
>
> Qemu still crashes for me, but the debug is again very different. When
> I attach to the qemu process from gdb, it is unable to provide a
> backtrace when it crashes. The log file is different too. Any ideas?
>
> qemu-system-x86_64: block.c:2307: bdrv_replace_in_backing_chain:
> Assertion `!bdrv_requests_pending(old)' failed.
This message is exactly the same as the one you saw in 2.5.1, so I guess we've at least averted a regression in 2.6.0. I'm CC-ing some people who are more involved with this (Paolo is on PTO right now, but well...).

(The following is more of a note to those people than to you, Matthew.)

Summary: I think bdrv_drained_begin() does not behave as advertised.

The assertion that fails here checks that no requests are pending on the mirror block job's source BDS. However, we do invoke bdrv_drained_begin() on exactly that BDS at the end of mirror_run(). When that function returns, there are indeed no more requests pending for that BDS. But by the time mirror_exit() is invoked, new requests may be pending again.

I reproduced this by running bonnie++ in a guest, committing a snapshot, and invoking block-job-complete right after the BLOCK_JOB_READY event. Sometimes bdrv_requests_pending(s->common.bs) is true in mirror_exit() (which is bad), sometimes it's false. I just used a plain virtio-blk drive without dataplane.

I'm not sure exactly how bdrv_drained_begin() and, in turn, aio_disable_external() are supposed to work, but as a matter of fact a BDS may receive requests even after those functions have been called. Just putting an assert(!bs->quiesce_counter) in tracked_request_begin() makes it fail even before I start the mirror block job (due to some flush).

So in my case, the problematic request regarding the mirroring comes from blk_aio_read_entry(); putting an assert(!blk_bs(blk)->quiesce_counter) into blk_aio_readv() yields the following backtrace:

#0  0x00007f3e750bd2a8 in raise () from /usr/lib/libc.so.6
No symbol table info available.
#1  0x00007f3e750be72a in abort () from /usr/lib/libc.so.6
No symbol table info available.
#2  0x00007f3e750b61b7 in __assert_fail_base () from /usr/lib/libc.so.6
No symbol table info available.
#3  0x00007f3e750b6262 in __assert_fail () from /usr/lib/libc.so.6
No symbol table info available.
#4  0x0000564cf7d4e25e in blk_aio_readv (blk=<optimized out>, sector_num=<optimized out>, iov=<optimized out>, nb_sectors=<optimized out>, cb=<optimized out>, opaque=<optimized out>) at qemu/block/block-backend.c:1002
        __PRETTY_FUNCTION__ = "blk_aio_readv"
#5  0x0000564cf7ab2cf3 in submit_requests (niov=<optimized out>, num_reqs=<optimized out>, start=<optimized out>, mrb=<optimized out>, blk=<optimized out>) at qemu/hw/block/virtio-blk.c:361
        nb_sectors = <optimized out>
        is_write = <optimized out>
        qiov = <optimized out>
        sector_num = <optimized out>
#6  virtio_blk_submit_multireq (blk=0x564cf9f80250, mrb=mrb@entry=0x7ffeffbfce40) at qemu/hw/block/virtio-blk.c:391
        i = <optimized out>
        start = <optimized out>
        num_reqs = <optimized out>
        niov = <optimized out>
        nb_sectors = <optimized out>
        max_xfer_len = <optimized out>
        sector_num = <optimized out>
#7  0x0000564cf7ab38c2 in virtio_blk_handle_vq (s=0x564cf9e51268, vq=<optimized out>) at qemu/hw/block/virtio-blk.c:593
        req = 0x0
        mrb = {reqs = {0x564cfb8e8c30, 0x564cfb7bc290, 0x0 <repeats 30 times>}, num_reqs = 2, is_write = false}
#8  0x0000564cf7addcf5 in virtio_queue_notify_vq (vq=0x564cfa000be0) at qemu/hw/virtio/virtio.c:1108
        vdev = 0x564cf9e51268
#9  0x0000564cf7d19980 in aio_dispatch (ctx=0x564cf9e42f40) at qemu/aio-posix.c:327
        tmp = <optimized out>
        revents = <optimized out>
        node = 0x7f3e54015030
        progress = false
#10 0x0000564cf7d0eecd in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at qemu/async.c:233
        ctx = <optimized out>
#11 0x00007f3e781d7f07 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
No symbol table info available.
#12 0x0000564cf7d1803b in glib_pollfds_poll () at qemu/main-loop.c:213
        context = 0x564cf9e44800
        pfds = <optimized out>
#13 os_host_main_loop_wait (timeout=<optimized out>) at qemu/main-loop.c:258
        ret = 2
        spin_counter = 2
#14 main_loop_wait (nonblocking=<optimized out>) at qemu/main-loop.c:506
        ret = 2
        timeout = 1000
        timeout_ns = <optimized out>
#15 0x0000564cf7a4c91c in main_loop () at qemu/vl.c:1934
        nonblocking = <optimized out>
        last_io = 0
#16 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at qemu/vl.c:4658

Maybe bdrv_drained_begin() is supposed to work like this and let this request through, but that would be pretty counter-intuitive.

Max