On Jun 3 13:14, Jonathan Derrick wrote: > When metadata is present in the namespace and deallocates are issued, the > first > deallocate could fail to zero the block range, resulting in another > deallocation to be issued. Normally after the deallocation completes and the > range is checked for zeroes, a deallocation is then issued for the metadata > space. In the failure case where the range is not zeroed, deallocation is > reissued for the block range (and followed with metadata deallocation), but > the > original range deallocation task will also issue a metadata deallocation: > > nvme_dsm_cb() > *range deallocation* > nvme_dsm_md_cb() > if (nvme_block_status_all()) (range deallocation failure) > nvme_dsm_cb() > *range deallocation* > nvme_dsm_md_cb() > if (nvme_block_status_all()) (no failure) > *metadata deallocation* > *metadata deallocation* > > This sequence results in reentry of nvme_dsm_cb() before the metadata has been > deallocated. During reentry, the metadata is deallocated in the reentrant > task. > nvme_dsm_bh() is called which deletes and sets iocb->bh to NULL. When reentry > returns from nvme_dsm_cb(), metadata deallocation takes place again, and > results in a null pointer dereference on the iocb->bh: > > BH deletion: > #0 nvme_dsm_bh (opaque=0x55ef893e2f10) at ../hw/nvme/ctrl.c:2316 > #1 0x000055ef868eb333 in aio_bh_call (bh=0x55ef8a441b30) at > ../util/async.c:141 > #2 0x000055ef868eb441 in aio_bh_poll (ctx=0x55ef892c6e40) at > ../util/async.c:169 > #3 0x000055ef868d2789 in aio_dispatch (ctx=0x55ef892c6e40) at > ../util/aio-posix.c:415 > #4 0x000055ef868eb896 in aio_ctx_dispatch (source=0x55ef892c6e40, > callback=0x0, user_data=0x0) at ../util/async.c:311 > #5 0x00007f5bfe4ab17d in g_main_context_dispatch () at > /lib/x86_64-linux-gnu/libglib-2.0.so.0 > #6 0x000055ef868fcd98 in glib_pollfds_poll () at ../util/main-loop.c:232 > #7 0x000055ef868fce16 in os_host_main_loop_wait (timeout=0) at > ../util/main-loop.c:255 > #8 0x000055ef868fcf27 in main_loop_wait (nonblocking=0) at > ../util/main-loop.c:531 > #9 0x000055ef864a2442 in qemu_main_loop () at ../softmmu/runstate.c:726 > #10 0x000055ef860f957a in main (argc=29, argv=0x7ffdc9705508, > envp=0x7ffdc97055f8) at ../softmmu/main.c:50 > > nvme_dsm_cb() called for metadata after nvme_dsm_bh() completes from > reentrant task: > Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault. > 0x000055ef868eb07c in aio_bh_enqueue (bh=0x0, new_flags=2) at > ../util/async.c:70 > 70 AioContext *ctx = bh->ctx; > (gdb) backtrace > #0 0x000055ef868eb07c in aio_bh_enqueue (bh=0x0, new_flags=2) at > ../util/async.c:70 > #1 0x000055ef868eb4cf in qemu_bh_schedule (bh=0x0) at ../util/async.c:186 > #2 0x000055ef862db21e in nvme_dsm_cb (opaque=0x55ef897b41a0, ret=0) at > ../hw/nvme/ctrl.c:2423 > #3 0x000055ef8665a662 in blk_aio_complete (acb=0x55ef89c6d8c0) at > ../block/block-backend.c:1419 > #4 0x000055ef8665a940 in blk_aio_write_entry (opaque=0x55ef89c6d8c0) at > ../block/block-backend.c:1486 > #5 0x000055ef868edcf2 in coroutine_trampoline (i0=-536848976, i1=32602) at > ../util/coroutine-ucontext.c:173 > #6 0x00007f5bfe0bc510 in __start_context () at > ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 > #7 0x00007f5bf757bb40 in () > #8 0x0000000000000000 in () > > The fix is to return when an nvme_dsm_cb() is reentered due to failure to > deallocate the block range, so that metadata deallocate is then only issued in > the reentrant task and prevent doing it again when the reentrant task returns > to the original task. > > Reproduction steps (with emulated namespace): > nvme format --lbaf=1 -f /dev/nvme0n1 > mkfs.ext4 /dev/nvme0n1 > mkfs.ext4 -F /dev/nvme0n1 > > Signed-off-by: Francis Pravin AntonyX Michael Raj > <francis.mich...@solidigm.com> > Signed-off-by: Michael Kropaczek <michael.kropac...@solidigm.com> > Signed-off-by: Jonathan Derrick <jonathan.derr...@solidigm.com> > --- > hw/nvme/ctrl.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index 03760ddeae..74540a03d5 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -2372,6 +2372,7 @@ static void nvme_dsm_md_cb(void *opaque, int ret) > } > > nvme_dsm_cb(iocb, 0); > + return; > } > > iocb->aiocb = blk_aio_pwrite_zeroes(ns->blkconf.blk, nvme_moff(ns, slba), > -- > 2.25.1 >
Yeah, thanks for the elaborate analysis! A fix for this has been lingering in nvme-next. I sincerely apologize for not sending the pull request to master earlier, which might have saved you the trouble of tracking this down. I just sent the pull request to Peter.
signature.asc
Description: PGP signature