On 06/09/2012 12:29, Kevin Wolf wrote:
>> That's quite difficult. Completion of an I/O operation can trigger
>> another I/O operation on another block device, and so on until we go
>> back to the first device (think of a hypothetical RAID-5 device).
>
> You always have a tree of BDSes, and children should only ever trigger
> completion of I/O operations in their parents. Am I missing anything?

Yes, it can only ever trigger completion in the parents. So if you had
bdrv_drain in the children, it could leave other I/O pending on the
siblings, but that's okay. Very nice!! I hadn't thought of the tree.
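As a sanity check of the invariant, per-subtree draining could then look
roughly like this. Untested sketch: bdrv_requests_pending() is a
hypothetical helper ("this BDS still has in-flight requests"), and I
only consider bs->file and bs->backing_hd as children:

    /* A subtree is busy if any BDS in it still has in-flight requests. */
    static bool bdrv_subtree_busy(BlockDriverState *bs)
    {
        return bdrv_requests_pending(bs) ||
               (bs->file && bdrv_subtree_busy(bs->file)) ||
               (bs->backing_hd && bdrv_subtree_busy(bs->backing_hd));
    }

    static void bdrv_drain_subtree(BlockDriverState *bs)
    {
        /* Completions in a child can only trigger new I/O in its parent,
         * never in a sibling, so quiescing this subtree is enough even
         * while other trees still have I/O pending. */
        while (bdrv_subtree_busy(bs)) {
            qemu_aio_wait();
        }
    }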
>>> Doesn't change anything about this problem, though. So the options that
>>> we have are:
>>>
>>> 1. Delay the callback using a BH. Doing this in each driver is ugly.
>>>    But is there actually more than one possible callback in today's
>>>    coroutine world? I only see bdrv_co_io_em_complete(), which could
>>>    reenter the coroutine from a BH.
>>
>> Easy and safe, but it feels a bit like a timebomb. Also, I'm not
>> entirely sure of _why_ the bottom half works. :)
>
> Hm, safe and time bomb is contradictory in my book. :-)

Well, safe for now. :)

> The bottom half works because we're not reentering the qcow2_create
> coroutine immediately, so the gluster AIO callback can complete all of
> its cleanup work without being interrupted by code that might wait on
> this particular request and create a deadlock this way.

Got it now. It's just (2) with a bottom half instead of a simple code
reorganization. (There is a sketch of the BH variant at the end of this
mail.)

>>> 2. Delay the callback by just calling it later when the cleanup has
>>>    been completed and .io_flush() can return 0. You say that it's hard
>>>    to implement for some drivers, except if the AIOCB are leaked until
>>>    the end of functions like qcow2_create().
>>
>> ... which is what we do in posix-aio-compat.c; nobody screamed so far.
>
> True. Would be easy to fix in posix-aio-compat, though, or can a
> callback expect that the AIOCB is still valid?

IMO no. What would you use it for, anyway? It's opaque, so all you
could do is bdrv_aio_cancel it.

I checked all of the callers of bdrv_aio_cancel. SCSI always zeroes its
aiocb pointers on entry to the AIO callback. IDE is a bit less
explicit, but in the end it always zeroes the aiocb as well. AHCI
probably has a bug there (it does not NULL the aiocb in ncq_cb).
virtio and Xen do not even store the aiocb, i.e. they couldn't care
less.

>> Not really hard, it just has to be assessed for each driver separately.
>> We can just do it in gluster and refactor it later.
>
> Okay, so let's keep it as an option for now.
>
>>> 3. Add a delay only later in functions like bdrv_drain_all() that assume
>>>    that the request has completed. Are there more of this type? AIOCBs
>>>    are leaked until a bdrv_drain_all() call. Does it work with draining
>>>    specific BDSes instead of everything?
>>>
>>> Unless I forgot some important point, it almost looks like option 1 is
>>> the easiest and safest.
>>
>> I agree with your opinion, but I would feel better if I understood
>> better why it works. (2) can be done easily in each driver (no
>> ugliness) and refactored later.
>
> I think option 2 must be done in each driver by design, or do you see
> even a theoretical way how to do it generically?

Yes, it has to be done in every driver. But if we added something like

    static void qemu_aio_complete(BlockDriverAIOCB *acb, int ret)
    {
        BlockDriverCompletionFunc *cb = acb->cb;
        void *opaque = acb->opaque;

        qemu_aio_release(acb);
        cb(opaque, ret);
    }

then converting a driver to qemu_aio_complete would avoid the leak (see
the sketch after my signature).

Paolo
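Untested sketch of what a converted driver completion path could look
like, assuming the usual pattern of embedding BlockDriverAIOCB as
acb->common. HypotheticalAIOCB and hypothetical_cleanup() are
placeholders for whatever the real driver (gluster, in this thread)
actually has:

    static void hypothetical_aio_cb(void *opaque, int ret)
    {
        HypotheticalAIOCB *acb = opaque;

        /* Finish all cleanup first, so that .io_flush() already sees
         * the request as gone when the guest-visible callback runs. */
        hypothetical_cleanup(acb);

        /* Release the AIOCB *before* entering the guest-visible
         * callback; the callers of bdrv_aio_cancel do not rely on it
         * staying valid, as discussed above. */
        qemu_aio_complete(&acb->common, ret);
    }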
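And the BH variant of option 1, for reference. Untested sketch: I am
assuming a bh field can be added to block.c's CoroutineIOCompletion,
and the point is only the ordering, not the exact code:

    typedef struct CoroutineIOCompletion {
        Coroutine *coroutine;
        int ret;
        QEMUBH *bh;
    } CoroutineIOCompletion;

    static void bdrv_co_io_em_bh(void *opaque)
    {
        CoroutineIOCompletion *co = opaque;

        qemu_bh_delete(co->bh);
        qemu_coroutine_enter(co->coroutine, NULL);
    }

    static void bdrv_co_io_em_complete(void *opaque, int ret)
    {
        CoroutineIOCompletion *co = opaque;

        co->ret = ret;
        /* Do not reenter the coroutine here; by the time the BH runs,
         * the driver's AIO callback has returned and finished all of
         * its cleanup, so nothing can wait on this request and
         * deadlock. */
        co->bh = qemu_bh_new(bdrv_co_io_em_bh, co);
        qemu_bh_schedule(co->bh);
    }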