On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kw...@redhat.com> wrote: > Am 09.08.2011 14:00, schrieb Stefan Hajnoczi: >> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote: >>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi: >>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefa...@gmail.com> >>>> wrote: >>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kw...@redhat.com> wrote: >>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kw...@redhat.com> wrote: >>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kw...@redhat.com> wrote: >>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kw...@redhat.com> >>>>>>>>>>> wrote: >>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. >>>>>>>>>>>>> useful for >>>>>>>>>>>>> changing the hostcache parameter). The problem is that closing >>>>>>>>>>>>> the file first >>>>>>>>>>>>> and then opening it again exposes us to the error case where the >>>>>>>>>>>>> open fails. >>>>>>>>>>>>> At that point we cannot get to the file anymore and our options >>>>>>>>>>>>> are to >>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>>>>>>> >>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the >>>>>>>>>>>>> file descriptor >>>>>>>>>>>>> around and falling back to it should the open fail. >>>>>>>>>>>>> >>>>>>>>>>>>> The challenge for the file descriptor approach is that image >>>>>>>>>>>>> formats, like >>>>>>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as >>>>>>>>>>>>> simple as >>>>>>>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>>>>>>> >>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't >>>>>>>>>>>> assume >>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is >>>>>>>>>>>> based >>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be >>>>>>>>>>>> hidden >>>>>>>>>>>> in raw-posix. >>>>>>>>>>>> >>>>>>>>>>>> I think something like this could do: >>>>>>>>>>>> >>>>>>>>>>>> struct BDRVReopenState { >>>>>>>>>>>> BlockDriverState *bs; >>>>>>>>>>>> /* can be extended by block drivers */ >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, >>>>>>>>>>>> int >>>>>>>>>>>> flags); >>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>>>>>>> >>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. >>>>>>>>>>>> On >>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the >>>>>>>>>>>> old >>>>>>>>>>>> one and closes the newly opened one. >>>>>>>>>>>> >>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I >>>>>>>>>>>> had in >>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>>>>>>> >>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>>>>>>> not 100% clear on the idea. >>>>>>>>>> >>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a >>>>>>>>>> wrapper >>>>>>>>>> function that does both, but that's syntactic sugar). >>>>>>>>>> >>>>>>>>>> For example we would have: >>>>>>>>>> >>>>>>>>>> int vmdk_reopen() >>>>>>>>> >>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does >>>>>>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>>>>>>> >>>>>>>> Makes sense. >>>>>>>> >>>>>>>>> >>>>>>>>>> { >>>>>>>>>> *((VMDKReopenState**) rs) = malloc(); >>>>>>>>>> >>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>>>>>>> if (ret < 0) >>>>>>>>>> goto fail; >>>>>>>>>> } >>>>>>>>>> return 0; >>>>>>>>>> >>>>>>>>>> fail: >>>>>>>>>> foreach (extent in rs->already_reopened) { >>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>> } >>>>>>>>>> return ret; >>>>>>>>>> } >>>>>>>>> >>>>>>>>>> void vmdk_reopen_commit() >>>>>>>>>> { >>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>> bdrv_reopen_commit(extent->reopen_state); >>>>>>>>>> } >>>>>>>>>> free(rs); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> void vmdk_reopen_abort() >>>>>>>>>> { >>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>> } >>>>>>>>>> free(rs); >>>>>>>>>> } >>>>>>>>> >>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>>>>>>> &rs)? >>>>>>>> >>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit. >>>>>>>> >>>>>>>> Do you have a use case where it would be helpful if the caller invoked >>>>>>>> bdrv_close? >>>>>>> >>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open >>>>>>> for the same image file. I thought this was a property we wanted. >>>>>>> >>>>>>> Also, in the block_set_hostcache case we need to reopen without >>>>>>> switching to a new BlockDriverState instance. That means the reopen >>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer. >>>>>>> We cannot create a new instance. >>>>>> >>>>>> Yes, but where do you even get the second BlockDriverState from? >>>>>> >>>>>> My prototype only returns an int, not a new BlockDriverState. Until >>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and >>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to >>>>>> the new ones. >>>>> >>>>> It seems I don't understand the API. I thought it was: >>>>> >>>>> do_block_set_hostcache() >>>>> { >>>>> bdrv_prepare_reopen(bs, &rs); >>>>> ...open new file and check everything is okay... >>>>> if (ret == 0) { >>>>> bdrv_reopen_commit(rs); >>>>> } else { >>>>> bdrv_reopen_abort(rs); >>>>> } >>>>> return ret; >>>>> } >>>>> >>>>> If the caller isn't opening the new file then what's the point of >>>>> giving the caller control over prepare, commit, and abort? >>>> >>>> After sending the last email I realized what I was missing: >>>> >>>> You need the prepare, commit, and abort API in order to handle >>>> multi-file block drivers like VMDK. >>> >>> Yes, this is whole point of separating commit out. Does the proposal >>> make sense to you now? >> >> It depends on the details. Adding more functions that every BlockDriver >> must implement is bad, so it's important that we only drop this >> functionality into raw-posix.c, vmdk.c, and block.c as appropriate. > > Yes, I agree. > >> I liked the idea of doing a generic FDStash type that the monitor and >> bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level >> takes that further. > > Well, having something that works for the raw-posix, the monitor and > maybe some more things is nice. Having something that works for > raw-posix, Sheepdog and rbd is an absolute requirement and I can't see > how FDStash solves that.
For Sheepdog also network access functions would need to be hooked. RBD seems to use librados functions for low level I/O, so that needs some RBD specific wrappers. > Even raw-win32 doesn't have an int fd, but a > HANDLE hfile for its backend. Just replace "int fd" with "FDStash fd" everywhere? > So my main concern is that file descriptors are a concept not as generic > as it needs to be to suit all block drivers. > >> But if we can do prepare, commit, and abort in a relatively simple way >> then I'm for it. > > Great, then let's do that. > > Kevin > >