On 25.10.2012 19:09, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> On 24.10.2012 16:32, Jamie Lokier wrote:
>>> Kevin Wolf wrote:
>>>> On 24.10.2012 14:16, Nicholas Thomas wrote:
>>>>> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
>>>>>> Since the I/O _order_ before, and sometimes after, flush, is
>>>>>> important for data integrity, this needs to be maintained when
>>>>>> I/Os are queued in the disconnected state -- including those
>>>>>> which were in flight at the time the disconnect was detected and
>>>>>> were then retried on reconnect.
>>>>>
>>>>> Hmm, discussing this on IRC I was told that it wasn't necessary to
>>>>> preserve order - although I forget the fine detail. Depending on
>>>>> the implementation of qemu's coroutine mutexes, operations may not
>>>>> actually be performed in order right now - it's not easy to work
>>>>> out what's happening.
>>>>
>>>> It's possible to reorder, but the order must be consistent with the
>>>> order in which completion is signalled to the guest. The semantics
>>>> of flush are that at the point the flush completes, all writes to
>>>> the disk that have already completed successfully are stable. It
>>>> says nothing about writes that are still in flight; they may or may
>>>> not be flushed to disk.
>>>
>>> I admit I wasn't thinking clearly about how much ordering NBD
>>> actually guarantees (or whether there is ordering the guest depends
>>> on implicitly even if it isn't guaranteed by the specification), and
>>> how that relates within QEMU to the virtio/FUA/NCQ/TCQ/SCSI-ORDERED
>>> ordering guarantees that the guest expects for various emulated
>>> devices and their settings.
>>>
>>> The ordering (if any) needed from the NBD driver (or any backend)
>>> will depend on the assumptions baked into the interface between QEMU
>>> device emulation <-> backend.
>>>
>>> E.g. if every device emulation waited for all outstanding writes to
>>> complete before sending a flush, then it wouldn't matter how the
>>> backend reordered its requests, even completing them out of order.
>>>
>>> Is that relationship documented (and conformed to)?
>>
>> No, like so many other things in qemu it isn't spelt out explicitly.
>> However, as I understand it, it's the same behaviour as real hardware
>> has, so device emulation, at least for the common devices, doesn't
>> have to implement anything special for it - that is, if the hardware
>> supports parallel requests at all; otherwise there is automatically
>> only a single request in flight (like IDE).
>
> That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
> common.
>
> They are features of devices which support multiple parallel requests,
> but with certain ordering constraints conveyed or expected by the
> guest, which have to be ensured when they are mapped onto QEMU's fully
> asynchronous backend.
>
> That means they are features of the hardware which device emulations
> _do_ have to implement. If they don't, the storage is unreliable under
> events like host power removal and virtual power removal.
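(For illustration: a minimal, self-contained sketch of the
drain-then-flush pattern described above, using POSIX AIO in place of
qemu's bdrv_aio_* interface - the file name and request count are
arbitrary, not anything from qemu. The emulation waits for every
outstanding write to complete before issuing the flush, so the backend
is free to complete the writes in any order.)

/* Drain-then-flush sketch: a device emulation that needs "all prior
 * writes stable" waits for their completion before issuing the flush,
 * so the backend may complete the writes in any order.  POSIX AIO
 * stands in for qemu's bdrv_aio_* interface here; the file name and
 * request count are arbitrary.  Build with: cc drain_flush.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NREQ 4

int main(void)
{
    int fd = open("disk.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cbs[NREQ];
    char bufs[NREQ][512];
    const struct aiocb *pending[NREQ];

    /* Submit NREQ parallel writes; the backend may complete them in
     * any order. */
    for (int i = 0; i < NREQ; i++) {
        memset(bufs[i], 'A' + i, sizeof(bufs[i]));
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = sizeof(bufs[i]);
        cbs[i].aio_offset = (off_t)i * 512;
        if (aio_write(&cbs[i]) < 0) { perror("aio_write"); return 1; }
        pending[i] = &cbs[i];
    }

    /* Drain: wait until every write has completed.  Only now may the
     * flush be sent if its completion is to mean "those writes are
     * stable". */
    for (int i = 0; i < NREQ; i++) {
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(pending, NREQ, NULL);
        if (aio_return(&cbs[i]) < 0) { perror("aio_write"); return 1; }
    }

    /* Flush: on completion, all of the above writes are stable. */
    struct aiocb fcb;
    memset(&fcb, 0, sizeof(fcb));
    fcb.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &fcb) < 0) { perror("aio_fsync"); return 1; }
    const struct aiocb *flist[1] = { &fcb };
    while (aio_error(&fcb) == EINPROGRESS)
        aio_suspend(flist, 1, NULL);
    printf("flush complete: all %d writes stable\n", NREQ);
    close(fd);
    return 0;
}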
Yes, device emulations that need to maintain a given order must take
care to wait for the completion of the previous requests.

> If the backends are allowed to explicitly have no coupling between
> different request types (even flush/discard and write), and ordering
> constraints are enforced by the order in which device emulations
> submit and wait, that's fine.
>
> I mention this because POSIX aio_fsync() is _not_ fully decoupled
> according to its specification.
>
> So it might be that some device emulations by now depend on the
> semantics of aio_fsync() or the QEMU equivalent; randomly reordering
> requests in the NBD driver (or any other backend) in unusual
> circumstances would break those semantics.

qemu AIO has always had these semantics, ever since bdrv_aio_flush()
was introduced, and it behaves the same way for image files. So I don't
see any problem with NBD making use of the same semantics.

Kevin
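(For reference, a minimal sketch of the coupling Jamie describes: per
the POSIX specification, aio_fsync() synchronizes all operations queued
at the time of the call, even those still in flight, which is stronger
than a flush contract that only covers already-completed writes. POSIX
AIO again stands in for the qemu block layer; the file name is
arbitrary.)

/* Coupling sketch: POSIX specifies that aio_fsync() synchronizes all
 * I/O operations queued at the time of the call - including writes
 * still in flight - which is stronger than a "completed writes only"
 * flush contract.  The file name is arbitrary.  Build with:
 * cc fsync_coupling.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("disk.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[512] = "payload";
    struct aiocb wcb, fcb;
    memset(&wcb, 0, sizeof(wcb));
    memset(&fcb, 0, sizeof(fcb));
    wcb.aio_fildes = fd;
    wcb.aio_buf    = buf;
    wcb.aio_nbytes = sizeof(buf);
    wcb.aio_offset = 0;

    /* Queue the write... */
    if (aio_write(&wcb) < 0) { perror("aio_write"); return 1; }

    /* ...and queue the fsync immediately, without waiting for the
     * write to complete.  Per POSIX, this fsync still covers the
     * write, because the write was queued before aio_fsync() was
     * called.  A backend that reorders the flush ahead of in-flight
     * writes would break code relying on this coupling. */
    fcb.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &fcb) < 0) { perror("aio_fsync"); return 1; }

    const struct aiocb *list[2] = { &wcb, &fcb };
    while (aio_error(&fcb) == EINPROGRESS)
        aio_suspend(list, 2, NULL);
    printf("fsync complete: the queued write is covered as well\n");
    close(fd);
    return 0;
}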