On 13.10.2010 14:13, Stefan Hajnoczi wrote:
> On Tue, Oct 12, 2010 at 11:16:17AM -0500, Anthony Liguori wrote:
>> On 10/12/2010 10:59 AM, Stefan Hajnoczi wrote:
>>> On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
>>>> On 12.10.2010 17:22, Anthony Liguori wrote:
>>>>> On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>>>>>> Otherwise we might destroy data that isn't
>>>>>> even touched by the guest request in case of a crash.
>>>>>>
>>>>> The failure scenarios are either that the cluster is leaked, in
>>>>> which case the old version of the data is still present, or the
>>>>> cluster is orphaned because the L2 entry is written, in which case
>>>>> the old version of the data is present.
>>>> Hm, how does the latter case work? Or rather, what do you mean by
>>>> "orphaned"?
>>>>
>>>>> Are you referring to a scenario where the cluster is partially
>>>>> written because the data is present in the write cache and the
>>>>> write cache isn't flushed on power failure?
>>>> The case I'm referring to is a COW. So let's assume a partial write
>>>> to an unallocated cluster; we then need to do a COW in pre/postfill.
>>>> Then we do a normal write and link the new cluster in the L2 table.
>>>>
>>>> Assume that the write to the L2 table is already on the disk, but
>>>> the pre/postfill data isn't yet. At this point we have a bad state,
>>>> because if we crash now we have lost the data that should have been
>>>> copied from the backing file.
>>> In this case QED_F_NEED_CHECK is set and the invalid cluster offset
>>> should be reset to zero on open.
>>>
>>> However, I think we can get into a state where the pre/postfill data
>>> isn't on the disk yet but another allocation has increased the file
>>> size, making the unwritten cluster "valid". This fools the
>>> consistency check into thinking the data cluster (which was never
>>> written to on disk) is valid.
>>>
>>> Will think about this more tonight.
>>
>> It's fairly simple to add a sync to this path. It's probably worth
>> checking the prefill/postfill for zeros and avoiding the write/sync
>> if that's the case. That should optimize the common case of
>> allocating new space within a file.
>>
>> My intuition is that we can avoid the sync entirely, but we'll need
>> to think about it further.
>
> We can avoid it when a backing image is not used. Your idea to check
> for zeroes in the backing image is neat too; it may well reduce the
> common case even for backing images.
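As a rough sketch of what the zero check could look like (all names
here are made up for illustration; this is not the actual QED code
path):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the prefill/postfill data read from the backing file is
 * all zeroes. */
static bool buf_is_zero(const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

/* Decide whether the COW write (and the sync ordering it before the
 * L2 update) is needed.  A cluster newly allocated by extending the
 * file reads back as zeroes anyway, so writing zeroes would only add
 * a needless flush. */
static bool cow_write_needed(const uint8_t *backing_data, size_t len)
{
    return !buf_is_zero(backing_data, len);
}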
The additional requirement is that we're extending the file and not
reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
doesn't work on host_devices anyway.)
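Putting the conditions from this thread together, the skip would be
guarded by something like this (again only illustrative, not real QEMU
code; "is_file_extension" stands in for the actual "cluster allocated
at EOF, not reused" check):

#include <stdbool.h>

/* The sync before the L2 update can only be skipped when the zeroes
 * the guest would read are guaranteed by the image file itself. */
static bool can_skip_cow_sync(bool cow_data_is_zero,
                              bool is_file_extension,
                              bool zero_init)
{
    /* Reused clusters may contain stale data, and block devices
     * (bdrv_has_zero_init() == false) don't zero newly allocated
     * space, so in either case the prefill/postfill data must be
     * written and synced before the L2 update. */
    return cow_data_is_zero && is_file_extension && zero_init;
}

Kevin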