On Thu, Mar 8, 2012 at 2:27 AM, Dong Xu Wang <wdon...@linux.vnet.ibm.com> wrote: > On Wed, Mar 7, 2012 at 00:59, Stefan Hajnoczi <stefa...@gmail.com> wrote: >> On Thu, Mar 1, 2012 at 2:56 AM, Dong Xu Wang <wdon...@linux.vnet.ibm.com> >> wrote: >>> +static int add_cow_update_bitmap(BlockDriverState *bs, int64_t sector_num, >>> + int nb_sectors) >>> +{ >>> + BDRVAddCowState *s = bs->opaque; >>> + uint8_t *bitmap; >>> + >>> + int i, ret = 0; >>> + for (i = 0; i < nb_sectors; i++) { >>> + int ret = add_cow_cache_get(bs, s->bitmap_cache, >>> + (sector_num + i) / 8 & ~(ADD_COW_CLUSTER_SIZE - 1), (void >>> **)&bitmap); >>> + if (ret < 0) { >>> + abort(); >>> + } >>> + *(bitmap + ((sector_num + i) / 8 & (ADD_COW_CLUSTER_SIZE - 1))) |= >>> + (1 << ((sector_num + i) % 8)); >>> + add_cow_cache_entry_mark_dirty(s->bitmap_cache, bitmap); >>> + >>> + } >>> + ret = add_cow_cache_flush(bs, s->bitmap_cache); >> >> Data must be flushed to the cow file before writing metadata updates, >> otherwise a crash in between could result in zero sectors instead of >> backing file sectors. >> > Sorry, what does it mean? The correct steps shoud be: > 1. flush to the image file > 2. if 1 succeeds, flush to add-cow? > That means metadata updates should in image file flushing step, not > in add_cow_co_writev?
You only need to flush when there are metadata updates. So when the guest writes to a sector that is already allocated in the image file, no metadata updates are necessary and we don't need to flush. The guest is expect to issue a flush operation if it really wants to flush. However, there is a data integrity problem in the case where metadata updates *are* required. When we allocate new sectors in the image file we need to make sure that they have been populated and are safely on disk. Otherwise we risk data corruption if the system crashes. Here is the scenario: 1. Guest issues a write, this will require allocating new sectors in the image file. Therefore we'll need to update the bitmap in the .add-cow file. 2. Write guest data buffer to image file. 3. Write bitmap update to .add-cow file. If there is no flush between #2 and #3 then the write operations are unordered. #2 could appear before #3. Or #3 could appear before #2. I say "appear" because the write operation will complete in order, but we're not guaranteed to "see" the data on disk after crash due to volatile disk write caches or the host OS page cache (when -drive ...,cache=writeback). If we want to ensure that #2 happens before #3 then we need to flush before #3. This will force all layers of caching to put the data safely onto disk. The reason *why* this is necessary is because we must avoid putting #3 on disk before #2. If the system crashes and #2 never makes it to disk but #3 did, then we will have allocated sectors in the image file which are empty - they will be full of zeroes. That is incorrect because the guest either should see the data from the backing file if the write didn't complete during the crash, or it should see the data it was writing if the write did complete during the crash. >>> +static int bdrv_add_cow_truncate(BlockDriverState *bs, int64_t offset) >>> +{ >>> + int ret = 0; >>> + int64_t image_sectors = offset / BDRV_SECTOR_SIZE; >>> + BDRVAddCowState *s = bs->opaque; >>> + int64_t old_image_sector = s->image_hd->total_sectors; >>> + >>> + ret = bdrv_truncate(bs->file, sizeof(AddCowHeader) + ((image_sectors + >>> 7) >> 3)); >>> + if (ret < 0) { >>> + bdrv_truncate(s->image_hd, old_image_sector * BDRV_SECTOR_SIZE); >>> + return ret; >>> + } >>> + return ret; >>> +} >> >> I'm surprised that we truncate the image file...but only in the error case. >> >> Also, do we need to resize the cow cache? >> > cow cache should be resized, because the function will be called while > creating. > > Kevin gave comments in v2: > "bs->file only contains metadata, so why is this correct? Shoudln't you > update the header with the new size and truncate s->image_hd?" > > So I want to know should image_file have the same virtual size as the > backing_file? qcow2 and qed do not require backing file and image to have the same size. However, when you truncate it changes the size of the image file. I think it's good to have the flexibility of a smaller (or larger) backing file. Supporting this is useful because there are some cases, like provisioning new VMs where it can be advantageous to take a compact, small backing file and create a larger image from it which might append a data partition at the end of the disk. >> 2. Code that accesses the bitmap first divides by ADD_COW_CLUSTER_SIZE >> to operate on an entire cluster. Right now the lookup code seems to >> do it *during* the bitmap lookup instead of before. > Sorry, I do not understand this comment clearly. Could you explain more? If you implement clusters (for example, 64 KB size) then each bit in the bitmap covers 64 KB of data in the image file. This is why I expected the code to divide by cluster size before accessing the bitmap - because we discard the information about which exact sector we're accessing and only work at 64 KB granularity. Stefan