> In any case, the next step is to get down to specifics. Here is the
> page with the current QCOW3 roadmap:
>
> http://wiki.qemu.org/Qcow3_Roadmap
>
> Please raise concrete requirements or features so they can be
> discussed and captured.
Now it turns into a more productive discussion, but it seems to lose the big picture too quickly and go too narrowly into issues like the "dirty bit". Let's try to answer a bigger question: how do we take a holistic approach to address all the factors that make a virtual disk slower than a physical disk? Even if issues like the "dirty bit" are addressed perfectly, they may still be only a small part of the total solution. The discussion of internal snapshots is at the end of this email.

Compared with a physical disk, a virtual disk (even RAW) incurs some or all of the following overheads. Obviously, the way to achieve high performance is to eliminate or reduce these overheads.

Overhead at the image level:

I1: Data fragmentation caused by an image format.
I2: Overhead in reading an image format's metadata from disk.
I3: Overhead in writing an image format's metadata to disk.
I4: Inefficiency and complexity in the block driver implementation, e.g., waiting synchronously for metadata reads or writes, submitting I/O requests sequentially when they should be issued concurrently, performing unnecessary flushes, etc.

Overhead at the host file system level:

H1: Data fragmentation caused by the host file system.
H2: Overhead in reading the host file system's metadata.
H3: Overhead in writing the host file system's metadata.

Existing image formats by design do not address many of these issues, which is why FVD was invented (http://wiki.qemu.org/Features/FVD). Let's look at these issues one by one.

Regarding I1: Data fragmentation caused by an image format.

This problem exists in most image formats (including QCOW2, QED, VMDK, VDI, VHD, etc.), because they insist on doing storage allocation a second time at the image level, even though the host file system already does storage allocation. These formats unnecessarily mix the function of storage allocation with the function of copy-on-write, i.e., they determine whether a cluster is dirty by checking whether it has storage space allocated at the image level. This is wrong: storage allocation and tracking dirty clusters are two separate functions. Data fragmentation at the image level can be avoided entirely by using a RAW image plus a bitmap header that indicates whether clusters are dirty due to copy-on-write. FVD can be configured to take this approach, although it can also be configured to do storage allocation. Storage allocation at the image level can be optional, but it should never be mandatory.

Regarding I2: Overhead in reading an image format's metadata from disk.

Obviously, the solution is to make the metadata small enough to be cached entirely in memory. In this respect, QCOW1/QCOW2/QED and the VMware Workstation version of VMDK are wrong, while VirtualBox VDI, Microsoft VHD, and the VMware ESX Server version of VMDK are right. With QCOW1/QCOW2/QED, the metadata for a 1TB virtual disk is at least 128MB. By contrast, with VDI the metadata for a 1TB virtual disk is only 4MB. The "wrong" formats all use a two-level lookup table to do storage allocation at a small granularity (e.g., 64KB), whereas the "right" formats all use a one-level lookup table to do storage allocation at a large granularity (1MB or 2MB). The one-level table is also easier to implement. Note that VMware VMDK started out wrong in the Workstation version and was then corrected in the ESX Server version, which was a good move. As virtual disks grow bigger, the storage allocation unit is likely to increase further in the future, e.g., to 10MB or even larger.
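To make the 128MB vs. 4MB comparison concrete, here is a back-of-the-envelope calculation in C. This is only an illustration, not code from any driver; the 8-byte-entry/64KB-cluster and 4-byte-entry/1MB-chunk figures are assumptions chosen to match the numbers quoted above.

    #include <stdio.h>

    int main(void)
    {
        unsigned long long disk = 1ULL << 40;              /* 1TB virtual disk */

        /* Two-level scheme (QCOW2-like): one 8-byte L2 entry per 64KB cluster. */
        unsigned long long two_level = disk / (64 * 1024) * 8;

        /* One-level scheme (VDI-like): one 4-byte table entry per 1MB chunk. */
        unsigned long long one_level = disk / (1024 * 1024) * 4;

        printf("two-level table: %llu MB\n", two_level >> 20);   /* prints 128 */
        printf("one-level table: %llu MB\n", one_level >> 20);   /* prints 4   */
        return 0;
    }

The ratio is what matters: an allocation unit 16 times smaller, with entries twice as large, gives a table 32 times bigger.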
In existing image formats, one limitation of a large storage allocation unit is that it forces copy-on-write to be performed on an equally large cluster (e.g., 10MB in the future), which is undesirable. FVD gets the best of both worlds: it uses a one-level table to perform storage allocation at a large granularity, but uses a bitmap to track copy-on-write at a smaller granularity. For a 1TB virtual disk, this approach needs only 6MB of metadata (roughly a 4MB one-level table plus a 2MB copy-on-write bitmap), slightly larger than VDI's 4MB.

Regarding I3: Overhead in writing an image format's metadata to disk.

This is where the "dirty bit" discussion fits, but FVD goes well beyond that to reduce metadata updates. When an FVD image is fully optimized (e.g., the one-level lookup table is disabled and the base image is reduced to its minimum size), FVD has almost zero metadata update overhead and the data layout is just like a RAW image. More specifically, metadata updates are skipped, delayed, batched, or merged as much as possible without compromising data integrity. First, even with cache=writethrough (i.e., O_DSYNC), all metadata updates are sequential writes to FVD's journal, which can be merged into a single write by the host Linux kernel. Second, with cache!=writethrough, metadata updates are batched and sent to the journal on a flush, under memory pressure, or periodically, much like the kernel's page cache (a toy sketch of this batching idea appears below). Third, FVD's table can (preferably) be disabled, in which case it incurs no update overhead at all; even when the table is enabled, FVD's chunk is much larger than QCOW2/QED's cluster and hence needs fewer updates. Finally, although QCOW2/QED and FVD use the same block/cluster size, FVD can be optimized to eliminate most bitmap updates with several techniques: A) use resize2fs to reduce the base image to its minimum size (which is what a Cloud can do), so that most writes occur at locations beyond the size of the base image and need no bitmap update; B) have 'qemu-img create' find zero-filled sectors in a sparse base image and preset the corresponding bits of the bitmap, which then require no runtime updates; and C) copy-on-read and prefetching do not update the bitmap, and once prefetching finishes there is no need for FVD to read or write the bitmap at all. Again, when an FVD image is fully optimized (e.g., the table is disabled and the base image is reduced to its minimum size), FVD has almost zero metadata update overhead and the data layout is just like a RAW image.

Regarding I4: Inefficiency in the block driver, e.g., synchronous metadata reads and writes.

Today, FVD is the only fully asynchronous, nonblocking COW driver implemented for QEMU, and it has the best performance. This is partially due to its simple design: the one-level table is easier to implement than a two-level table, and the journal avoids the sophisticated locking that would otherwise be required for metadata updates. FVD also parallelizes I/O to the maximum degree possible. For example, if processing a VM-generated read request requires reading data from the base image as well as several non-contiguous chunks in the FVD image, FVD issues all of those I/O requests in parallel rather than sequentially.

Regarding H1, H2, and H3: fragmentation and metadata reads/writes caused by the host file system.

FVD can optionally be configured to bypass the host file system and store an image directly on a logical volume.
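Here is the toy sketch of the batching idea from I3, promised above. It is not FVD's actual journal format; the record layout and function names are invented purely to show the shape of the I/O: updates to scattered metadata locations accumulate in memory and then go out to a sequential journal region in a single write.

    #include <stdint.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Toy journal record: "the metadata state of this chunk changed".
     * Layout and names are invented for illustration only. */
    struct journal_record {
        uint32_t chunk_index;
        uint32_t new_state;           /* e.g., new table entry or bitmap state */
    };

    #define BATCH_MAX 512
    static struct journal_record batch[BATCH_MAX];
    static int batch_len;

    /* Queue one metadata update in memory; no disk I/O happens here. */
    static void journal_append(uint32_t chunk_index, uint32_t new_state)
    {
        if (batch_len == BATCH_MAX) {
            return;                   /* a real driver would flush here */
        }
        batch[batch_len].chunk_index = chunk_index;
        batch[batch_len].new_state = new_state;
        batch_len++;
    }

    /* Called on a guest flush, under memory pressure, or periodically:
     * all pending updates go out with one sequential write. */
    static int journal_flush(int fd, off_t journal_offset)
    {
        ssize_t len = (ssize_t)(batch_len * sizeof(batch[0]));
        if (pwrite(fd, batch, len, journal_offset) != len) {
            return -1;
        }
        if (fdatasync(fd) != 0) {
            return -1;
        }
        batch_len = 0;
        return 0;
    }

    int main(void)
    {
        int fd = open("toy-journal.img", O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
            return 1;
        }
        journal_append(7, 1);         /* chunk 7 became dirty */
        journal_append(9, 1);         /* chunk 9 became dirty */
        journal_flush(fd, 0);         /* two updates, one write + one sync */
        close(fd);
        return 0;
    }

The only point of the sketch is the I/O pattern: many small metadata updates become one sequential append, which the host kernel can service far more cheaply than scattered in-place writes.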
Coming back to the logical-volume approach: it seems straightforward, but a naïve solution like the one currently in QCOW2 would not achieve storage thin provisioning (i.e., storage over-commit), as the logical volume would need to be allocated up front at the full size of the image. FVD supports thin provisioning on a logical volume by starting with a small volume and growing it automatically as needed. It is quite easy for FVD to track the amount of used space without updating a size field in the image header on every storage allocation (which is a problem in VDI). Multiple efficient solutions are possible in FVD. One is to piggyback the size field on the journal entry that records a new storage allocation. Alternatively, even an 'fsck'-like scan of FVD's one-level lookup table to figure out the used space is trivial: because the table is only 4MB for a 1TB virtual disk and is contiguous in the image, a scan takes only about 20 milliseconds, roughly 15 milliseconds to load 4MB from disk and less than 5 milliseconds to scan 4MB in memory. This is more efficient than a dirty bit in QCOW2 or QED.

In summary, it seems that people's imagination for QCOW3 is unfortunately limited by their overwhelming experience with QCOW2, without even looking at what VirtualBox VDI, VMware VMDK, and Microsoft VHD have done, let alone going beyond all of those to the next level. Regardless of its name, I hope QCOW3 will take the right actions to fix what is wrong in QCOW2, including:

A1: Abandon the two-level table and adopt a one-level table, as in VDI, VMDK, and VHD, for simplicity and a much smaller metadata size.

A2: Introduce a bitmap to allow copy-on-write without doing storage allocation, which 1) avoids image-level fragmentation, 2) eliminates the metadata update overhead of storage allocation, and 3) allows copy-on-write to be performed on a smaller storage unit (64KB) while still keeping the metadata very small.

A3: Introduce a journal to batch and merge metadata updates and to reduce fsck recovery time after a host crash.

This is exactly the process by which I arrived at the design of FVD. It was not by chance, but by taking a holistic approach to analyzing the problems of a virtual disk. I think the status of "QCOW3" today is comparable to FVD's status 10 months ago, when the design started to emerge, but FVD's implementation today is very mature. It is the only asynchronous, nonblocking COW driver implemented for QEMU, with undoubtedly the best performance, both by design and by implementation.

Now let's talk about features. There seems to be great interest in QCOW2's internal snapshot feature. If we really want that, the right solution is to follow VMDK's approach of storing each snapshot as a separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk), rather than using a reference count table. VMDK's approach can easily be implemented for any COW format, or even as a function of the generic block layer, without complicating any COW format or hurting its performance. I know these snapshots are not really "internal", since they are not stored in a single file, and are more like external snapshots, but users don't care about that as long as the same use cases are supported. Probably many people who use VMware don't even know that their snapshots are stored as separate files. Do they care?

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
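P.S. For concreteness, here is a tiny sketch of the 'fsck'-like scan mentioned above. This is not FVD code; it simply assumes a contiguous one-level table of 4-byte entries where a nonzero entry means the corresponding 1MB chunk has storage allocated, which matches the sizes quoted in this message.

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define CHUNK_SIZE  (1024 * 1024)   /* 1MB allocation unit                   */
    #define NUM_CHUNKS  (1024 * 1024)   /* 1TB disk / 1MB per chunk; table = 4MB */

    /* Scan the in-memory copy of the one-level table: a nonzero entry is
     * assumed to mean that the chunk has storage allocated in the image. */
    static uint64_t used_space(const uint32_t *table)
    {
        uint64_t used = 0;
        for (size_t i = 0; i < NUM_CHUNKS; i++) {
            if (table[i] != 0) {
                used += CHUNK_SIZE;
            }
        }
        return used;
    }

    int main(void)
    {
        /* Pretend the 4MB table was just loaded from the image. */
        uint32_t *table = calloc(NUM_CHUNKS, sizeof(*table));
        if (table == NULL) {
            return 1;
        }
        table[0] = 1;                   /* mark two chunks as allocated */
        table[123] = 1;
        printf("used space: %llu MB\n",
               (unsigned long long)(used_space(table) >> 20));  /* prints 2 */
        free(table);
        return 0;
    }

Summing the table this way yields the used space directly, so no size field needs to be maintained in the image header at all.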