> In any case, the next step is to get down to specifics. Here is the
> page with the current QCOW3 roadmap:
>
> http://wiki.qemu.org/Qcow3_Roadmap
>
> Please raise concrete requirements or features so they can be
> discussed and captured.
Now it turns into a more productive discussion, but it seems to lose the big picture too quickly and go too narrowly into issues like the "dirty bit". Let's try to answer a bigger question: how do we take a holistic approach to address all the factors that make a virtual disk slower than a physical disk? Even if issues like the "dirty bit" are addressed perfectly, they may still be only a small part of the total solution. The discussion of internal snapshots is at the end of this email.

Compared with a physical disk, a virtual disk (even RAW) incurs some or all of the following overheads. Obviously, the way to achieve high performance is to eliminate or reduce these overheads.

Overhead at the image level:

I1: Data fragmentation caused by an image format.
I2: Overhead in reading an image format's metadata from disk.
I3: Overhead in writing an image format's metadata to disk.
I4: Inefficiency and complexity in the block driver implementation, e.g., waiting synchronously for metadata reads or writes, submitting I/O requests sequentially when they should be issued concurrently, performing unnecessary flushes, etc.

Overhead at the host file system level:

H1: Data fragmentation caused by the host file system.
H2: Overhead in reading the host file system's metadata.
H3: Overhead in writing the host file system's metadata.

Existing image formats by design do not address many of these issues, which is why FVD was invented (http://wiki.qemu.org/Features/FVD). Let's look at these issues one by one.

Regarding I1: Data fragmentation caused by an image format.

This problem exists in most image formats (including QCOW2, QED, VMDK, VDI, VHD, etc.), because they insist on doing storage allocation a second time at the image level, even though the host file system already does storage allocation. These formats unnecessarily mix the function of storage allocation with the function of copy-on-write, i.e., they determine whether a cluster is dirty by checking whether it has storage space allocated at the image level. This is wrong: storage allocation and tracking dirty clusters are two separate functions. Data fragmentation at the image level can be avoided entirely by using a RAW image plus a bitmap header that indicates whether clusters are dirty due to copy-on-write. FVD can be configured to take this approach, although it can also be configured to do storage allocation. Storage allocation at the image level can be optional, but it should never be mandatory.

Regarding I2: Overhead in reading an image format's metadata from disk.

Obviously, the solution is to make the metadata small enough to be cached entirely in memory. In this respect, QCOW1/QCOW2/QED and the VMware Workstation version of VMDK are wrong, while VirtualBox VDI, Microsoft VHD, and the VMware ESX Server version of VMDK are right. With QCOW1/QCOW2/QED, the metadata for a 1TB virtual disk is at least 128MB. By contrast, with VDI the metadata for a 1TB virtual disk is only 4MB. The "wrong" formats all use a two-level lookup table to do storage allocation at a small granularity (e.g., 64KB), whereas the "right" formats all use a one-level lookup table to do storage allocation at a large granularity (1MB or 2MB). The one-level table is also easier to implement. Note that VMware VMDK started out wrong in the Workstation version and was then corrected in the ESX Server version, which was a good move. As virtual disks grow bigger, the storage allocation unit is likely to increase further in the future, e.g., to 10MB or even larger.
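To make the 128MB vs. 4MB comparison concrete, here is a back-of-the-envelope calculation in C. This is only an illustration, not code from any driver; the 8-byte-entry/64KB-cluster and 4-byte-entry/1MB-chunk figures are assumptions chosen to match the numbers quoted above.

    #include <stdio.h>

    int main(void)
    {
        unsigned long long disk = 1ULL << 40;              /* 1TB virtual disk */

        /* Two-level scheme (QCOW2-like): one 8-byte L2 entry per 64KB cluster. */
        unsigned long long two_level = disk / (64 * 1024) * 8;

        /* One-level scheme (VDI-like): one 4-byte table entry per 1MB chunk. */
        unsigned long long one_level = disk / (1024 * 1024) * 4;

        printf("two-level table: %llu MB\n", two_level >> 20);   /* prints 128 */
        printf("one-level table: %llu MB\n", one_level >> 20);   /* prints 4   */
        return 0;
    }

The ratio is what matters: an allocation unit 16 times smaller, with entries twice as large, gives a table 32 times bigger.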
In existing image formats, one limitation of a large storage allocation unit is that it forces copy-on-write to be performed on an equally large cluster (e.g., 10MB in the future), which is undesirable. FVD gets the best of both worlds: it uses a one-level table to perform storage allocation at a large granularity, but uses a bitmap to track copy-on-write at a smaller granularity. For a 1TB virtual disk, this approach needs only 6MB of metadata (roughly a 4MB one-level table plus a 2MB copy-on-write bitmap), slightly larger than VDI's 4MB.

Regarding I3: Overhead in writing an image format's metadata to disk.

This is where the "dirty bit" discussion fits, but FVD goes well beyond that to reduce metadata updates. When an FVD image is fully optimized (e.g., the one-level lookup table is disabled and the base image is reduced to its minimum size), FVD has almost zero metadata update overhead and the data layout is just like a RAW image. More specifically, metadata updates are skipped, delayed, batched, or merged as much as possible without compromising data integrity. First, even with cache=writethrough (i.e., O_DSYNC), all metadata updates are sequential writes to FVD's journal, which can be merged into a single write by the host Linux kernel. Second, with cache!=writethrough, metadata updates are batched and sent to the journal on a flush, under memory pressure, or periodically, much like the kernel's page cache (a toy sketch of this batching idea appears below). Third, FVD's table can (preferably) be disabled, in which case it incurs no update overhead at all; even when the table is enabled, FVD's chunk is much larger than QCOW2/QED's cluster and hence needs fewer updates. Finally, although QCOW2/QED and FVD use the same block/cluster size, FVD can be optimized to eliminate most bitmap updates with several techniques: A) use resize2fs to reduce the base image to its minimum size (which is what a Cloud can do), so that most writes occur at locations beyond the size of the base image and need no bitmap update; B) have 'qemu-img create' find zero-filled sectors in a sparse base image and preset the corresponding bits of the bitmap, which then require no runtime updates; and C) copy-on-read and prefetching do not update the bitmap, and once prefetching finishes there is no need for FVD to read or write the bitmap at all. Again, when an FVD image is fully optimized (e.g., the table is disabled and the base image is reduced to its minimum size), FVD has almost zero metadata update overhead and the data layout is just like a RAW image.

Regarding I4: Inefficiency in the block driver, e.g., synchronous metadata reads and writes.

Today, FVD is the only fully asynchronous, nonblocking COW driver implemented for QEMU, and it has the best performance. This is partially due to its simple design: the one-level table is easier to implement than a two-level table, and the journal avoids the sophisticated locking that would otherwise be required for metadata updates. FVD also parallelizes I/O to the maximum degree possible. For example, if processing a VM-generated read request requires reading data from the base image as well as several non-contiguous chunks in the FVD image, FVD issues all of those I/O requests in parallel rather than sequentially.

Regarding H1, H2, and H3: fragmentation and metadata reads/writes caused by the host file system.

FVD can optionally be configured to bypass the host file system and store an image directly on a logical volume.
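Here is the toy sketch of the batching idea from I3, promised above. It is not FVD's actual journal format; the record layout and function names are invented purely to show the shape of the I/O: updates to scattered metadata locations accumulate in memory and then go out to a sequential journal region in a single write.

    #include <stdint.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Toy journal record: "the metadata state of this chunk changed".
     * Layout and names are invented for illustration only. */
    struct journal_record {
        uint32_t chunk_index;
        uint32_t new_state;           /* e.g., new table entry or bitmap state */
    };

    #define BATCH_MAX 512
    static struct journal_record batch[BATCH_MAX];
    static int batch_len;

    /* Queue one metadata update in memory; no disk I/O happens here. */
    static void journal_append(uint32_t chunk_index, uint32_t new_state)
    {
        if (batch_len == BATCH_MAX) {
            return;                   /* a real driver would flush here */
        }
        batch[batch_len].chunk_index = chunk_index;
        batch[batch_len].new_state = new_state;
        batch_len++;
    }

    /* Called on a guest flush, under memory pressure, or periodically:
     * all pending updates go out with one sequential write. */
    static int journal_flush(int fd, off_t journal_offset)
    {
        ssize_t len = (ssize_t)(batch_len * sizeof(batch[0]));
        if (pwrite(fd, batch, len, journal_offset) != len) {
            return -1;
        }
        if (fdatasync(fd) != 0) {
            return -1;
        }
        batch_len = 0;
        return 0;
    }

    int main(void)
    {
        int fd = open("toy-journal.img", O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
            return 1;
        }
        journal_append(7, 1);         /* chunk 7 became dirty */
        journal_append(9, 1);         /* chunk 9 became dirty */
        journal_flush(fd, 0);         /* two updates, one write + one sync */
        close(fd);
        return 0;
    }

The only point of the sketch is the I/O pattern: many small metadata updates become one sequential append, which the host kernel can service far more cheaply than scattered in-place writes.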
Coming back to the logical-volume approach: it seems straightforward, but a naïve solution like the one currently in QCOW2 would not achieve storage thin provisioning (i.e., storage over-commit), as the logical volume would need to be allocated up front at the full size of the image. FVD supports thin provisioning on a logical volume by starting with a small volume and growing it automatically as needed. It is quite easy for FVD to track the amount of used space without updating a size field in the image header on every storage allocation (which is a problem in VDI). Multiple efficient solutions are possible in FVD. One is to piggyback the size field on the journal entry that records a new storage allocation. Alternatively, even an 'fsck'-like scan of FVD's one-level lookup table to figure out the used space is trivial: because the table is only 4MB for a 1TB virtual disk and is contiguous in the image, a scan takes only about 20 milliseconds, roughly 15 milliseconds to load 4MB from disk and less than 5 milliseconds to scan 4MB in memory. This is more efficient than a dirty bit in QCOW2 or QED.

In summary, it seems that people's imagination for QCOW3 is unfortunately limited by their overwhelming experience with QCOW2, without even looking at what VirtualBox VDI, VMware VMDK, and Microsoft VHD have done, let alone going beyond all of those to the next level. Regardless of its name, I hope QCOW3 will take the right actions to fix what is wrong in QCOW2, including:

A1: Abandon the two-level table and adopt a one-level table, as in VDI, VMDK, and VHD, for simplicity and a much smaller metadata size.

A2: Introduce a bitmap to allow copy-on-write without doing storage allocation, which 1) avoids image-level fragmentation, 2) eliminates the metadata update overhead of storage allocation, and 3) allows copy-on-write to be performed on a smaller storage unit (64KB) while still keeping the metadata very small.

A3: Introduce a journal to batch and merge metadata updates and to reduce fsck recovery time after a host crash.

This is exactly the process by which I arrived at the design of FVD. It was not by chance, but by taking a holistic approach to analyzing the problems of a virtual disk. I think the status of "QCOW3" today is comparable to FVD's status 10 months ago, when the design started to emerge, but FVD's implementation today is very mature. It is the only asynchronous, nonblocking COW driver implemented for QEMU, with undoubtedly the best performance, both by design and by implementation.

Now let's talk about features. There seems to be great interest in QCOW2's internal snapshot feature. If we really want that, the right solution is to follow VMDK's approach of storing each snapshot as a separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk), rather than using a reference count table. VMDK's approach can easily be implemented for any COW format, or even as a function of the generic block layer, without complicating any COW format or hurting its performance. I know these snapshots are not really "internal", since they are not stored in a single file, and are more like external snapshots, but users don't care about that as long as the same use cases are supported. Probably many people who use VMware don't even know that their snapshots are stored as separate files. Do they care?

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
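P.S. For concreteness, here is a tiny sketch of the 'fsck'-like scan mentioned above. This is not FVD code; it simply assumes a contiguous one-level table of 4-byte entries where a nonzero entry means the corresponding 1MB chunk has storage allocated, which matches the sizes quoted in this message.

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define CHUNK_SIZE  (1024 * 1024)   /* 1MB allocation unit                   */
    #define NUM_CHUNKS  (1024 * 1024)   /* 1TB disk / 1MB per chunk; table = 4MB */

    /* Scan the in-memory copy of the one-level table: a nonzero entry is
     * assumed to mean that the chunk has storage allocated in the image. */
    static uint64_t used_space(const uint32_t *table)
    {
        uint64_t used = 0;
        for (size_t i = 0; i < NUM_CHUNKS; i++) {
            if (table[i] != 0) {
                used += CHUNK_SIZE;
            }
        }
        return used;
    }

    int main(void)
    {
        /* Pretend the 4MB table was just loaded from the image. */
        uint32_t *table = calloc(NUM_CHUNKS, sizeof(*table));
        if (table == NULL) {
            return 1;
        }
        table[0] = 1;                   /* mark two chunks as allocated */
        table[123] = 1;
        printf("used space: %llu MB\n",
               (unsigned long long)(used_space(table) >> 20));  /* prints 2 */
        free(table);
        return 0;
    }

Summing the table this way yields the used space directly, so no size field needs to be maintained in the image header at all.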