> Based on my limited understanding, I think FVD shares a lot in common
> with the COW format (block/cow.c).
>
> But I think most of the advantages you mention could be considered as
> additions to either qcow2 or qed. At any rate, the right way to have
> that discussion is in the form of patches on the ML.
FVD is much more advanced than block/cow.c. I would be happy to discuss possible leverage, but setting aside the details of QCOW2, QED, and FVD, let's start with a discussion of what is needed in a next-generation image format.

First of all, of course, we need high performance. Through extensive benchmarking, I identified three major performance overheads in image formats. The numbers cited below are based on the PostMark benchmark; see the paper for more details: http://researcher.watson.ibm.com/researcher/files/us-ctang/FVD-cow.pdf

P1) Increased disk seek distance caused by a compact image's distorted data layout. Specifically, the average disk seek distance in QCOW2 is 460% longer than that in a RAW image.

P2) Overhead of storing an image on a host file system. Specifically, a RAW image stored on ext3 is 50-63% slower than a RAW image stored on a raw partition.

P3) Overhead in reading or updating an image format's on-disk metadata. Due to this overhead, QCOW2 causes 45% more total disk I/Os (including I/Os for accessing both data and metadata) than FVD does.

For P1), I use the term "compact image" instead of "sparse image", because a RAW image stored as a sparse file on ext3 is a sparse image but not a compact image. A compact image stores data in such a way that the image file is smaller than the virtual disk perceived by the VM. QCOW2 is a compact image. The disadvantage of a compact image is that the data layout perceived by the guest OS differs from the actual layout on the physical disk, which defeats many optimizations in guest file systems.

Consider one concrete example. When the guest VM issues a disk I/O request to the hypervisor using a virtual block address (VBA), QEMU's block device driver translates the VBA into an image block address (IBA), which specifies where the requested data are stored in the image file, i.e., the IBA is an offset into the image file. When a guest OS creates or resizes a file system, it writes out the file system metadata, which are all grouped together and assigned consecutive IBAs by QCOW2, despite the fact that the metadata's VBAs are deliberately scattered for better reliability and locality, e.g., co-locating inodes and file content blocks in block groups. As a result, there can be long disk seeks between accessing a file's metadata and accessing the file's content blocks.

For P2), using a host file system is inefficient because 1) historically, file systems are optimized for small files rather than for large images, and 2) certain functions of a host file system are simply redundant with those of a compact image, e.g., performing storage allocation. Moreover, using a host file system not only adds overhead but also introduces data integrity issues. Specifically, if I/Os use O_DSYNC, they may be too slow; if I/Os use O_DIRECT, data integrity cannot be guaranteed in the event of a host crash. See http://lwn.net/Articles/348739/ .

For P3), the overhead includes both reading the on-disk metadata and updating it. The former can be reduced by minimizing the size of the metadata so that it can be easily cached in memory. Reducing the latter requires optimizations that avoid updating the on-disk metadata whenever possible, without compromising data integrity in the event of a host crash.
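To make the VBA-to-IBA translation concrete, below is a deliberately simplified C sketch of the allocate-on-write path in a generic compact image. The structure (a flat one-level table) and all the names are mine and purely illustrative; this is neither QCOW2's nor FVD's actual code. The point is only that IBAs are handed out in first-write order rather than in VBA order, which is what distorts the data layout in P1).

#include <stdint.h>

#define CHUNK_SIZE   (1ULL << 20)   /* 1MB allocation unit, illustrative only */
#define UNALLOCATED  UINT64_MAX

struct compact_image {
    uint64_t *table;       /* one entry per virtual chunk: chunk index -> IBA */
    uint64_t  next_free;   /* next free offset at the end of the image file   */
};

/*
 * Translate a virtual block address (VBA) into an image block address (IBA).
 * A chunk receives its IBA the first time the guest writes to it, so chunks
 * are laid out in first-write order, not in VBA order; the layout the guest
 * file system optimizes for is not the layout that ends up on the disk.
 */
uint64_t vba_to_iba(struct compact_image *img, uint64_t vba)
{
    uint64_t chunk = vba / CHUNK_SIZE;

    if (img->table[chunk] == UNALLOCATED) {
        img->table[chunk] = img->next_free;   /* append-only allocation */
        img->next_free   += CHUNK_SIZE;
    }
    return img->table[chunk] + vba % CHUNK_SIZE;
}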
In addition to addressing the performance overheads P1-P3, the next-generation image format should ideally meet the following functional requirements, and perhaps more.

R1) Support storage over-commit.

R2) Support compact image, copy-on-write, copy-on-read, and adaptive prefetching.

R3) Allow eliminating the host file system in order to achieve high performance.

R4) Make all these features orthogonal, i.e., each feature can be enabled or disabled individually without affecting the others. The purpose is to support diverse use cases. For example, a copy-on-write image can use a RAW-image-like data layout to avoid the overhead associated with a compact image.

Storage over-commit means that, e.g., a 100GB physical disk can be used to host 10 VMs, each with a 20GB virtual disk. This is possible because not every VM completely fills up its 20GB virtual disk. A compact image is not mandatory for storage over-commit; for example, RAW images stored as sparse files on ext3 also support storage over-commit.

Copy-on-read and adaptive prefetching complement copy-on-write in certain use cases, e.g., in a Cloud where the backing image is stored on network-attached storage (NAS) while the copy-on-write image is stored on direct-attached storage (DAS). When the VM reads a block from the backing image, a copy of the data is saved in the copy-on-write image for later reuse. Adaptive prefetching uses resource idle times to copy, from NAS to DAS, the parts of the image that the VM has not accessed yet. Prefetching should be conservative: if the driver detects contention on any resource (DAS, NAS, or the network), it pauses prefetching temporarily and resumes later when the contention disappears.

Next, let me briefly describe how FVD is designed to address the performance issues P1-P3 and the functional requirements R1-R4. FVD has the following features.

F1) Use a bitmap to implement copy-on-write.

F2) Use a one-level lookup table to implement a compact image.

F3) Use a journal to commit changes to the bitmap and the lookup table.

F4) Store a compact image on a logical volume to support storage over-commit, and to avoid the overhead and data integrity issues of a host file system.

For F1), a bit in the bitmap tracks the state of a block: the bit is 0 if the block is in the base image, and 1 if the block is in the FVD image. The default block size is 64KB, the same as in QCOW2. To represent the state of a 1TB base image, FVD only needs a 2MB bitmap, which can easily be cached in memory. This bitmap also implements copy-on-read and adaptive prefetching.

For F2), one entry in the table maps the virtual disk address of a chunk to an offset in the FVD image where the chunk is stored. The default chunk size is 1MB, the same as in VirtualBox VDI (VMware VMDK and Microsoft VHD use a chunk size of 2MB). For a 1TB virtual disk, the lookup table is only 4MB. Because of this small size, there is no need for a two-level lookup table like the one in QCOW2.

F1) and F2) are essential. They meet requirement R4), i.e., the copy-on-write and compact-image features can be enabled individually.
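As a quick sanity check on the metadata sizes quoted above, the following small C program reproduces the arithmetic. The 4-byte table entry size is my assumption, implied by the 4MB figure for a 1TB disk; everything else uses the defaults described above.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t disk_size  = 1ULL << 40;   /* 1TB virtual disk                   */
    uint64_t block_size = 64ULL << 10;  /* F1) one bitmap bit per 64KB block  */
    uint64_t chunk_size = 1ULL << 20;   /* F2) one table entry per 1MB chunk  */
    uint64_t entry_size = 4;            /* assumed 4-byte entries (see above) */

    uint64_t bitmap_bytes = disk_size / block_size / 8;
    uint64_t table_bytes  = disk_size / chunk_size * entry_size;

    printf("bitmap: %" PRIu64 " MB\n", bitmap_bytes >> 20);   /* prints 2 */
    printf("table:  %" PRIu64 " MB\n", table_bytes  >> 20);   /* prints 4 */
    return 0;
}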
F1) and F2) are closest to the Microsoft Virtual Hard Disk (VHD) format, which also uses a bitmap and a one-level table. There are some key differences, though. VHD partitions the bitmap and stores a fragment of the bitmap with every 2MB chunk. As a result, VHD does not meet requirement R4, because it cannot provide a copy-on-write image with a RAW-image-like data layout. Also because of that, a bit in VHD can only represent the state of a 512-byte sector (if a bit represented a 64KB block, the chunk size would have to be 2GB, which is far too large and makes storage over-commit ineffective). For a 1TB image, the bitmap in VHD is 256MB, vs. 2MB in FVD, which makes caching much harder.

F3) uses a journal to commit metadata updates. The journal is not essential, and there are alternative implementations, but it does help address P3) (i.e., it reduces the metadata update overhead) and it simplifies the implementation. By default, the journal is 16MB. When the bitmap and/or the lookup table are updated by a write, the changes are saved in the journal. When the journal is full, the entire bitmap and the entire lookup table are flushed to disk, and the journal can be recycled for reuse. Because the bitmap and the lookup table are small, the flush is quick.

The journal provides several benefits. First, updating both the bitmap and the lookup table requires only a single write to the journal. Second, K concurrent updates to any portions of the bitmap or the lookup table are converted into K sequential writes in the journal, which can be merged into a single write by the host Linux kernel. Third, it increases concurrency by avoiding locking of the bitmap or the lookup table. For example, updating one bit in the bitmap requires writing a 512-byte sector to the on-disk bitmap, and that sector covers a total of 512*8*64KB = 256MB of data. Without the journal, any two writes that fall within that 256MB and require updating the bitmap could not be processed concurrently. The journal solves this problem and eliminates the locking.
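To make the journal mechanism more concrete, here is a minimal sketch of the commit path, assuming an in-memory bitmap and lookup table that are flushed in bulk. The record layout, the function names, and the no-op I/O stubs are hypothetical; this is not FVD's actual code.

#include <stddef.h>
#include <stdint.h>

#define JOURNAL_SIZE  (16 * 1024 * 1024)   /* 16MB journal, the default above */

/* One journal record: bundles a lookup-table update with the related
 * bitmap update so that both travel in a single sequential write. */
struct journal_record {
    uint64_t chunk;         /* which chunk's table entry changed      */
    uint64_t new_iba;       /* its new location in the image file     */
    uint64_t bitmap_word;   /* index of the affected bitmap word      */
    uint64_t bitmap_bits;   /* new value of that bitmap word          */
};

struct fvd_meta {
    uint8_t  *bitmap;         /* in-memory copy, 2MB for a 1TB disk    */
    uint64_t *table;          /* in-memory copy, 4MB for a 1TB disk    */
    uint64_t  journal_used;   /* bytes consumed in the on-disk journal */
};

/* No-op stand-ins for the real disk I/O, so the sketch is self-contained. */
static void append_to_journal(struct fvd_meta *m, const void *rec, size_t len)
{
    (void)m; (void)rec; (void)len;   /* real code would write to the journal */
}
static void flush_bitmap_and_table(struct fvd_meta *m)
{
    (void)m;   /* real code would write out the whole bitmap and table */
}

/*
 * Commit one metadata change. The bitmap change and the table change go
 * into one journal append, and concurrent commits become back-to-back
 * sequential writes, so no on-disk bitmap or table sector is rewritten
 * in place on the hot path and no sector-level locking is needed.
 */
void commit_metadata(struct fvd_meta *m, const struct journal_record *rec)
{
    if (m->journal_used + sizeof(*rec) > JOURNAL_SIZE) {
        /* Journal full: flush the small bitmap and table entirely,
         * then recycle the journal from the beginning. */
        flush_bitmap_and_table(m);
        m->journal_used = 0;
    }
    append_to_journal(m, rec, sizeof(*rec));
    m->journal_used += sizeof(*rec);
}

The sketch is only meant to show why one journal append replaces two in-place sector updates, and why the 256MB-wide serialization problem described above disappears.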
For F4), it is actually quite straightforward to eliminate the host file system. The main thing an image format needs from the host file system is storage allocation, and that function is already performed by a compact image. Using a host file system simply ends up doing storage allocation twice, which updates on-disk metadata twice and introduces a distorted data layout twice. Therefore, if we migrate the necessary functions of a host file system into the image format, in other words, implement a mini file system inside the image format, we can get rid of the host file system altogether. This is exactly what FVD does, by slightly enhancing the compact-image function that is already there.

FVD can manage incrementally added storage space, like ZFS and unlike ext2/3/4. For example, when FVD manages a 100GB virtual disk, it initially gets 5GB of storage space from the logical volume manager and uses it to host many 1MB chunks. When the first 5GB is used up, FVD gets another 5GB to host more 1MB chunks, and so forth. Unlike QCOW2 and more like a file system, FVD does not always have to allocate a new chunk right after the previously allocated one. Instead, it may spread the used chunks across the available storage space in order to mimic a RAW-image-like data layout. More details will be explained in follow-up emails.

The description above is long, but it is still only a summary. Please refer to the more detailed information on the project web site: http://researcher.watson.ibm.com/researcher/view_project.php?id=1852 . Hopefully I have given a useful overview of the problems, the requirements, and the solutions in FVD, which can serve as the basis for a productive discussion.

Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang