> Based on my limited understanding, I think FVD shares a lot in common
> with the COW format (block/cow.c).
>
> But I think most of the advantages you mention could be considered as
> additions to either qcow2 or qed. At any rate, the right way to have
> that discussion is in the form of patches on the ML.
FVD is much more advanced than block/cow.c. I would be happy to discuss possible leverage, but setting aside the details of QCOW2, QED, and FVD, let's start with a discussion of what is needed in a next-generation image format.

First of all, of course, we need high performance. Through extensive benchmarking, I identified three major performance overheads in image formats. The numbers cited below are based on the PostMark benchmark; see the paper for more details: http://researcher.watson.ibm.com/researcher/files/us-ctang/FVD-cow.pdf

P1) Increased disk seek distance caused by a compact image's distorted data layout. Specifically, the average disk seek distance in QCOW2 is 460% longer than that in a RAW image.

P2) Overhead of storing an image on a host file system. Specifically, a RAW image stored on ext3 is 50-63% slower than a RAW image stored on a raw partition.

P3) Overhead in reading or updating an image format's on-disk metadata. Due to this overhead, QCOW2 causes 45% more total disk I/Os (including I/Os for accessing both data and metadata) than FVD does.

For P1), I use the term "compact image" instead of "sparse image", because a RAW image stored as a sparse file on ext3 is a sparse image but not a compact image. A compact image stores data in such a way that the image file is smaller than the virtual disk perceived by the VM. QCOW2 is a compact image. The disadvantage of a compact image is that the data layout perceived by the guest OS differs from the actual layout on the physical disk, which defeats many optimizations in guest file systems.

Consider one concrete example. When the guest VM issues a disk I/O request to the hypervisor using a virtual block address (VBA), QEMU's block device driver translates the VBA into an image block address (IBA), which specifies where the requested data are stored in the image file, i.e., the IBA is an offset into the image file. When a guest OS creates or resizes a file system, it writes out the file system metadata, which are all grouped together and assigned consecutive IBAs by QCOW2, despite the fact that the metadata's VBAs are deliberately scattered for better reliability and locality, e.g., co-locating inodes and file content blocks in block groups. As a result, there can be long disk seeks between accessing a file's metadata and accessing the file's content blocks.

For P2), using a host file system is inefficient because 1) historically, file systems are optimized for small files rather than for large images, and 2) certain functions of a host file system are simply redundant with those of a compact image, e.g., performing storage allocation. Moreover, using a host file system not only adds overhead but also introduces data integrity issues. Specifically, if I/Os use O_DSYNC, they may be too slow; if I/Os use O_DIRECT, data integrity cannot be guaranteed in the event of a host crash. See http://lwn.net/Articles/348739/ .

For P3), the overhead includes both reading the on-disk metadata and updating it. The former can be reduced by minimizing the size of the metadata so that it can be easily cached in memory. Reducing the latter requires optimizations that avoid updating the on-disk metadata whenever possible, without compromising data integrity in the event of a host crash.
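To make the VBA-to-IBA translation concrete, below is a deliberately simplified C sketch of the allocate-on-write path in a generic compact image. The structure (a flat one-level table) and all the names are mine and purely illustrative; this is neither QCOW2's nor FVD's actual code. The point is only that IBAs are handed out in first-write order rather than in VBA order, which is what distorts the data layout in P1).

#include <stdint.h>

#define CHUNK_SIZE   (1ULL << 20)   /* 1MB allocation unit, illustrative only */
#define UNALLOCATED  UINT64_MAX

struct compact_image {
    uint64_t *table;       /* one entry per virtual chunk: chunk index -> IBA */
    uint64_t  next_free;   /* next free offset at the end of the image file   */
};

/*
 * Translate a virtual block address (VBA) into an image block address (IBA).
 * A chunk receives its IBA the first time the guest writes to it, so chunks
 * are laid out in first-write order, not in VBA order; the layout the guest
 * file system optimizes for is not the layout that ends up on the disk.
 */
uint64_t vba_to_iba(struct compact_image *img, uint64_t vba)
{
    uint64_t chunk = vba / CHUNK_SIZE;

    if (img->table[chunk] == UNALLOCATED) {
        img->table[chunk] = img->next_free;   /* append-only allocation */
        img->next_free   += CHUNK_SIZE;
    }
    return img->table[chunk] + vba % CHUNK_SIZE;
}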
In addition to addressing the performance overheads P1-P3, the next-generation image format should ideally meet the following functional requirements, and perhaps more.

R1) Support storage over-commit.

R2) Support compact image, copy-on-write, copy-on-read, and adaptive prefetching.

R3) Allow eliminating the host file system in order to achieve high performance.

R4) Make all these features orthogonal, i.e., each feature can be enabled or disabled individually without affecting the others. The purpose is to support diverse use cases. For example, a copy-on-write image can use a RAW-image-like data layout to avoid the overhead associated with a compact image.

Storage over-commit means that, e.g., a 100GB physical disk can be used to host 10 VMs, each with a 20GB virtual disk. This is possible because not every VM completely fills up its 20GB virtual disk. A compact image is not mandatory for storage over-commit; for example, RAW images stored as sparse files on ext3 also support storage over-commit.

Copy-on-read and adaptive prefetching complement copy-on-write in certain use cases, e.g., in a Cloud where the backing image is stored on network-attached storage (NAS) while the copy-on-write image is stored on direct-attached storage (DAS). When the VM reads a block from the backing image, a copy of the data is saved in the copy-on-write image for later reuse. Adaptive prefetching uses resource idle times to copy, from NAS to DAS, the parts of the image that the VM has not accessed yet. Prefetching should be conservative: if the driver detects contention on any resource (DAS, NAS, or the network), it pauses prefetching temporarily and resumes later when the contention disappears.

Next, let me briefly describe how FVD is designed to address the performance issues P1-P3 and the functional requirements R1-R4. FVD has the following features.

F1) Use a bitmap to implement copy-on-write.

F2) Use a one-level lookup table to implement a compact image.

F3) Use a journal to commit changes to the bitmap and the lookup table.

F4) Store a compact image on a logical volume to support storage over-commit, and to avoid the overhead and data integrity issues of a host file system.

For F1), a bit in the bitmap tracks the state of a block: the bit is 0 if the block is in the base image, and 1 if the block is in the FVD image. The default block size is 64KB, the same as in QCOW2. To represent the state of a 1TB base image, FVD only needs a 2MB bitmap, which can easily be cached in memory. This bitmap also implements copy-on-read and adaptive prefetching.

For F2), one entry in the table maps the virtual disk address of a chunk to an offset in the FVD image where the chunk is stored. The default chunk size is 1MB, the same as in VirtualBox VDI (VMware VMDK and Microsoft VHD use a chunk size of 2MB). For a 1TB virtual disk, the lookup table is only 4MB. Because of this small size, there is no need for a two-level lookup table like the one in QCOW2.

F1) and F2) are essential. They meet requirement R4), i.e., the copy-on-write and compact-image features can be enabled individually.
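As a quick sanity check on the metadata sizes quoted above, the following small C program reproduces the arithmetic. The 4-byte table entry size is my assumption, implied by the 4MB figure for a 1TB disk; everything else uses the defaults described above.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t disk_size  = 1ULL << 40;   /* 1TB virtual disk                   */
    uint64_t block_size = 64ULL << 10;  /* F1) one bitmap bit per 64KB block  */
    uint64_t chunk_size = 1ULL << 20;   /* F2) one table entry per 1MB chunk  */
    uint64_t entry_size = 4;            /* assumed 4-byte entries (see above) */

    uint64_t bitmap_bytes = disk_size / block_size / 8;
    uint64_t table_bytes  = disk_size / chunk_size * entry_size;

    printf("bitmap: %" PRIu64 " MB\n", bitmap_bytes >> 20);   /* prints 2 */
    printf("table:  %" PRIu64 " MB\n", table_bytes  >> 20);   /* prints 4 */
    return 0;
}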
F1) and F2) are closest to the Microsoft Virtual Hard Disk (VHD) format, which also uses a bitmap and a one-level table. There are some key differences, though. VHD partitions the bitmap and stores a fragment of the bitmap with every 2MB chunk. As a result, VHD does not meet requirement R4, because it cannot provide a copy-on-write image with a RAW-image-like data layout. Also because of that, a bit in VHD can only represent the state of a 512-byte sector (if a bit represented a 64KB block, the chunk size would have to be 2GB, which is far too large and makes storage over-commit ineffective). For a 1TB image, the bitmap in VHD is 256MB, vs. 2MB in FVD, which makes caching much harder.

F3) uses a journal to commit metadata updates. The journal is not essential, and there are alternative implementations, but it does help address P3) (i.e., it reduces the metadata update overhead) and it simplifies the implementation. By default, the journal is 16MB. When the bitmap and/or the lookup table are updated by a write, the changes are saved in the journal. When the journal is full, the entire bitmap and the entire lookup table are flushed to disk, and the journal can be recycled for reuse. Because the bitmap and the lookup table are small, the flush is quick.

The journal provides several benefits. First, updating both the bitmap and the lookup table requires only a single write to the journal. Second, K concurrent updates to any portions of the bitmap or the lookup table are converted into K sequential writes in the journal, which can be merged into a single write by the host Linux kernel. Third, it increases concurrency by avoiding locking of the bitmap or the lookup table. For example, updating one bit in the bitmap requires writing a 512-byte sector to the on-disk bitmap, and that sector covers a total of 512*8*64KB = 256MB of data. Without the journal, any two writes that fall within that 256MB and require updating the bitmap could not be processed concurrently. The journal solves this problem and eliminates the locking.
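To make the journal mechanism more concrete, here is a minimal sketch of the commit path, assuming an in-memory bitmap and lookup table that are flushed in bulk. The record layout, the function names, and the no-op I/O stubs are hypothetical; this is not FVD's actual code.

#include <stddef.h>
#include <stdint.h>

#define JOURNAL_SIZE  (16 * 1024 * 1024)   /* 16MB journal, the default above */

/* One journal record: bundles a lookup-table update with the related
 * bitmap update so that both travel in a single sequential write. */
struct journal_record {
    uint64_t chunk;         /* which chunk's table entry changed      */
    uint64_t new_iba;       /* its new location in the image file     */
    uint64_t bitmap_word;   /* index of the affected bitmap word      */
    uint64_t bitmap_bits;   /* new value of that bitmap word          */
};

struct fvd_meta {
    uint8_t  *bitmap;         /* in-memory copy, 2MB for a 1TB disk    */
    uint64_t *table;          /* in-memory copy, 4MB for a 1TB disk    */
    uint64_t  journal_used;   /* bytes consumed in the on-disk journal */
};

/* No-op stand-ins for the real disk I/O, so the sketch is self-contained. */
static void append_to_journal(struct fvd_meta *m, const void *rec, size_t len)
{
    (void)m; (void)rec; (void)len;   /* real code would write to the journal */
}
static void flush_bitmap_and_table(struct fvd_meta *m)
{
    (void)m;   /* real code would write out the whole bitmap and table */
}

/*
 * Commit one metadata change. The bitmap change and the table change go
 * into one journal append, and concurrent commits become back-to-back
 * sequential writes, so no on-disk bitmap or table sector is rewritten
 * in place on the hot path and no sector-level locking is needed.
 */
void commit_metadata(struct fvd_meta *m, const struct journal_record *rec)
{
    if (m->journal_used + sizeof(*rec) > JOURNAL_SIZE) {
        /* Journal full: flush the small bitmap and table entirely,
         * then recycle the journal from the beginning. */
        flush_bitmap_and_table(m);
        m->journal_used = 0;
    }
    append_to_journal(m, rec, sizeof(*rec));
    m->journal_used += sizeof(*rec);
}

The sketch is only meant to show why one journal append replaces two in-place sector updates, and why the 256MB-wide serialization problem described above disappears.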
For F4), it is actually quite straightforward to eliminate the host file system. The main thing an image format needs from the host file system is storage allocation, and that function is already performed by a compact image. Using a host file system simply ends up doing storage allocation twice, which updates on-disk metadata twice and introduces a distorted data layout twice. Therefore, if we migrate the necessary functions of a host file system into the image format, in other words, implement a mini file system inside the image format, we can get rid of the host file system altogether. This is exactly what FVD does, by slightly enhancing the compact-image function that is already there.

FVD can manage incrementally added storage space, like ZFS and unlike ext2/3/4. For example, when FVD manages a 100GB virtual disk, it initially gets 5GB of storage space from the logical volume manager and uses it to host many 1MB chunks. When the first 5GB is used up, FVD gets another 5GB to host more 1MB chunks, and so forth. Unlike QCOW2 and more like a file system, FVD does not always have to allocate a new chunk right after the previously allocated one. Instead, it may spread the used chunks across the available storage space in order to mimic a RAW-image-like data layout. More details will be explained in follow-up emails.

The description above is long, but it is still only a summary. Please refer to the more detailed information on the project web site: http://researcher.watson.ibm.com/researcher/view_project.php?id=1852 . Hopefully I have given a useful overview of the problems, the requirements, and the solutions in FVD, which can serve as the basis for a productive discussion.

Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang