On 09/07/2010 09:51 AM, Avi Kivity wrote:
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_file_offset; /* in bytes from start of header */
uint32_t backing_file_size; /* in bytes */
It's really the filename size, not the file size. Also, make a note
that it is not zero-terminated.
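For illustration only (the helper name and buffer handling here are my
own, not from the spec), a reader has to copy the name out and
terminate it itself, since the on-disk string carries no NUL:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: copies the backing filename out of an already-read
 * header buffer.  The on-disk string is not zero-terminated, so the
 * size field is authoritative and we append the NUL ourselves. */
char *read_backing_filename(const uint8_t *header_buf,
                            uint32_t backing_file_offset,
                            uint32_t backing_file_size)
{
    char *name = malloc(backing_file_size + 1);
    if (!name) {
        return NULL;
    }
    memcpy(name, header_buf + backing_file_offset, backing_file_size);
    name[backing_file_size] = '\0';
    return name;
}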
/* if (compat_features & QED_CF_BACKING_FORMAT) */
uint32_t backing_fmt_offset; /* in bytes from start of header */
uint32_t backing_fmt_size; /* in bytes */
Why not make it mandatory?
You mean, why not make it:
/* if (features & QED_F_BACKING_FILE) */
As opposed to an independent compat feature. Mandatory features mean
that you cannot read an image at all if you don't understand the
feature. In the context of backing_format, it means you have to have
all of the possible values fully defined.
IOW, what are valid values for backing_fmt? "raw" and "qed" are obvious,
but what does it mean from a formal specification perspective to have
"vmdk"? Is that VMDK v3 or v4, and what if there's a v5?
If we make backing_fmt a suggestion, we keep the flexibility to leave
this loosely defined, while implementations can fall back to probing if
there's any doubt.
For the spec, I'd like to define "raw" and "qed". I'd like to modify
the qemu implementation to refuse to load an image as raw unless
backing_fmt is raw, and otherwise just probe.
For image creation, if an explicit backing format isn't specified by the
user, I'd like to insert backing_fmt=raw for probed raw images and
otherwise not specify a backing_fmt.
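A minimal sketch of that policy, assuming a hypothetical probe_format()
content-sniffing helper (this is not the actual qemu code):

#include <string.h>

extern const char *probe_format(const char *filename);

/* Returns the format to use for the backing image, or NULL to refuse:
 * an explicit backing_fmt always wins; a probed "raw" is refused
 * because raw contents are guest-controlled; any other probed format
 * is accepted as a best-effort guess. */
const char *resolve_backing_format(const char *backing_fmt,
                                   const char *backing_file)
{
    if (backing_fmt) {
        return backing_fmt;
    }
    const char *probed = probe_format(backing_file);
    if (probed && strcmp(probed, "raw") == 0) {
        return NULL;
    }
    return probed;
}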
Regards,
Anthony Liguori
}
Need a checksum for the header.
==Extent table==
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
It's fashionable to put checksums here.
Do we want a real extent-based format like modern filesystems, so that
after defragmentation a full image has O(1) metadata?
The extent tables are organized as follows:
                +----------+
                | L1 table |
                +----------+
           ,------'  |  '------.
  +----------+       |       +----------+
  | L2 table |      ...      | L2 table |
  +----------+               +----------+
           ,------'  |  '------.
  +----------+       |       +----------+
  |   Data   |      ...      |   Data   |
  +----------+               +----------+
The table_size field allows tables to be multiples of the cluster
size. For example, cluster_size=64 KB and table_size=4 results in
256 KB tables.
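To make the arithmetic concrete, here is a small worked example (the
noffsets^2 maximum-size derivation is my own back-of-envelope, not spec
text):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t cluster_size = 64 * 1024;  /* 64 KB */
    uint64_t table_size   = 4;          /* table is 4 clusters = 256 KB */

    /* TABLE_NOFFSETS from the spec: offsets that fit in one table. */
    uint64_t noffsets = table_size * cluster_size / sizeof(uint64_t);
    printf("offsets per table: %" PRIu64 "\n", noffsets);  /* 32768 */

    /* Each L1 entry names an L2 table and each L2 entry names a data
     * cluster, so the two-level tree covers noffsets^2 clusters. */
    uint64_t max_bytes = noffsets * noffsets * cluster_size;
    printf("max image size: %" PRIu64 " TB\n", max_bytes >> 40);  /* 64 */
    return 0;
}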
=Operations=
==Read==
# If L2 table is not present in L1, read from backing image.
# If data cluster is not present in L2, read from backing image.
# Otherwise read data from cluster.
If the data is not in the backing image either, provide zeros (see the
sketch below).
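A sketch of that read path, with QEDState and all helpers as assumed
names rather than real qemu functions, handling one cluster per request
for brevity:

int qed_read_cluster(QEDState *s, uint64_t pos, void *buf, size_t len)
{
    /* backing_read() supplies zeros when there is no backing image or
     * when the request lies beyond its end. */
    uint64_t l2_offset = l1_lookup(s, pos);       /* 0 = no entry */
    if (!l2_offset) {
        return backing_read(s, pos, buf, len);    /* rule 1 */
    }
    uint64_t data_offset = l2_lookup(s, l2_offset, pos);
    if (!data_offset) {
        return backing_read(s, pos, buf, len);    /* rule 2 */
    }
    /* rule 3: read directly from the allocated cluster */
    return file_read(s, data_offset + pos % s->cluster_size, buf, len);
}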
==Write==
# If L2 table is not present in L1, allocate new cluster and L2.
Perform L2 and L1 link after writing data.
# If data cluster is not present in L2, allocate new cluster.
Perform L2 link after writing data.
# Otherwise overwrite data cluster.
Detail copy-on-write from backing image.
On a partial write without a backing file, do we recommend
zero-filling the cluster (to avoid intra-cluster fragmentation)?
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced, the worst-case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed,
and L2 now points at a free cluster.
The L1 link '''must''' be made after the L2 cluster is in place on
storage. If the order is reversed then the L1 table may point to a
bogus L2 table. (Is this a problem since clusters are allocated at
the end of the file?)
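A hedged sketch of an allocating write that honors both ordering rules.
QEDState and the helpers are assumed names, and a real implementation
would order in-flight requests rather than flush synchronously:

int qed_alloc_write(QEDState *s, uint64_t pos, const void *buf, size_t len)
{
    uint64_t data_off = alloc_cluster(s);    /* grows the end of file */
    file_write(s, data_off, buf, len);
    file_flush(s);        /* data stable before the L2 link ("should") */

    uint64_t l2_off = l1_lookup(s, pos);
    if (l2_off) {
        l2_link(s, l2_off, pos, data_off);   /* existing L2: one link */
        return 0;
    }

    l2_off = alloc_cluster(s);
    zero_cluster(s, l2_off);                 /* fresh L2 starts empty */
    l2_link(s, l2_off, pos, data_off);
    file_flush(s);        /* L2 stable before the L1 link ("must") */
    l1_link(s, pos, l2_off);
    return 0;
}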
==Grow==
# If TABLE_NOFFSETS * TABLE_NOFFSETS * cluster_size < new_image_size,
fail -EOVERFLOW. The L1 table is not big enough.
With a variable-height tree, we allocate a new root, link its first
entry to the old root, and write the new header with updated root and
height.
# Write new image_size header field.
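A sketch of the grow check under those rules, with QEDHeader, noffsets
and write_header() as assumed names:

#include <errno.h>
#include <stdint.h>

int qed_grow(QEDHeader *h, uint64_t noffsets, uint64_t new_image_size)
{
    /* The two-level tree addresses at most noffsets^2 clusters. */
    uint64_t max_size = noffsets * noffsets * h->cluster_size;
    if (new_image_size > max_size) {
        return -EOVERFLOW;       /* L1 table is not big enough */
    }
    h->image_size = new_image_size;
    return write_header(h);      /* persist the new image_size field */
}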
=Data integrity=
==Write==
Writes that complete before a flush must be stable when the flush
completes.
If storage is interrupted (e.g. power outage) then writes in progress
may be lost, stable, or partially completed. The storage must not be
otherwise corrupted or inaccessible after it is restarted.
We can remove this requirement by doing copy-on-write for all metadata
updates and keeping two copies of the header (with version numbers and
checksums). Enterprise storage will not corrupt on writes, but
commodity storage may.
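As a sketch of the two-copy idea (the layout, field names, and choice
of crc32c are all assumptions, not part of the draft): read both copies
on open, discard any copy whose checksum fails, and use the survivor
with the higher version number.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical duplicated header; the checksum covers everything
 * after the csum field itself. */
typedef struct {
    uint32_t csum;
    uint32_t version;        /* incremented on every header write */
    uint8_t  payload[4088];  /* rest of a 4 KB header copy */
} DupHeader;

/* Any 32-bit checksum would do; assumed to exist elsewhere. */
extern uint32_t crc32c(const void *buf, size_t len);

static int header_ok(const DupHeader *h)
{
    size_t covered = sizeof(*h) - offsetof(DupHeader, version);
    return h->csum == crc32c(&h->version, covered);
}

/* Pick the newest copy that passes its checksum; NULL if both are bad. */
const DupHeader *pick_header(const DupHeader *a, const DupHeader *b)
{
    int a_ok = header_ok(a), b_ok = header_ok(b);
    if (a_ok && b_ok) {
        return a->version >= b->version ? a : b;
    }
    return a_ok ? a : (b_ok ? b : NULL);
}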