On 09/07/2010 09:51 AM, Avi Kivity wrote:
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_file_offset; /* in bytes from start of header */
uint32_t backing_file_size; /* in bytes */
It's really the filename size, not the file size. Also, make a note
that it is not zero-terminated.
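For illustration only (the helper name and buffer handling here are my
own, not from the spec), a reader has to copy the name out and
terminate it itself, since the on-disk string carries no NUL:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: copies the backing filename out of an already-read
 * header buffer.  The on-disk string is not zero-terminated, so the
 * size field is authoritative and we append the NUL ourselves. */
char *read_backing_filename(const uint8_t *header_buf,
                            uint32_t backing_file_offset,
                            uint32_t backing_file_size)
{
    char *name = malloc(backing_file_size + 1);
    if (!name) {
        return NULL;
    }
    memcpy(name, header_buf + backing_file_offset, backing_file_size);
    name[backing_file_size] = '\0';
    return name;
}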
/* if (compat_features & QED_CF_BACKING_FORMAT) */
uint32_t backing_fmt_offset; /* in bytes from start of header */
uint32_t backing_fmt_size; /* in bytes */
Why not make it mandatory?
You mean, why not make it:
/* if (features & QED_F_BACKING_FILE) */
As opposed to an independent compat feature. Mandatory features mean
that you cannot read an image at all if you don't understand the
feature. In the context of backing_format, it means you have to have
all of the possible values fully defined.
IOW, what are valid values for backing_fmt? "raw" and "qed" are obvious,
but what does it mean from a formal specification perspective to have
"vmdk"? Is that VMDK v3 or v4, and what if there's a v5?
If we make backing_fmt a suggestion, we keep the flexibility to leave
this loosely defined, while implementations can fall back to probing if
there's any doubt.
For the spec, I'd like to define "raw" and "qed". I'd like to modify
the qemu implementation to refuse to load an image as raw unless
backing_fmt is raw, and otherwise just probe.
For image creation, if an explicit backing format isn't specified by the
user, I'd like to insert backing_fmt=raw for probed raw images and
otherwise not specify a backing_fmt.
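A minimal sketch of that policy, assuming a hypothetical probe_format()
content-sniffing helper (this is not the actual qemu code):

#include <string.h>

extern const char *probe_format(const char *filename);

/* Returns the format to use for the backing image, or NULL to refuse:
 * an explicit backing_fmt always wins; a probed "raw" is refused
 * because raw contents are guest-controlled; any other probed format
 * is accepted as a best-effort guess. */
const char *resolve_backing_format(const char *backing_fmt,
                                   const char *backing_file)
{
    if (backing_fmt) {
        return backing_fmt;
    }
    const char *probed = probe_format(backing_file);
    if (probed && strcmp(probed, "raw") == 0) {
        return NULL;
    }
    return probed;
}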
Regards,
Anthony Liguori
}
Need a checksum for the header.
==Extent table==
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
It's fashionable to put checksums here.
Do we want a real extent-based format like modern filesystems, so that
after defragmentation a full image has O(1) metadata?
The extent tables are organized as follows:
                +----------+
                | L1 table |
                +----------+
           ,------'  |  '------.
  +----------+       |       +----------+
  | L2 table |      ...      | L2 table |
  +----------+               +----------+
           ,------'  |  '------.
  +----------+       |       +----------+
  |   Data   |      ...      |   Data   |
  +----------+               +----------+
The table_size field allows tables to be multiples of the cluster
size. For example, cluster_size=64 KB and table_size=4 results in
256 KB tables.
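To make the arithmetic concrete, here is a small worked example (the
noffsets^2 maximum-size derivation is my own back-of-envelope, not spec
text):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t cluster_size = 64 * 1024;  /* 64 KB */
    uint64_t table_size   = 4;          /* table is 4 clusters = 256 KB */

    /* TABLE_NOFFSETS from the spec: offsets that fit in one table. */
    uint64_t noffsets = table_size * cluster_size / sizeof(uint64_t);
    printf("offsets per table: %" PRIu64 "\n", noffsets);  /* 32768 */

    /* Each L1 entry names an L2 table and each L2 entry names a data
     * cluster, so the two-level tree covers noffsets^2 clusters. */
    uint64_t max_bytes = noffsets * noffsets * cluster_size;
    printf("max image size: %" PRIu64 " TB\n", max_bytes >> 40);  /* 64 */
    return 0;
}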
=Operations=
==Read==
# If L2 table is not present in L1, read from backing image.
# If data cluster is not present in L2, read from backing image.
# Otherwise read data from cluster.
If the data is not in the backing image either, provide zeros (see the
sketch below).
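A sketch of that read path, with QEDState and all helpers as assumed
names rather than real qemu functions, handling one cluster per request
for brevity:

int qed_read_cluster(QEDState *s, uint64_t pos, void *buf, size_t len)
{
    /* backing_read() supplies zeros when there is no backing image or
     * when the request lies beyond its end. */
    uint64_t l2_offset = l1_lookup(s, pos);       /* 0 = no entry */
    if (!l2_offset) {
        return backing_read(s, pos, buf, len);    /* rule 1 */
    }
    uint64_t data_offset = l2_lookup(s, l2_offset, pos);
    if (!data_offset) {
        return backing_read(s, pos, buf, len);    /* rule 2 */
    }
    /* rule 3: read directly from the allocated cluster */
    return file_read(s, data_offset + pos % s->cluster_size, buf, len);
}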
==Write==
# If L2 table is not present in L1, allocate new cluster and L2.
Perform L2 and L1 link after writing data.
# If data cluster is not present in L2, allocate new cluster.
Perform L2 link after writing data.
# Otherwise overwrite data cluster.
Detail copy-on-write from backing image.
On a partial write without a backing file, do we recommend
zero-filling the cluster (to avoid intra-cluster fragmentation)?
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced, the worst-case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed,
and L2 now points at a free cluster.
The L1 link '''must''' be made after the L2 cluster is in place on
storage. If the order is reversed then the L1 table may point to a
bogus L2 table. (Is this a problem since clusters are allocated at
the end of the file?)
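A hedged sketch of an allocating write that honors both ordering rules.
QEDState and the helpers are assumed names, and a real implementation
would order in-flight requests rather than flush synchronously:

int qed_alloc_write(QEDState *s, uint64_t pos, const void *buf, size_t len)
{
    uint64_t data_off = alloc_cluster(s);    /* grows the end of file */
    file_write(s, data_off, buf, len);
    file_flush(s);        /* data stable before the L2 link ("should") */

    uint64_t l2_off = l1_lookup(s, pos);
    if (l2_off) {
        l2_link(s, l2_off, pos, data_off);   /* existing L2: one link */
        return 0;
    }

    l2_off = alloc_cluster(s);
    zero_cluster(s, l2_off);                 /* fresh L2 starts empty */
    l2_link(s, l2_off, pos, data_off);
    file_flush(s);        /* L2 stable before the L1 link ("must") */
    l1_link(s, pos, l2_off);
    return 0;
}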
==Grow==
# If TABLE_NOFFSETS * TABLE_NOFFSETS * cluster_size < new_image_size,
fail -EOVERFLOW. The L1 table is not big enough.
With a variable-height tree, we allocate a new root, link its first
entry to the old root, and write the new header with updated root and
height.
# Write new image_size header field.
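A sketch of the grow check under those rules, with QEDHeader, noffsets
and write_header() as assumed names:

#include <errno.h>
#include <stdint.h>

int qed_grow(QEDHeader *h, uint64_t noffsets, uint64_t new_image_size)
{
    /* The two-level tree addresses at most noffsets^2 clusters. */
    uint64_t max_size = noffsets * noffsets * h->cluster_size;
    if (new_image_size > max_size) {
        return -EOVERFLOW;       /* L1 table is not big enough */
    }
    h->image_size = new_image_size;
    return write_header(h);      /* persist the new image_size field */
}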
=Data integrity=
==Write==
Writes that complete before a flush must be stable when the flush
completes.
If storage is interrupted (e.g. power outage) then writes in progress
may be lost, stable, or partially completed. The storage must not be
otherwise corrupted or inaccessible after it is restarted.
We can remove this requirement by doing copy-on-write for all metadata
updates and keeping two copies of the header (with version numbers and
checksums). Enterprise storage will not corrupt on writes, but
commodity storage may.
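As a sketch of the two-copy idea (the layout, field names, and choice
of crc32c are all assumptions, not part of the draft): read both copies
on open, discard any copy whose checksum fails, and use the survivor
with the higher version number.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical duplicated header; the checksum covers everything
 * after the csum field itself. */
typedef struct {
    uint32_t csum;
    uint32_t version;        /* incremented on every header write */
    uint8_t  payload[4088];  /* rest of a 4 KB header copy */
} DupHeader;

/* Any 32-bit checksum would do; assumed to exist elsewhere. */
extern uint32_t crc32c(const void *buf, size_t len);

static int header_ok(const DupHeader *h)
{
    size_t covered = sizeof(*h) - offsetof(DupHeader, version);
    return h->csum == crc32c(&h->version, covered);
}

/* Pick the newest copy that passes its checksum; NULL if both are bad. */
const DupHeader *pick_header(const DupHeader *a, const DupHeader *b)
{
    int a_ok = header_ok(a), b_ok = header_ok(b);
    if (a_ok && b_ok) {
        return a->version >= b->version ? a : b;
    }
    return a_ok ? a : (b_ok ? b : NULL);
}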