Am 28.10.2010 13:01, schrieb Stefan Hajnoczi: > Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com> > --- > docs/specs/qed_spec.txt | 128 > +++++++++++++++++++++++++++++++++++++++++++++++ > 1 files changed, 128 insertions(+), 0 deletions(-) > create mode 100644 docs/specs/qed_spec.txt > > diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt > new file mode 100644 > index 0000000..e4425c8 > --- /dev/null > +++ b/docs/specs/qed_spec.txt > @@ -0,0 +1,128 @@ > +=Specification= > + > +The file format looks like this: > + > + +----------+----------+----------+-----+ > + | cluster0 | cluster1 | cluster2 | ... | > + +----------+----------+----------+-----+ > + > +The first cluster begins with the '''header'''. The header contains > information about where regular clusters start; this allows the header to be > extensible and store extra information about the image file. A regular > cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 > and L2 tables are composed of one or more contiguous clusters. > + > +Normally the file size will be a multiple of the cluster size. If the file > size is not a multiple, extra information after the last cluster may not be > preserved if data is written. Legitimate extra information should use space > between the header and the first regular cluster. > + > +All fields are little-endian. > + > +==Header== > + Header { > + uint32_t magic; /* QED\0 */ > + > + uint32_t cluster_size; /* in bytes */ > + uint32_t table_size; /* for L1 and L2 tables, in clusters */ > + uint32_t header_size; /* in clusters */ > + > + uint64_t features; /* format feature bits */ > + uint64_t compat_features; /* compat feature bits */ > + uint64_t l1_table_offset; /* in bytes */ > + uint64_t image_size; /* total logical image size, in bytes */ > + > + /* if (features & QED_F_BACKING_FILE) */ > + uint32_t backing_filename_offset; /* in bytes from start of header */ > + uint32_t backing_filename_size; /* in bytes */ > + } > + > +Field descriptions: > +* ''cluster_size'' must be a power of 2 in range [2^12, 2^26]. > +* ''table_size'' must be a power of 2 in range [1, 16]. > +* ''header_size'' is the number of clusters used by the header and any > additional information stored before regular clusters. > +* ''features'', ''compat_features'', and ''autoclear_features'' are file > format extension bitmaps. They work as follows: > +** An image with unknown ''features'' bits enabled must not be opened. File > format changes that are not backwards-compatible must use ''features'' bits. > +** An image with unknown ''compat_features'' bits enabled can be opened > safely. The unknown features are simply ignored and represent > backwards-compatible changes to the file format. > +** An image with unknown ''autoclear_features'' bits enable can be opened > safely after clearing the unknown bits. This allows for backwards-compatible > changes to the file format which degrade gracefully and can be re-enabled > again by a new program later.
autoclear features aren't even part of the header in the spec. > +* ''l1_table_offset'' is the offset of the first byte of the L1 table in the > image file and must be a multiple of ''cluster_size''. > +* ''image_size'' is the block device size seen by the guest and must be a > multiple of 512 bytes. > +* ''backing_filename'' is a string in (byte offset, byte size) form. It is > not NUL-terminated and has no alignment constraints. > + > +Feature bits: > +* QED_F_BACKING_FILE = 0x01. The image uses a backing file. The backing > filename string is given in the ''backing_filename_{offset,size}'' fields and > may be an absolute path or relative to the image file. > +* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use. > +* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk > image and no file format autodetection should be attempted. This should be > used to ensure that raw backing images are never detected as an image format > if they happen to contain magic constants. > + > +There are currently no defined ''compat_features'' or ''autoclear_features'' > bits. > + > +Fields predicated on a feature bit are only used when that feature is set. > The fields always take up header space, regardless of whether or not the > feature bit is set. > + > +==Tables== > + > +Tables provide the translation from logical offsets in the block device to > cluster offsets in the file. > + > + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t)) > + > + Table { > + uint64_t offsets[TABLE_NOFFSETS]; > + } > + > +The tables are organized as follows: > + > + +----------+ > + | L1 table | > + +----------+ > + ,------' | '------. > + +----------+ | +----------+ > + | L2 table | ... | L2 table | > + +----------+ +----------+ > + ,------' | '------. > + +----------+ | +----------+ > + | Data | ... | Data | > + +----------+ +----------+ > + > +A table is made up of one or more contiguous clusters. The table_size > header field determines table size for an image file. For example, > cluster_size=64 KB and table_size=4 results in 256 KB tables. > + > +The logical image size must be less than or equal to the maximum possible > size of clusters rooted by the L1 table: > + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size > + > +All offsets in L1 and L2 tables are cluster-aligned. The least significant > bits up to ''cluster_size'' are reserved and must be zero. I know what you mean here, but the text leaves things a bit unclear. First I would expect a bit number instead of a byte offset for "bits up to x". Second, cluster_size is the first bit not reserved, whereas your description sounds to me as if it included cluster_size. > This may be used in future format extensions to store per-offset information. > + > +The following offsets have special meanings: > + > +===L2 table offsets=== > +* 0 - unallocated. The L2 table is not yet allocated. > + > +===Data cluster offsets=== > +* 0 - unallocated. The data cluster is not yet allocated. > + > +===Unallocated L2 tables and data clusters=== > +Reads to an unallocated area of the image file access the backing file. If > there is no backing file, then zeroes are produced. The backing file may be > smaller than the image file and reads of unallocated areas beyond the end of > the backing file produce zeroes. > + > +Writes to an unallocated area cause a new data clusters to be allocated, and > a new L2 table if that is also unallocated. The new data cluster is > populated with data from the backing image (or zeroes if no backing image) > and the data being written. > + > +===Logical offset translation=== > +Logical offsets are translated into cluster offsets as follows: > + > + table_bits table_bits cluster_bits > + <--------> <--------> <---------------> > + +----------+----------+-----------------+ > + | L1 index | L2 index | byte offset | > + +----------+----------+-----------------+ > + > + Structure of a logical offset > + > + offset_mask = ~(cluster_size - 1) # mask for the image file byte offset > + > + def logical_to_cluster_offset(l1_index, l2_index, byte_offset): > + l2_offset = l1_table[l1_index] > + l2_table = load_table(l2_offset) > + cluster_offset = l2_table[l2_index] & offset_mask > + return cluster_offset + byte_offset > + > +==Consistency checking== > + > +This section is informational and included to provide background on the use > of the QED_F_NEED_CHECK ''features'' bit. > + > +The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting > an operation that could leave the image in an inconsistent state if > interrupted by a crash or power failure. A dirty image must be checked on > open because its metadata may not be consistent. > + > +Consistency check includes the following invariants: > +# Each cluster is referenced once and only once. It is an inconsistency to > have a cluster referenced more than once by L1 or L2 tables. A cluster has > been leaked if it has no references. > +# Offsets must be within the image file size and must be ''cluster_size'' > aligned. > +# Table offsets must at least ''table_size'' * ''cluster_size'' bytes from > the end of the image file so that there is space for the entire table. > + > +The consistency check process starts by from ''l1_table_offset'' and scans > all L2 tables. After the check completes with no other errors besides leaks, > the QED_F_NEED_CHECK bit can be cleared and the image can be accessed. Looks okay otherwise. Kevin