On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
> On 08.10.2010 17:48, Stefan Hajnoczi wrote:
> > Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
> > ---
> >  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 94 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/specs/qed_spec.txt
> >
> > diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
> > new file mode 100644
> > index 0000000..c942b8e
> > --- /dev/null
> > +++ b/docs/specs/qed_spec.txt
> > @@ -0,0 +1,94 @@
> > +=Specification=
> > +
> > +The file format looks like this:
> > +
> > + +----------+----------+----------+-----+
> > + | cluster0 | cluster1 | cluster2 | ... |
> > + +----------+----------+----------+-----+
> > +
> > +The first cluster begins with the '''header'''. The header contains
> > information about where regular clusters start; this allows the header to
> > be extensible and store extra information about the image file. A regular
> > cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1
> > and L2 tables are composed of one or more contiguous clusters.
> > +
> > +Normally the file size will be a multiple of the cluster size. If the
> > file size is not a multiple, extra information after the last cluster may
> > not be preserved if data is written. Legitimate extra information should
> > use space between the header and the first regular cluster.
> > +
> > +All fields are little-endian.
> > +
> > +==Header==
> > + Header {
> > +     uint32_t magic;               /* QED\0 */
> > +
> > +     uint32_t cluster_size;        /* in bytes */
> > +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
> > +     uint32_t header_size;         /* in clusters */
> > +
> > +     uint64_t features;            /* format feature bits */
> > +     uint64_t compat_features;     /* compat feature bits */
> > +     uint64_t l1_table_offset;     /* in bytes */
> > +     uint64_t image_size;          /* total logical image size, in bytes */
> > +
> > +     /* if (features & QED_F_BACKING_FILE) */
> > +     uint32_t backing_filename_offset; /* in bytes from start of header */
> > +     uint32_t backing_filename_size;   /* in bytes */
> > +
> > +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
> > +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
> > +     uint32_t backing_fmt_size;    /* in bytes */
>
> It was discussed before, but I don't think we came to a conclusion. Are
> there any circumstances under which you don't want to set the
> QED_CF_BACKING_FORMAT flag?

I suggest the following:

QED_CF_BACKING_FORMAT_RAW = 0x1

When set, the backing file is a raw image and should not be probed for
its file format. The default (unset) means that the backing image file
format may be probed.

Now the backing_fmt_{offset,size} are no longer necessary.

>
> Also it's unclear what this "if" actually means: If the flag isn't set,
> are the fields zero, are they undefined or are they even completely
> missing and the offsets of the following fields must be adjusted?

I have updated the wiki:

"Fields predicated on a feature bit are only used when that feature is
set. The fields always take up header space, regardless of whether or
not the feature bit is set."

>
> > + }
> > +
> > +Field descriptions:
> > +* cluster_size must be a power of 2 in range [2^12, 2^26].
> > +* table_size must be a power of 2 in range [1, 16].
>
> Is there a reason why this must be a power of two?

The power of two makes logical-to-cluster offset translation easy and
cheap, because each index is a plain shift and mask:

l2_table = get_l2_table(l1_table[logical >> l1_shift])
cluster = l2_table[(logical >> l2_shift) & l2_mask] + (logical & cluster_mask)

>
> > +* header_size is the number of clusters used by the header and any
> > additional information stored before regular clusters.
> > +* features and compat_features are bitmaps where active file format
> > features can be selectively enabled. The difference between the two is
> > that an image file that uses unknown compat_features bits can be safely
> > opened without knowing how to interpret those bits. If an image file has
> > an unsupported features bit set then it is not possible to open that image
> > (the image is not backwards-compatible).
> > +* l1_table_offset must be a multiple of cluster_size.
>
> And it is the offset of the first byte of the L1 table in the image file.

Updated, thanks.

>
> > +* image_size is the block device size seen by the guest and must be a
> > multiple of cluster_size.
>
> So there are image sizes that can't be accurately represented in QED? I
> think that's a bad idea. Even more so because I can't see how it greatly
> simplifies implementation (you save the operation for rounding up on
> open/create, that's it) - it looks like a completely arbitrary restriction.

Good point. I will try to lift this restriction in v3.

>
> > +* backing_filename and backing_fmt are both strings in (byte offset, byte
> > size) form. They are not NUL-terminated and do not have alignment
> > constraints.
>
> A description of the meaning of these strings is missing.

Update:

"The backing filename string is given in the
backing_filename_{offset,size} fields and may be an absolute path or
relative to the image file."

>
> > +
> > +Feature bits:
> > +* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
> > +* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
> > +* QED_CF_BACKING_FORMAT = 0x01. The image has a specific backing file
> > format stored.
>
> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
> twice is confusing at first sight.

Updated, thanks.

>
> > +
> > +==Tables==
> > +
> > +Tables provide the translation from logical offsets in the block device to
> > cluster offsets in the file.
> > +
> > + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
> > +
> > + Table {
> > +     uint64_t offsets[TABLE_NOFFSETS];
> > + }
> > +
> > +The tables are organized as follows:
> > +
> > +                    +----------+
> > +                    | L1 table |
> > +                    +----------+
> > +               ,------'  |  '------.
> > +          +----------+   |    +----------+
> > +          | L2 table |  ...   | L2 table |
> > +          +----------+        +----------+
> > +      ,------'  |  '------.
> > + +----------+   |    +----------+
> > + |   Data   |  ...   |   Data   |
> > + +----------+        +----------+
> > +
> > +A table is made up of one or more contiguous clusters. The table_size
> > header field determines table size for an image file. For example,
> > cluster_size=64 KB and table_size=4 results in 256 KB tables.
> > +
> > +The logical image size must be less than or equal to the maximum possible
> > size of clusters rooted by the L1 table:
> > + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
> > +
> > +Logical offsets are translated into cluster offsets as follows:
> > +
> > +  table_bits table_bits    cluster_bits
> > +  <--------> <--------> <--------------->
> > + +----------+----------+-----------------+
> > + | L1 index | L2 index |     byte offset |
> > + +----------+----------+-----------------+
> > +
> > +       Structure of a logical offset
> > +
> > + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
> > +     l2_offset = l1_table[l1_index]
> > +     l2_table = load_table(l2_offset)
> > +     cluster_offset = l2_table[l2_index]
> > +     return cluster_offset + byte_offset
>
> Should we reserve some bits in the table entries in case we need some
> flags later? Also, I suppose all table entries must be cluster aligned?

Yes, let's do that. At least for sparse zero cluster tracking we need a
bit. The minimum 4k cluster size gives us 12 bits to play with.

>
> What happened to the other sections that older versions of the spec
> contained? For example, this version doesn't specify any more what the
> semantics of unallocated clusters and backing files is.

I removed them because they don't describe the on-disk layout and were
more of a way to think through the implementation than a format
specification. It was more a decision to focus my effort on improving
the on-disk layout specification than anything else.

Do you want the semantics in the specification, or is it okay to leave
that part on the wiki only?

Stefan
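
For illustration, here is a minimal Python sketch of the shift-and-mask
translation discussed above. It is not code from the patch: the helper
names split_logical_offset and max_image_size are hypothetical, and it
assumes the [L1 index | L2 index | byte offset] layout and the
TABLE_NOFFSETS definition quoted from the spec.

```python
UINT64_SIZE = 8  # sizeof(uint64_t), the size of one table entry


def table_noffsets(cluster_size, table_size):
    """Number of 64-bit offsets in one L1 or L2 table (TABLE_NOFFSETS)."""
    return table_size * cluster_size // UINT64_SIZE


def split_logical_offset(logical, cluster_size, table_size):
    """Split a logical offset into (l1_index, l2_index, byte_offset).

    Hypothetical helper: relies on cluster_size and table_size being
    powers of two, so every step is a shift or a mask.
    """
    assert cluster_size & (cluster_size - 1) == 0, "cluster_size must be 2^n"
    assert table_size & (table_size - 1) == 0, "table_size must be 2^n"

    noffsets = table_noffsets(cluster_size, table_size)
    cluster_bits = cluster_size.bit_length() - 1  # log2(cluster_size)
    table_bits = noffsets.bit_length() - 1        # log2(TABLE_NOFFSETS)

    byte_offset = logical & (cluster_size - 1)
    l2_index = (logical >> cluster_bits) & (noffsets - 1)
    l1_index = logical >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset


def max_image_size(cluster_size, table_size):
    """Largest logical size addressable through one L1 table."""
    noffsets = table_noffsets(cluster_size, table_size)
    return noffsets * noffsets * cluster_size


if __name__ == "__main__":
    # cluster_size=64 KB and table_size=4, as in the spec's example.
    logical = (3 << 31) | (5 << 16) | 7
    print(split_logical_offset(logical, 64 * 1024, 4))
    print(max_image_size(64 * 1024, 4))
```

With non-power-of-two sizes the same split would need division and
modulo by TABLE_NOFFSETS and cluster_size instead of shifts and masks.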