[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

Stefan Hajnoczi Mon, 11 Oct 2010 09:00:03 -0700

On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
> On 08.10.2010 17:48, Stefan Hajnoczi wrote:
> > Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
> > ---
> >  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 94 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/specs/qed_spec.txt
> > 
> > diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
> > new file mode 100644
> > index 0000000..c942b8e
> > --- /dev/null
> > +++ b/docs/specs/qed_spec.txt
> > @@ -0,0 +1,94 @@
> > +=Specification=
> > +
> > +The file format looks like this:
> > +
> > + +----------+----------+----------+-----+
> > + | cluster0 | cluster1 | cluster2 | ... |
> > + +----------+----------+----------+-----+
> > +
> > +The first cluster begins with the '''header'''.  The header contains 
> > information about where regular clusters start; this allows the header to 
> > be extensible and store extra information about the image file.  A regular 
> > cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 
> > and L2 tables are composed of one or more contiguous clusters.
> > +
> > +Normally the file size will be a multiple of the cluster size.  If the 
> > file size is not a multiple, extra information after the last cluster may 
> > not be preserved if data is written.  Legitimate extra information should 
> > use space between the header and the first regular cluster.
> > +
> > +All fields are little-endian.
> > +
> > +==Header==
> > + Header {
> > +     uint32_t magic;               /* QED\0 */
> > + 
> > +     uint32_t cluster_size;        /* in bytes */
> > +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
> > +     uint32_t header_size;         /* in clusters */
> > + 
> > +     uint64_t features;            /* format feature bits */
> > +     uint64_t compat_features;     /* compat feature bits */
> > +     uint64_t l1_table_offset;     /* in bytes */
> > +     uint64_t image_size;          /* total logical image size, in bytes */
> > + 
> > +     /* if (features & QED_F_BACKING_FILE) */
> > +     uint32_t backing_filename_offset; /* in bytes from start of header */
> > +     uint32_t backing_filename_size;   /* in bytes */
> > + 
> > +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
> > +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
> > +     uint32_t backing_fmt_size;    /* in bytes */
> 
> It was discussed before, but I don't think we came to a conclusion. Are
> there any circumstances under which you don't want to set the
> QED_CF_BACKING_FORMAT flag?


I suggest the following:

QED_CF_BACKING_FORMAT_RAW = 0x1

When set, the backing file is a raw image and should not be probed for
its file format.  The default (unset) means that the backing image file
format may be probed.

With this flag, the backing_fmt_{offset,size} header fields are no longer
necessary.
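
A minimal sketch of how a reader of the image might act on the proposed bit.
Only the flag value comes from the proposal above; the function name and the
"raw"/None return convention are made up for illustration.

```python
QED_CF_BACKING_FORMAT_RAW = 0x1  # value proposed above


def backing_format(compat_features):
    """Return "raw" when probing is disallowed, else None (caller may probe)."""
    if compat_features & QED_CF_BACKING_FORMAT_RAW:
        return "raw"
    return None
```

A caller that gets None back would fall back to whatever format probing it
already does for backing files.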

> 
> Also it's unclear what this "if" actually means: If the flag isn't set,
> are the fields zero, are they undefined or are they even completely
> missing and the offsets of the following fields must be adjusted?

I have updated the wiki:

"Fields predicated on a feature bit are only used when that feature is
set. The fields always take up header space, regardless of whether or
not the feature bit is set."

> 
> > + }
> > +
> > +Field descriptions:
> > +* cluster_size must be a power of 2 in range [2^12, 2^26].
> > +* table_size must be a power of 2 in range [1, 16].
> 
> Is there a reason why this must be a power of two?

The power of two makes logical-to-cluster offset translation easy and
cheap:

l2_table = get_l2_table(l1_table[logical >> l1_shift])
cluster = l2_table[(logical >> l2_shift) & l2_mask] + (logical & cluster_mask)
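
To spell out where the shifts and masks come from, here is a small Python
sketch that derives them from cluster_size and table_size and splits a
logical offset into (L1 index, L2 index, byte offset).  The helper name
make_translator is hypothetical, and 8 is sizeof(uint64_t) from the
TABLE_NOFFSETS definition in the spec.

```python
def make_translator(cluster_size, table_size):
    """Return a function that splits a logical offset into its three parts."""
    # Both values must be powers of two for the shift/mask arithmetic to work.
    for v in (cluster_size, table_size):
        if v <= 0 or v & (v - 1):
            raise ValueError("not a power of two")
    cluster_bits = cluster_size.bit_length() - 1
    entries = table_size * cluster_size // 8   # TABLE_NOFFSETS
    table_bits = entries.bit_length() - 1
    l2_shift = cluster_bits
    l1_shift = cluster_bits + table_bits
    l2_mask = entries - 1
    cluster_mask = cluster_size - 1

    def split(logical):
        return (logical >> l1_shift,
                (logical >> l2_shift) & l2_mask,
                logical & cluster_mask)

    return split
```

With a non-power-of-two table_size the number of entries per table would not
be a power of two either, and the index extraction would need a division and
a modulo instead.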

> 
> > +* header_size is the number of clusters used by the header and any 
> > additional information stored before regular clusters.
> > +* features and compat_features are bitmaps where active file format 
> > features can be selectively enabled.  The difference between the two is 
> > that an image file that uses unknown compat_features bits can be safely 
> > opened without knowing how to interpret those bits.  If an image file has 
> > an unsupported features bit set then it is not possible to open that image 
> > (the image is not backwards-compatible).
> > +* l1_table_offset must be a multiple of cluster_size.
> 
> And it is the offset of the first byte of the L1 table in the image file.

Updated, thanks.

> 
> > +* image_size is the block device size seen by the guest and must be a 
> > multiple of cluster_size.
> 
> So there are image sizes that can't be accurately represented in QED? I
> think that's a bad idea. Even more so because I can't see how it greatly
> simplifies implementation (you save the operation for rounding up on
> open/create, that's it) - it looks like a completely arbitrary restriction.

Good point.  I will try to lift this restriction in v3.
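
For reference, the rounding in question is a single expression when
cluster_size is a power of two.  A sketch with a made-up helper name:

```python
def round_up_to_cluster(size, cluster_size):
    """Round size up to the next multiple of cluster_size (a power of two)."""
    return (size + cluster_size - 1) & ~(cluster_size - 1)
```

An implementation could keep the exact image_size in the header and apply
this only where it needs a whole number of clusters.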

> 
> > +* backing_filename and backing_fmt are both strings in (byte offset, byte 
> > size) form.  They are not NUL-terminated and do not have alignment 
> > constraints.
> 
> A description of the meaning of these strings is missing.

Update:
"The backing filename string is given in the
backing_filename_{offset,size} fields and may be an absolute path or
relative to the image file."
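
A sketch of how the string might be extracted and resolved, in Python.  The
function name, the UTF-8 decoding and the POSIX-style path handling are my
assumptions; the spec only says the string is given as (byte offset, byte
size) from the start of the header and is not NUL-terminated.

```python
import posixpath


def backing_filename(header_bytes, offset, size, image_path):
    """Extract the backing filename and resolve it against the image file."""
    raw = header_bytes[offset:offset + size]   # not NUL-terminated
    if len(raw) != size:
        raise ValueError("backing filename extends past the header")
    name = raw.decode("utf-8")                 # assumed encoding
    if posixpath.isabs(name):
        return name
    # A relative name is interpreted relative to the image file's directory.
    return posixpath.join(posixpath.dirname(image_path), name)
```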

> 
> > +
> > +Feature bits:
> > +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> > +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> > +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file 
> > format stored.
> 
> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
> twice is confusing at first sight.

Updated, thanks.
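
The difference between the two bitmaps comes down to one check at open time.
A sketch, assuming an implementation that knows only the two feature bits
listed in this version of the spec; KNOWN_FEATURES and can_open are
hypothetical names.

```python
QED_F_BACKING_FILE = 0x01
QED_F_NEED_CHECK = 0x02
KNOWN_FEATURES = QED_F_BACKING_FILE | QED_F_NEED_CHECK


def can_open(features, compat_features):
    """Unknown feature bits are fatal; unknown compat bits are ignored."""
    return (features & ~KNOWN_FEATURES) == 0
```

compat_features is deliberately unused in the check: an image with unknown
compat bits must still open.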

> 
> > +
> > +==Tables==
> > +
> > +Tables provide the translation from logical offsets in the block device to 
> > cluster offsets in the file.
> > +
> > + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
> > +  
> > + Table {
> > +     uint64_t offsets[TABLE_NOFFSETS];
> > + }
> > +
> > +The tables are organized as follows:
> > +
> > +                    +----------+
> > +                    | L1 table |
> > +                    +----------+
> > +               ,------'  |  '------.
> > +          +----------+   |    +----------+
> > +          | L2 table |  ...   | L2 table |
> > +          +----------+        +----------+
> > +      ,------'  |  '------.
> > + +----------+   |    +----------+
> > + |   Data   |  ...   |   Data   |
> > + +----------+        +----------+
> > +
> > +A table is made up of one or more contiguous clusters.  The table_size 
> > header field determines table size for an image file.  For example, 
> > cluster_size=64 KB and table_size=4 results in 256 KB tables.
> > +
> > +The logical image size must be less than or equal to the maximum possible 
> > size of clusters rooted by the L1 table:
> > + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
> > +
> > +Logical offsets are translated into cluster offsets as follows:
> > +
> > +  table_bits table_bits    cluster_bits
> > +  <--------> <--------> <--------------->
> > + +----------+----------+-----------------+
> > + | L1 index | L2 index |     byte offset |
> > + +----------+----------+-----------------+
> > + 
> > +       Structure of a logical offset
> > +
> > + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
> > +   l2_offset = l1_table[l1_index]
> > +   l2_table = load_table(l2_offset)
> > +   cluster_offset = l2_table[l2_index]
> > +   return cluster_offset + byte_offset
> 
> Should we reserve some bits in the table entries in case we need some
> flags later? Also, I suppose all table entries must be cluster aligned?

Yes, let's do that.  We need at least one bit for sparse zero cluster
tracking.  With cluster-aligned table entries, the minimum 4k cluster size
gives us the low 12 bits to play with.
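
A sketch of how a table entry could be split once the low bits are reserved.
The flag name and value are purely hypothetical, nothing like it is in the
spec yet; the 12-bit mask follows from the 4k minimum cluster size.

```python
MIN_CLUSTER_SIZE = 1 << 12
FLAG_MASK = MIN_CLUSTER_SIZE - 1   # low 12 bits of a cluster-aligned entry
QED_ENTRY_ZERO = 0x1               # hypothetical flag, not in the spec


def split_entry(entry):
    """Return (cluster offset, flag bits) for a 64-bit table entry."""
    return entry & ~FLAG_MASK, entry & FLAG_MASK
```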

> 
> What happened to the other sections that older versions of the spec
> contained? For example, this version doesn't specify any more what the
> semantics of unallocated clusters and backing files are.

I removed them because they don't describe the on-disk layout and were
more of a way to think through the implementation than a format
specification.  It was more a decision to focus my effort on improving
the on-disk layout specification than anything else.

Do you want the semantics in the specification, or is it okay to leave
that part on the wiki only?

Stefan
