Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

Stefan Hajnoczi Wed, 08 Sep 2010 04:15:32 -0700

Here is a summary of how qed images can be accessed safely after a
crash or power loss.


First off, we only need to consider write operations since read
operations do not change the state of the image file and cannot lead
to metadata corruption.

There are two types of writes: allocating writes, which are necessary
when no cluster has been allocated for this logical block, and in-place
writes, which happen when a cluster has previously been allocated.

In-place writes overwrite old data in the image file.  They do not
allocate new clusters or update any metadata.  This is why write
performance is comparable to raw in the long run.  Once you've done
the hard work of allocating a cluster you can write and re-write its
sectors because the cluster stays put.  The failure scenario here is
the same as for a raw image: power loss means that data may or may not
be written to disk and perhaps not all sectors were written.  It is up
to the guest to handle recovery and the qed metadata has not been
corrupted.

Allocating writes fall into two cases:
1. There is no existing L2 table to link the data cluster into.
Allocate and write the data cluster, allocate an L2 table, link up the
data cluster in the L2 table, fsync(), and link up the L2 table in the
L1 table.  Notice that the fsync() between the L2 update and the L1
update ensures that the L1 table always points to a complete L2 table.

2. There is an existing L2 table to link the data cluster into.
Allocate and write the data cluster, link up the data cluster in the
L2 table.  Notice that there is no flush operation between writing the
data and updating the metadata.
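
To make the two orderings concrete, here is a rough sketch in Python.
It is only a model, not QEMU code: the function name
allocating_write_steps and the step strings are made up for
illustration and simply record the order of operations described
above.

```python
def allocating_write_steps(l2_table_exists):
    """Return the ordered I/O steps for an allocating write.

    Hypothetical model of the two cases described above; the step
    names are illustrative and do not correspond to QEMU functions.
    """
    steps = ["allocate data cluster", "write data cluster"]
    if not l2_table_exists:
        # Case 1: the new L2 table must be complete on disk before
        # the L1 table is allowed to point at it, hence the fsync.
        steps += [
            "allocate L2 table",
            "link data cluster in L2 table",
            "fsync",
            "link L2 table in L1 table",
        ]
    else:
        # Case 2: no flush between the data write and the L2 update,
        # so the two may reach the disk in either order.
        steps.append("link data cluster in L2 table")
    return steps


if __name__ == "__main__":
    for exists in (False, True):
        print("L2 table exists:", exists)
        for step in allocating_write_steps(exists):
            print("   ", step)
```

The only ordering point in either sequence is the fsync in case 1;
case 2 has none, which is what makes the crash scenarios below
possible.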

Since there is no ordering imposed between the data write and metadata
update, the following scenarios may occur on crash:
1. Neither the data write nor the metadata update reaches the disk.
This is fine: the qed metadata has not been corrupted.

2. Data reaches disk but metadata update does not.  We have leaked a
cluster but not corrupted metadata.  Leaked clusters can be detected
with qemu-img check.  Note that if the file size is not a multiple of
the cluster size, then the file size is rounded down to a multiple of
the cluster size.  That means the next cluster allocation will claim
the partial write at the end of the file.

3. Metadata update reaches disk but data does not.  The interesting
case!  The L2 table now points to a cluster which is beyond the last
cluster in the image file.  Remember that the file size is rounded down
to a multiple of the cluster size, so partial data writes are discarded
and this case applies.

Now we're in trouble.  The image cannot be accessed without some
sanity checking because not only do table entries point to invalid
clusters, but new allocating writes might make previously invalid
cluster offsets valid again (then there would be two or more table
entries pointing to the same cluster)!

Anthony's suggestion is to use a "mounted" or "dirty" bit in the qed
header to detect a crashed image when opening the image file.  If no
crash has occurred, then the mounted bit is unset and normal operation
is safe.  If the mounted bit is set, then a check of the L1/L2 tables
must be performed and any invalid cluster offsets must be cleared to
zero.  When an invalid cluster is cleared to zero, we arrive back at
case 1 above: neither data write nor metadata update reached the disk,
and we are in a safe state.

4. Both data and metadata reach disk.  No problem.
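
The recovery step for case 3 could look roughly like the sketch
below.  It rests on assumptions that are mine and not part of the
proposal: a table is modelled as a Python list of byte offsets, 0
means "unallocated", and the names usable_file_size,
clear_invalid_offsets and open_image are hypothetical.  It only
checks for offsets beyond the rounded-down file size; a real check
may need to validate more than that.

```python
def usable_file_size(file_size, cluster_size):
    """Round the file size down to a multiple of the cluster size."""
    return file_size - (file_size % cluster_size)


def clear_invalid_offsets(table, file_size, cluster_size):
    """Zero every entry that points beyond the last complete cluster.

    `table` is a list of byte offsets where 0 means "unallocated"
    (an assumption of this sketch).  Returns the number of entries
    that were cleared.
    """
    limit = usable_file_size(file_size, cluster_size)
    cleared = 0
    for i, offset in enumerate(table):
        # An entry is valid only if the whole cluster it points to
        # lies within the rounded-down file size.
        if offset != 0 and offset + cluster_size > limit:
            table[i] = 0
            cleared += 1
    return cleared


def open_image(dirty, tables, file_size, cluster_size):
    """Run the check only if the image was not closed cleanly.

    `dirty` stands in for the "mounted"/"dirty" header bit.
    """
    if not dirty:
        return 0
    return sum(clear_invalid_offsets(t, file_size, cluster_size)
               for t in tables)


if __name__ == "__main__":
    cluster = 65536
    size = 3 * cluster + 100          # partial write at end of file
    l2 = [0, cluster, 2 * cluster, 3 * cluster]
    print(open_image(True, [l2], size, cluster), l2)
```

In the example the last entry points at the partially written
cluster, so it is the one that gets cleared, which takes the image
back to case 1.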

Have I missed anything?

Stefan
