Here is a summary of how qed images can be accessed safely after a crash or power loss.

First off, we only need to consider write operations since read operations do not change the state of the image file and cannot lead to metadata corruption.

There are two types of writes: allocating writes, which are necessary when no cluster has been allocated for this logical block, and in-place writes, which happen when a cluster has previously been allocated.

In-place writes overwrite old data in the image file. They do not allocate new clusters or update any metadata. This is why write performance is comparable to raw in the long run. Once you've done the hard work of allocating a cluster you can write and re-write its sectors because the cluster stays put.

The failure scenario here is the same as for a raw image: power loss means that data may or may not be written to disk and perhaps not all sectors were written. It is up to the guest to handle recovery and the qed metadata has not been corrupted.

Allocating writes fall into two cases:

1. There is no existing L2 table to link the data cluster into. Allocate and write the data cluster, allocate an L2 table, link up the data cluster in the L2 table, fsync(), and link up the L2 table in the L1 table. Notice the fsync() between the L2 update and L1 update ensures that the L1 table always points to a complete L2 table.

2. There is an existing L2 table to link the data cluster into. Allocate and write the data cluster, link up the data cluster in the L2 table. Notice that there is no flush operation between writing the data and updating the metadata.

Since there is no ordering imposed between the data write and metadata update, the following scenarios may occur on crash:

1. Neither data write nor metadata update reach the disk. This is fine: qed metadata has not been corrupted.

2. Data reaches disk but metadata update does not. We have leaked a cluster but not corrupted metadata. Leaked clusters can be detected with qemu-img check.
   Note that if the file size is not a multiple of the cluster size, then the file size is rounded down to a multiple of the cluster size. That means the next cluster allocation will claim the partial write at the end of the file.

3. Metadata update reaches disk but data does not. The interesting case! The L2 table now points to a cluster which is beyond the last cluster in the image file. Remember that the file size is rounded down to a multiple of the cluster size, so partial data writes are discarded and this case applies.

   Now we're in trouble. The image cannot be accessed without some sanity checking because not only do table entries point to invalid clusters, but new allocating writes might make previously invalid cluster offsets valid again (then there would be two or more table entries pointing to the same cluster)!

   Anthony's suggestion is to use a "mounted" or "dirty" bit in the qed header to detect a crashed image when opening the image file. If no crash has occurred, then the mounted bit is unset and normal operation is safe. If the mounted bit is set, then a check of the L1/L2 tables must be performed and any invalid cluster offsets must be cleared to zero. When an invalid cluster offset is cleared to zero, we arrive back at case 1 above: neither data write nor metadata update reached the disk, and we are in a safe state.

4. Both data and metadata reach disk. No problem.

Have I missed anything?

Stefan
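
For illustration, the dirty-bit check described in scenario 3 could be sketched roughly as below. This is a toy in-memory model in Python, not qed code: the names usable_size, is_valid_offset, repair_tables and open_image, the header dictionary, and the representation of the tables as plain lists are all assumptions made for the example. An offset of zero is taken to mean "unallocated", and an offset is treated as invalid if it is not cluster-aligned or if its cluster does not fit inside the file size rounded down to a multiple of the cluster size.

```python
def usable_size(file_size, cluster_size):
    """Round the file size down to a whole number of clusters."""
    return file_size - (file_size % cluster_size)


def is_valid_offset(offset, file_size, cluster_size):
    """An allocated offset must be cluster-aligned and lie fully inside the file."""
    if offset % cluster_size != 0:
        return False
    return offset + cluster_size <= usable_size(file_size, cluster_size)


def repair_tables(l1_table, l2_tables, file_size, cluster_size):
    """Clear invalid L1/L2 entries to zero and return how many were cleared.

    l1_table is a list of L2 table offsets; l2_tables maps an L2 table
    offset to its list of data cluster offsets (a hypothetical layout).
    """
    cleared = 0
    for i, l2_offset in enumerate(l1_table):
        if l2_offset == 0:
            continue  # no L2 table allocated for this slot
        if (not is_valid_offset(l2_offset, file_size, cluster_size)
                or l2_offset not in l2_tables):
            l1_table[i] = 0
            cleared += 1
            continue
        l2 = l2_tables[l2_offset]
        for j, data_offset in enumerate(l2):
            if data_offset != 0 and not is_valid_offset(
                    data_offset, file_size, cluster_size):
                l2[j] = 0  # back to "neither data nor metadata reached disk"
                cleared += 1
    return cleared


def open_image(header, l1_table, l2_tables, file_size, cluster_size):
    """Run the table check only if the header says the image was not closed cleanly."""
    if not header["dirty"]:
        return 0
    return repair_tables(l1_table, l2_tables, file_size, cluster_size)
```

With a 65536-byte cluster size (an example value) and a file of 4 * 65536 + 100 bytes, the trailing 100 bytes are ignored, so an L2 entry of 262144 points beyond the last complete cluster and is cleared, while entries 131072 and 196608 are left alone.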