Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

Anthony Liguori Wed, 08 Sep 2010 05:48:31 -0700

On 09/08/2010 03:23 AM, Avi Kivity wrote:

 On 09/08/2010 01:27 AM, Anthony Liguori wrote:
FWIW, L2s are 256K at the moment and with a two-level table, it can support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today. Dropping to 128K tables would reduce it to 16TB and 64K tables would be 4TB.
Maybe we should do three levels then. Some users are bound to complain about 64TB.

That's just the default size. The table and cluster sizes are configurable. Without changing the cluster size, the image can support up to 1PB.
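
As a rough check of the sizes quoted above, here is a minimal sketch of the arithmetic. It assumes 8-byte table entries and 64K clusters, neither of which is spelled out in this message, and the 1M table size is only a guess at what yields the 1PB figure. The function name is made up for illustration.

```python
KIB = 1024
TIB = 1024 ** 4

ENTRY_SIZE = 8  # assumption: each table entry is a 64-bit offset


def max_image_size(table_size, cluster_size=64 * KIB):
    """Bytes addressable through a two-level (L1 -> L2) table layout."""
    entries = table_size // ENTRY_SIZE
    # Every L1 entry points at an L2 table; every L2 entry points at a cluster.
    return entries * entries * cluster_size


for table_kib in (64, 128, 256, 1024):
    size = max_image_size(table_kib * KIB)
    print(f"{table_kib:>4}K tables: {size // TIB} TB")
```

Under those assumptions this gives 4TB, 16TB and 64TB for 64K, 128K and 256K tables, and 1024TB (1PB) for 1M tables.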

BTW, I don't think your checksumming idea is sound. If you store a 64-bit checksum alongside each pointer, it becomes necessary to update the parent pointer every time the table changes. This introduces an ordering requirement which means you need to sync() the file every time you update an L2 entry.
Even worse, if the crash happens between an L2 update and an L1 checksum update, the entire cluster goes away. You really want allocate-on-write for this.
Today, we only need to sync() when we first allocate an L2 entry (because their locations never change). From a performance perspective, it's the difference between an fsync() every 64k vs. every 2GB.
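
To put the 64k vs. 2GB comparison in concrete terms, the sketch below uses a deliberately simple model: one sync per unit of newly written data, with 8-byte entries and 64K clusters assumed as before. The helper names are hypothetical and the model ignores everything except the sync interval.

```python
KIB = 1024
GIB = 1024 ** 3

ENTRY_SIZE = 8            # assumption: 64-bit table entries
CLUSTER_SIZE = 64 * KIB   # assumption: 64K clusters
TABLE_SIZE = 256 * KIB    # L2 table size mentioned earlier in the thread


def l2_coverage(table_size, cluster_size):
    """Bytes of guest data mapped by a single L2 table."""
    return (table_size // ENTRY_SIZE) * cluster_size


def syncs_needed(total_bytes, bytes_per_sync):
    """Syncs for a sequential write, one per interval (rounded up)."""
    return -(-total_bytes // bytes_per_sync)


per_table = l2_coverage(TABLE_SIZE, CLUSTER_SIZE)
print(per_table // GIB, "GB per L2 table")
print(syncs_needed(10 * GIB, CLUSTER_SIZE), "syncs when syncing every cluster")
print(syncs_needed(10 * GIB, per_table), "syncs when syncing every L2 table")
```

For a 10GB sequential write this model gives 163840 syncs at one per 64K cluster versus 5 at one per 2GB L2 table.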
Yup. From a correctness perspective, it's the difference between a corrupted filesystem on almost every crash and a corrupted filesystem in some very rare cases.

I'm not sure I understand your corruption comment. Are you claiming that without checksumming, you'll often get corruption or are you claiming that without checksums, if you don't sync metadata updates you'll get corruption?

qed is very careful about ensuring that we don't need to do syncs and we don't get corruption because of data loss. I don't necessarily buy your checksumming argument.

Plus, doesn't btrfs do block level checksumming? IOW, if you run a workload where you care about this level of data integrity validation, if you did btrfs + qed, you would be fine.
Or just btrfs by itself (use btrfs for snapshots and base images, use qemu-img convert for shipping).
Since the majority of file systems don't do metadata checksumming, it's not obvious to me that we should be.
The logic is that as data sizes increase, the probability of error increases.
I think one of the critical flaws in qcow2 was trying to invent a better filesystem within qemu instead of just sticking to a very simple and obviously correct format and letting the FS folks do the really fancy stuff.
Well, if we introduce a minimal format, we need to make sure it isn't too minimal.
I'm still not sold on the idea. What we're doing now is pushing the qcow2 complexity to users. We don't have to worry about refcounts now, but users have to worry whether the machine they're copying the image to supports qed or not.
The performance problems with qcow2 are solvable. If we preallocate clusters, the performance characteristics become essentially the same as qed.

By creating two code paths within qcow2. It's not just the reference counts, it's the lack of guaranteed alignment, compression, and some of the other poor decisions in the format.

If you have two code paths in qcow2, you have non-deterministic performance because users that do reasonable things with their images will end up getting catastrophically bad performance.

A new format doesn't introduce much additional complexity. We provide an image conversion tool and we can almost certainly provide an in-place conversion tool that makes the process very fast.


Regards,

Anthony Liguori
