Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

Avi Kivity Tue, 12 Oct 2010 03:26:05 -0700

 On 10/11/2010 06:10 PM, Anthony Liguori wrote:

On 10/11/2010 11:02 AM, Avi Kivity wrote:
 On 10/11/2010 05:49 PM, Anthony Liguori wrote:
On 10/11/2010 09:58 AM, Avi Kivity wrote:
A leak is unacceptable. It means an image can grow to an unbounded size. If you are a server provider offering multitenancy, then a malicious guest can potentially grow the image beyond its allotted size, causing a Denial of Service attack against another tenant.
This particular leak cannot grow, and is not controlled by the guest.
As the image gets moved from hypervisor to hypervisor, it can keep growing if given a chance to fill up the disk, then trim it all away.
In a mixed hypervisor environment, it just becomes a numbers game.
I don't see how it can grow. Both the freelist and the clusters it points to consume space, which becomes a leak once you move it to a hypervisor that doesn't understand the freelist. The older hypervisor then allocates new blocks. As soon as it performs a metadata scan (if ever), the freelist is reclaimed.
Assume you don't ever do a metadata scan (which is really our design point).


What about crashes?

If you move to a hypervisor that doesn't support it, then move to a hypervisor that does, you create a brand new freelist and start leaking more space. This isn't a contrived scenario if you have a cloud environment with a mix of hosts.


It's only a leak if you don't do a metadata scan.

You might not be able to get a ping-pong every time you provision, but with enough effort, you could create serious problems.
It's really an issue of correctness. Making correctness trade-offs for the purpose of compatibility is a policy decision and not something we should bake into an image format. If a tool feels strongly that it's a reasonable trade-off to make, it can always fudge the feature bits itself.

I think the effort here is reasonable; clearing a bit on startup is not that complicated.

A potential solution here is to treat TRIM a little differently than we've been discussing.
When TRIM happens, don't immediately write an unallocated cluster entry (UCE) for the L2. Leave the L2 entry intact. Don't actually write a UCE to the L2 until you actually allocate the block.
This implies a cost because you'll need to do metadata syncs to make this work. However, that eliminates leakage.
The information is lost on shutdown; and you can have a large number of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user expecting a visit from RIAA).
A slight twist on your proposal is to have an allocated-but-may-drop bit in a L2 entry. TRIM or zero detection sets the bit (leaving the cluster number intact). A following write to the cluster needs to clear the bit; if we reallocate the cluster we need to replace it with a ZCE.
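The allocated-but-may-drop idea can be sketched roughly as below. This is only an illustration of the three transitions described (TRIM sets the bit, a write clears it, reallocation leaves a ZCE behind); the bit positions, the `L2Table` class, and its method names are all made up for the example and are not taken from the QED specification or from QEMU.

```python
# Hypothetical L2 entry layout: flag bits in the low bits, cluster offset above.
# None of these names or bit positions come from the QED spec.
MAY_DROP = 1 << 0        # allocated, but the data may be dropped
ZERO_CLUSTER = 1 << 1    # stand-in for a zero cluster entry (ZCE)
OFFSET_MASK = ~0xFFF & 0xFFFFFFFFFFFFFFFF


class L2Table:
    def __init__(self):
        self.entries = {}    # guest cluster index -> raw L2 entry

    def map(self, index, offset):
        """Point a guest cluster at an allocated cluster in the image file."""
        self.entries[index] = offset & OFFSET_MASK

    def trim(self, index):
        """TRIM or zero detection: set the flag, keep the cluster number."""
        entry = self.entries.get(index)
        if entry is not None and not entry & ZERO_CLUSTER:
            self.entries[index] = entry | MAY_DROP

    def write(self, index):
        """A guest write makes the data live again, so clear the flag.

        Allocating a new cluster for a write to a ZCE is not modeled here.
        """
        if index in self.entries:
            self.entries[index] &= ~MAY_DROP

    def reclaim(self, index):
        """Reuse the backing cluster elsewhere; the entry becomes a ZCE."""
        entry = self.entries[index]
        if not entry & MAY_DROP:
            raise ValueError("cluster is still in use")
        self.entries[index] = ZERO_CLUSTER
        return entry & OFFSET_MASK
```

Because the cluster number stays in the entry until `reclaim` runs, losing any separate free list only loses a hint; the space can still be found again from the L2 tables.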
Yeah, this is sort of what I was thinking. You would still want a free list but it becomes totally optional because if it's lost, no data is leaked (assuming that the older version understands the bit).
I was suggesting that we store that bit in the free list though because that lets us support having older QEMUs with absolutely no knowledge still work.

It doesn't - on rewrite an old qemu won't clear the bit, so a newer qemu would think it's still free.

The autoclear bit solves it nicely - the old qemu automatically drops the allocated-but-may-drop bits, undoing any TRIMs (which is unfortunate) but preserving consistency.
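One way to read the autoclear mechanism is sketched below, under assumptions: an implementation clears every autoclear feature bit it does not recognise when it opens the image for writing, and a newer implementation that finds its bit cleared treats all may-drop flags as stale. The header dictionary, the bit values, and `open_image` are hypothetical names for the example, not the QED header layout or a QEMU function.

```python
# Hypothetical values; not taken from the QED spec.
AUTOCLEAR_MAY_DROP = 1 << 0   # header bit: "may-drop flags in L2 are valid"
MAY_DROP = 1 << 0             # per-L2-entry allocated-but-may-drop flag


def open_image(header, l2_entries, known_autoclear):
    """Open an image read-write as a qemu that knows `known_autoclear` bits."""
    # Every implementation clears the autoclear bits it does not recognise.
    header["autoclear"] &= known_autoclear

    if known_autoclear & AUTOCLEAR_MAY_DROP:
        if not header["autoclear"] & AUTOCLEAR_MAY_DROP:
            # An older qemu may have rewritten clusters without clearing
            # their flags, so no flag can be trusted: drop them all.
            # This undoes earlier TRIMs but keeps the image consistent.
            for i in range(len(l2_entries)):
                l2_entries[i] &= ~MAY_DROP
            header["autoclear"] |= AUTOCLEAR_MAY_DROP
```

In this sketch an old qemu is simply a caller with `known_autoclear=0`: it never touches the L2 flags, but clearing the header bit is enough for the next newer qemu to discard them.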




--
error compiling committee.c: too many arguments to function
