On Fri, 09/06 10:45, Kevin Wolf wrote: > Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben: > > On Tue, 09/03 18:24, Benoît Canet wrote: > > > > > > Hello list, > > > > > > I am thinking about QEMU block filters lately. > > > > > > I am not a block.c/blockdev.c expert so tell me what you think of the > > > following. > > > > > > The use cases I see would be: > > > > > > -$user want to have some real cryptography on top of qcow2/qed or another > > > format. > > > snapshots and other block features should continue to work > > > > > > -$user want to use a raid like feature like QUORUM in QEMU. > > > other features should continue to work > > > > > > -$user want to use the future SSD deduplication implementation with > > > metadata on > > > SSD and data on spinning disks. > > > other features should continue to work > > > > > > -$user want to I/O throttle one drive of his vm. > > > > > > -$user want to do Copy On Read > > > > > > -$user want to do a combination of the above > > > > > > -$developer want to make the minimum of required steps to keep changes > > > small > > > > > > -$developer want to keep user interface changes for later > > > > > > Lets take a example case of an user wanting to do I/O throttled encrypted > > > QUORUM > > > on top of QCOW2. > > > > > > Assuming we want to implement throttle and encryption as something > > > remotely > > > being like a block filter this makes a pretty complex BlockDriverState > > > tree. > > > > > > The tree would look like the following: > > > > > > I/O throttling BlockDriverState (bs) > > > | > > > | > > > | > > > | > > > Encryption BlockDriverState (bs) > > > | > > > | > > > | > > > | > > > Quorum BlockDriverState (bs) > > > / | \ > > > / | \ > > > / | \ > > > / | \ > > > QCOW2 bs QCOW2 b s QCOW2 bs > > > | | | > > > | | | > > > | | | > > > | | | > > > RAW bs RAW bs RAW bs > > > > > > An external snapshot should result in a tree like the following. > > > I/O throttling BlockDriverState (bs) > > > | > > > | > > > | > > > | > > > Encryption BlockDriverState (bs) > > > | > > > | > > > | > > > | > > > Quorum BlockDriverState (bs) > > > / | \ > > > / | \ > > > / | \ > > > / | \ > > > QCOW2 bs QCOW2 bs QCOW2 bs > > > | | | > > > | | | > > > | | | > > > | | | > > > QCOW2 bs QCOW2 bs QCOW2 bs > > > | | | > > > | | | > > > | | | > > > | | | > > > RAW bs RAW bs RAW bs > > > > > > In the current state of QEMU we can code some block drivers to implement > > > this > > > tree. > > > > > > However when doing operations like snapshots blockdev.c would have no > > > real idea > > > of what should be snapshotted and how. (The 3 top bs should be kept on > > > top) > > > > > > Moreover it would have no way to manipulate easily this tree of > > > BlockDriverState > > > has each one is encapsulated in it's parent. > > > > > > Also there no generic way to tell the block layer that two or more > > > BlockDriverState > > > are siblings. > > > > > > The current mail is here to propose some additionals structures in order > > > to cope > > > with these problems. > > > > > > The overall strategy of the proposed structures is to push out the > > > BlockDriverStates relationships out of each BlockDriverState. > > > > > > The idea is that it would make it easier for the block layer to > > > manipulate a > > > well known structure instead of being forced to enter into each > > > BlockDriverState > > > specificity. > > > > > > The first structure is the BlockStackNode. > > > > > > The BlockStateNode would be used to represent the relationship between the > > > various BlockDriverStates > > > > > > struct BlockStackNode { > > > BlockDriverState *bs; /* the BlockDriverState holded by this node */ > > > > > > /* this doubly linked list entry points to the child node and the > > > parent > > > * node > > > */ > > > QLIST_ENTRY(BlockStateNode) down; > > > > > > /* This doubly linked list entry point to the siblings of this node > > > */ > > > QLIST_ENTRY(BlockStateNode) siblings; > > > > > > /* a hash or an array of the sibbling of this node for fast access > > > * should be recomputed when updating the tree */ > > > QHASH_ENTRY<BlockStateNode, index> sibblings_hash; > > > } > > > > > > The BlockBackend would be the structure used to hold the "drive" the > > > guest use. > > > > > > struct BlockBackend { > > > /* the following doubly linked list header point to the top > > > BlockStackNode > > > * in our case it's the one containing the I/O throttling bs > > > */ > > > QLIST_HEAD(, BlockStateNode) block_stack_head; > > > /* this is a pointer to the topest node below the block filter chain > > > * in our case the first QCOW2 sibling > > > */ > > > BlockStackNode *top_node_below_filters; > > > } > > > > > > > > > Updated diagram: > > > > > > (Here bsn means BlockStacknode) > > > > > > ------------------------BlockBackend > > > | | > > > | block_stack_head > > > | | > > > | | > > > | I/O throttling BlockStackNode (contains it's > > > bs) > > > | | > > > | down > > > | | > > > | | > > > top_node_below_filter Encryption BlockStacknode (contains it's bs) > > > | | > > > | down > > > | | > > > | | > > > | Quorum BlockStackNode (contain's it's bs) > > > | / > > > | down > > > | / > > > | / S S > > > ------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn > > > contains a bs) > > > | b | b | > > > down l down l down > > > | i | i | > > > | n | n | > > > | g | g | > > > | s | s | > > > | | | > > > RAW bsn RAW bsn RAW bsn (each bsn contains > > > a bs) > > > > > > > > > Block driver point of view: > > > > > > to construct the tree each BlockDriver would have some utility functions > > > looking > > > like. > > > > > > bdrv_register_child_bs(bs, child_bs, int index); > > > > > > multiples calls to this function could be done to register multiple > > > siblings > > > childs identified by their index. > > > > > > This way something like quorum could register multiple QCOW2 instances. > > > > > > driver would have a > > > BlockDriverSTate *bdrv_access_child(bs, int index); > > > > > > to access their childs. > > > > > > These functions can be implemented without the driver knowing about > > > BlockStateNodes using container_of. > > > > > > blockdev point of view: (here I need your help) > > > > > > When doing a snapshot blockdev.c would access > > > BlockBackend->top_node_below_filter and make a snapshot of the bs > > > contained in > > > this node and it's sibblings. > > > > > Since BlockDriver.bdrv_snapshot_create() is an optional operation, > > blockdev.c > > can navigate down the tree from top node, until hitting some layer where > > the op > > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter > > pointer. > > Is it even inherent to a block driver (like a filter), if a snapshot is > to be taken at its level? Or is it rather a policy decision that should > be made by the user? > OK, getting the point that user should have full flexibility and fine operation granularity. It also stands against block_backend->top_node_below_filter. Do we really have the assumption that all the filters are on top of the tree and linear? Shouldn't this be possible?
Block Backend | | Quodrum BDS / | \ iot filter | \ / | \ qcow2 qcow2 qcow2 So we throttle only a particular image, not the whole device. But this will make a top_node_below_filter pointer impossible. > In our example, the quorum driver, it's not at all clear to me that you > want to snapshot all children. In order to roll back to a previous > state, one snapshot is enough, you don't need multiple copies of the > same one. Perhaps you want two so that we can still compare them for > verification. Or all of them because you can afford the disk space and > want ultimate safety. I don't think qemu can know which one is true. > Only if quorum ever knows about and operates on snapshots, it should be considered specifically, but no. So we need to achieve this in the general design: allow user to take snapshot, or set throttle limits on particular BDSes, as above graph. > In the same way, in a typical case you may want to keep I/O throttling > for the whole drive, including the new snapshot. But what if the > throttling was used in order to not overload the network where the image > is stored, and you're now doing a local snapshot, to which you want to > stream the image? The I/O throttling should apply only to the backing > file, not the new snapshot. > Yes, and OTOH, throttling really suits to be a filter only if it can be a non top one, otherwise it's no better than what we have now. > So perhaps what we really need is a more flexible snapshot/BDS tree > manipulation command that describes in detail which structure you want > to have in the end. > Fam