Re: RFC: Qemu backup interface plans

Vladimir Sementsov-Ogievskiy Tue, 18 May 2021 23:12:04 -0700

18.05.2021 19:39, Max Reitz wrote:

Hi,


Your proposal sounds good to me in general.  Small, independent building 
blocks: that seems to make sense to me.


Thanks! I hope it's not too difficult to read and understand my English.



On 17.05.21 14:07, Vladimir Sementsov-Ogievskiy wrote:

[...]

What we lack in this scheme:

1. handling dirty bitmap in backup-top filter: backup-top does a 
copy-before-write operation on any guest write, when actually we are 
interested only in "dirty" regions for incremental backup.

A probable solution would be to allow specifying a bitmap for the sync=none 
mode of backup, but I think what I propose below is better.

2. [actually it's a tricky part of 1]: the possibility to skip 
copy-before-write operations for regions that were already copied to the final 
backup. With a normal Qemu backup job, this is achieved by the fact that the 
block-copy state with its internal bitmap is shared between the backup job and 
the copy-before-write filter.

3. Not a real problem, but a fact: the backup block-job does nothing in this 
scheme, the whole job is done by the filter. So, it would be interesting to 
have a possibility to simply insert/remove the filter, and avoid block-job 
creation and management altogether for external backup. (And I'd like to send 
another RFC on how to insert/remove filters, let's not discuss it here.)
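
A minimal sketch of points 1 and 2, assuming a toy cluster-granular model: the 
filter copies old data only for clusters whose bit is set, and a consumer can 
clear bits for clusters it has already saved. ToyCbwFilter and mark_copied are 
made-up names for illustration, not QEMU code.

```python
class ToyCbwFilter:
    """Toy copy-before-write filter working on whole clusters."""

    def __init__(self, source, temp, bitmap):
        self.source = source       # dict: cluster index -> data (active disk)
        self.temp = temp           # dict: cluster index -> preserved old data
        self.bitmap = set(bitmap)  # clusters that still must be preserved

    def guest_write(self, cluster, data):
        # Copy the old data only if the cluster is still of interest.
        if cluster in self.bitmap:
            self.temp[cluster] = self.source.get(cluster, b"\0")
            self.bitmap.discard(cluster)
        self.source[cluster] = data

    def mark_copied(self, cluster):
        # The backup consumer already saved this cluster: no CBW needed later.
        self.bitmap.discard(cluster)


source = {0: b"a", 1: b"b", 2: b"c"}
temp = {}
cbw = ToyCbwFilter(source, temp, bitmap={0, 1})  # cluster 2 is not dirty
cbw.mark_copied(1)                               # consumer copied cluster 1
for c in (0, 1, 2):
    cbw.guest_write(c, b"new")
```

Only cluster 0 ends up in temp here: cluster 1 was already copied by the 
consumer and cluster 2 was never dirty.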


Next. Think about internal backup. It has one drawback too:
4. If target is remote with slow connection, copy-before-write operations will 
slow down guest writes appreciably.

It may be solved with the help of image fleecing: we create a temporary qcow2 
image, set up the fleecing scheme, and instead of exporting the temp image 
through NBD we start a second backup with source = temporary image and 
target = the real backup target (NBD for example).


How would a second backup work here?  Wouldn’t one want a mirror job to copy 
the data off to the real target?

(Because I see backup as something intrinsically synchronous, whereas mirror by 
default is rather lazy.)

[Note from future me where I read more below: I see you acknowledge that you’ll 
need to modify backup to do what you need here, i.e. not do any CBW operations. 
 So it’s effectively the same as a mirror that ignores new dirty areas.  Which 
could work without changing mirror if block-copy were to set 
BDRV_REQ_WRITE_UNCHANGED for the fleecing case, and bdrv_co_write_req_finish() 
would skip bdrv_set_dirty() for such writes.]


I just feel closer to the backup block-job than to mirror :) In the end, yes, 
there is no real difference in interface. But in implementation, I prefer to 
continue developing block-copy. I hope that eventually all jobs and img-convert 
will work through block-copy.

(and I'll need BDRV_REQ_WRITE_UNCHANGED anyway for fleecing, so user can use 
mirror or backup)


I mean, that still has the problem that the mirror job can’t tell the CBW filter 
which areas are already copied off and so don’t need to be preserved anymore, 
but...

Still, with such a solution the same problems [1,2] remain, and 3 becomes worse:


Not sure how 3 can become worse when you said above it isn’t a real problem (to 
which I agree).


It's my perfectionism :) Yes, it still isn't a problem, but the number of extra 
user-visible objects in the architecture increases, which is not good, I think.

5. We'll have two jobs and two automatically inserted filters, when actually 
one filter and one job are enough (as first job is needed only to insert a 
filter, second job doesn't need a filter at all).

Note also that this (starting two backup jobs to make a push backup with 
fleecing) doesn't work now: op-blockers will prevent it. It's simple to fix 
(and in Virtuozzo we live with a downstream-only patch, which allows push backup 
with fleecing, based on starting two backup jobs). But I never sent a patch, 
as I have a better plan, which will solve all the listed problems.


So, what I propose:

1. We make backup-top filter public, so that it could be inserted/removed where 
user wants through QMP (how to properly insert/remove filter I'll post another 
RFC, as backup-top is not the only filter that can be usefully inserted 
somewhere). For this first step I've sent a series today:

   subject: [PATCH 00/21] block: publish backup-top filter
   id: <20210517064428.16223-1-vsement...@virtuozzo.com>
   patchew: 
https://patchew.org/QEMU/20210517064428.16223-1-vsement...@virtuozzo.com/

(note that one of the things in this series is the rename 
s/backup-top/copy-before-write/; still, I call it backup-top in this letter)

This solves [3]. [4, 5] are solved partly: we still have one extra filter, 
created by the backup block job, and also I didn't test whether this works; 
probably some op-blockers or permissions should be tuned. So, let it be step 2:

2. Test that we can start a backup job with source = (target of backup-top 
filter), so that we have "push backup with fleecing". Make an option for backup 
to start without a filter when we don't need copy-before-write operations, so 
as not to create a superfluous filter.


OK, so the backup job is not really a backup job, but just anything that copies 
data.


Not quite. For backup without a filter we should protect source from changing, 
by unsharing WRITE permission on it.

I'll try to avoid adding an option. The logic should work like in the commit 
job: if there are current writers on the source, we create a filter. If there 
are no writers, we just unshare writes and go without a filter. And for this, 
the copy-before-write filter should be able to do WRITE_UNCHANGED in case of 
fleecing.

3. Support bitmap in backup-top filter, to solve [1]

3.1 and make it possible to modify the bitmap externally, so that the consumer 
of fleecing can say to the backup-top filter: "I've already copied these 
blocks, don't bother with copying them to the temp image". This is to solve [2].

Still, how will the consumer of fleecing reset the shared bitmap after copying 
blocks? I have the following idea: we make a "discard-bitmap-filter" filter 
driver that owns some bitmap and, on a discard request, unsets the 
corresponding bits. Also, a read from a region with unset bits returns EINVAL 
immediately. This way both consumers (backup job and NBD client) are able to 
use this interface:


Sounds almost like a 'bitmap' protocol block driver that, given some dirty 
bitmap, basically just represents that bitmap as a block device. *shrug*

Anyway.  I think I’m wrong, it’s something very different, and that’s clear 
when I turn your proposal around:  What this “filter” would do primarily is to 
restrict access to its filtered node based on the bitmap given to it, i.e. only 
dirty areas can be read.  (I say “filter” because I’m not sure it’s a filter if 
it restricts the data that can be read.) Secondarily, the bitmap can be cleared 
by sending discards.


What if we rethink filters as "drivers that do not contain any data, so a 
filter can always be removed from the chain without losing any data"? And allow 
filters to restrict access to data areas. Or even change the data (and then the 
raw format becomes a filter).

Note that the backup-top filter may already restrict access: if a 
copy-before-write operation fails, the filter doesn't propagate the write 
operation to the file child but returns an error to the guest.


You know what, that would allow implementing backing files for formats that don’t 
support it.  Like, the overlay and the backing file are both children of a FIFO 
quorum node, where the overlay has the bitmap filter on top, and is the first 
child.  If writing to the bitmap filter then also marks the bitmap dirty there 
(which it logically should, I think)...  Don’t know if that’s good or not. :)


That's interesting. I've never heard of anyone requesting such a feature so 
far. But it may influence what to call this filter. Maybe not 
discard-bitmap-filter, but just bitmap-filter. And we can add different options 
to set up how the filter will handle requests.

For example, I need the following:

+-------+-------------------+---------------------------+-------------------+
|       | read              | write                     | discard           |
+-------+-------------------+---------------------------+-------------------+
| clear | EINVAL            | propagate to file         | EINVAL            |
|       |                   | mark dirty                |                   |
|       |                   | (or may be better EINVAL) |                   |
+-------+-------------------+---------------------------+-------------------+
| dirty | propagate to file | propagate to file         | propagate to file |
|       |                   | (or may be better EINVAL) | mark clean        |
+-------+-------------------+---------------------------+-------------------+
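
As a rough illustration of the table above, here is a toy model at cluster 
granularity. It is only a sketch: ToyBitmapFilter is a hypothetical name, the 
child node is a plain dict, and for writes it picks the "propagate to file, 
mark dirty" variant rather than EINVAL.

```python
import errno


class ToyBitmapFilter:
    """Toy model of the fleecing-consumer behavior from the table."""

    def __init__(self, child, dirty):
        self.child = child       # dict: cluster index -> data
        self.dirty = set(dirty)  # clusters whose bit is set

    def read(self, cluster):
        if cluster not in self.dirty:
            raise OSError(errno.EINVAL, "read from clear region")
        return self.child[cluster]          # propagate to file

    def write(self, cluster, data):
        self.child[cluster] = data          # propagate to file
        self.dirty.add(cluster)             # mark dirty (no-op if already set)

    def discard(self, cluster):
        if cluster not in self.dirty:
            raise OSError(errno.EINVAL, "discard of clear region")
        self.child.pop(cluster, None)       # propagate discard to file
        self.dirty.discard(cluster)         # mark clean


f = ToyBitmapFilter({0: b"x", 1: b"y"}, dirty={0})
data = f.read(0)      # allowed: cluster 0 is dirty
f.discard(0)          # consumer is done with cluster 0
try:
    f.read(0)
    second_read_errno = None
except OSError as e:
    second_read_errno = e.errno
```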

And to realize backing behavior:

+-------+----------------------+-------------------+--------------------+
|       | read                 | write             | discard            |
+-------+----------------------+-------------------+--------------------+
| clear | propagate to backing | propagate to file | maybe zeroize file |
|       |                      | mark dirty        | mark dirty         |
+-------+----------------------+-------------------+--------------------+
| dirty | propagate to file    | propagate to file | propagate to file  |
+-------+----------------------+-------------------+--------------------+
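
The backing behavior in this table can be sketched the same way. This is an 
assumption-laden toy: ToyBackingFilter is a made-up name, both children are 
dicts, unallocated clusters read as a zero byte, and the "maybe zeroize file" 
option is taken.

```python
class ToyBackingFilter:
    """Toy model: the bitmap decides whether file or backing is read."""

    def __init__(self, file, backing):
        self.file = file        # dict: cluster index -> data (overlay)
        self.backing = backing  # dict: cluster index -> data (backing file)
        self.dirty = set()      # clusters present in the overlay

    def read(self, cluster):
        src = self.file if cluster in self.dirty else self.backing
        return src.get(cluster, b"\0")

    def write(self, cluster, data):
        self.file[cluster] = data       # propagate to file
        self.dirty.add(cluster)         # mark dirty

    def discard(self, cluster):
        if cluster in self.dirty:
            self.file.pop(cluster, None)  # propagate to file
        else:
            self.file[cluster] = b"\0"    # zeroize file so backing is hidden
            self.dirty.add(cluster)       # mark dirty


overlay = ToyBackingFilter(file={}, backing={0: b"base0", 1: b"base1"})
before = overlay.read(0)       # comes from the backing file
overlay.write(0, b"new0")
after = overlay.read(0)        # now comes from the overlay
overlay.discard(1)
discarded = overlay.read(1)    # backing data is no longer visible
```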

Backup job can simply call discard on source, we can add an option for this.


Hm.  I would have expected the most straightforward solution would be to share 
the job’s (backup or mirror, doesn’t matter) dirty bitmap with the CBW node, so 
that the latter only copies what the former still considers dirty.  Is the 
bitmap filter really still necessary then?


Yes, I think the user given bitmap should be shared between all the consumers. 
Still, internal bitmaps of block-copy entities would be different:

For example, the second block-copy, which copies to the final target, clears 
bits at the start of an operation, and may set them back to 1 if the copy 
failed for that operation. If in the meantime the guest writes, the 
copy-before-write filter should do block-copy, and it doesn't matter that in 
the second block-copy the bits are zero.

Or another: the first block-copy (in the copy-before-write filter) sets bits of 
its internal bitmap to zero when a cluster is copied. But that bit has no 
relation to whether the second block-copy should copy this cluster or not.

Now the internal bitmap of block-copy is just initialized from the user-given 
bitmap, and then the user-given bitmap is unchanged (except for backup job 
finalization if the options say to transactionally update the user-given 
bitmap, but I think this mode is not for our case). We'll need a possibility to 
modify the user-given bitmap so that it influences block-copy. So block-copy 
will have to consider both bitmaps and make an AND of them. Or something like 
this. We'll see how I implement this :)
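
To make the "AND of both bitmaps" idea concrete, here is a tiny sketch using 
Python integers as bitmasks (bit i = cluster i). The helper names are invented 
for illustration, and the real implementation may well look different.

```python
def effective_dirty(internal, user):
    # A cluster needs copying only if both bitmaps still have its bit set.
    return internal & user


def cluster_indexes(mask):
    # List the indexes of the set bits, lowest first.
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


internal = 0b1111   # block-copy's own bitmap: nothing copied yet
user = 0b0110       # user-given bitmap: clusters 1 and 2 are dirty
first = cluster_indexes(effective_dirty(internal, user))

user &= ~0b0010     # external consumer says: cluster 1 is already copied
second = cluster_indexes(effective_dirty(internal, user))
```

After the external update only cluster 2 remains to be copied, although the 
internal bitmap of block-copy was not touched.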


Oh, I see, discarding also helps to save disk space.  Makes sense then.


Note also, that I want the fleecing scheme to work in the same way for 
push-backup-with-fleecing and for pull-backup, so that user will implement it 
once. Of course, block-copy could have simpler options than adding a filter and 
additional discard logic. But for external backup it seems the most 
straightforward solution. So, let's just reuse it for push-backup-with-fleecing.

External backup tool will send TRIM request after reading some region. This way 
disk space will be freed and no extra copy-before-write operations will be 
done. I also have a side idea that we can implement READ_ONCE flag, so that 
READ and TRIM can be done in one NBD command. But this works only for clients 
that don't want to implement any kind of retrying.
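
A sketch of the READ-then-TRIM loop such a tool might run, assuming a toy 
in-memory export. ToyFleecingExport, pull_backup and read_once are made-up 
names, not NBD or QEMU APIs; read_once only illustrates why a combined command 
leaves no room for retrying.

```python
class ToyFleecingExport:
    """Toy stand-in for the fleecing image exported to the backup tool."""

    def __init__(self, clusters):
        self.clusters = dict(clusters)  # cluster index -> preserved data

    def read(self, cluster):
        return self.clusters[cluster]

    def trim(self, cluster):
        # Frees space; this cluster no longer needs to be preserved.
        self.clusters.pop(cluster, None)

    def read_once(self, cluster):
        # Hypothetical READ_ONCE: read and trim in a single request.
        # The data is gone afterwards, so the client cannot retry.
        return self.clusters.pop(cluster)


def pull_backup(export, todo):
    backup = {}
    for cluster in todo:
        backup[cluster] = export.read(cluster)  # store the data first
        export.trim(cluster)                    # then release it
    return backup


export = ToyFleecingExport({0: b"a", 1: b"b", 2: b"c"})
saved = pull_backup(export, [0, 1])
last = export.read_once(2)
```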


[...]

This way data from the copy-before-write filter goes first to ram-cache, and 
the backup job could read it from ram. ram-cache will automatically flush data 
to the temp qcow2 image when the ram-usage limit is reached. We'll also need a 
way to tell the backup job that it should first read clusters that are cached 
in ram, and only then other clusters. So, we'll have a priority for clusters to 
be copied by block-copy:
1. clusters in ram-cache
2. clusters not in temp img (to avoid copy-before-write operations in future)
3. clusters in temp img.

This will be a kind of block_status() thing that allows a block driver to give 
recommendations on the read order, to make reading efficient.
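
The three-level priority above could be sketched as a simple sort key. This 
assumes that the sets of clusters held in ram-cache and in the temp image are 
known and disjoint; copy_order and its parameters are hypothetical names.

```python
def copy_order(clusters, in_ram, in_temp):
    """Order clusters: ram-cache first, then not-yet-fleeced, then temp img."""

    def priority(cluster):
        if cluster in in_ram:
            return 0   # cheapest to read, and frees ram
        if cluster not in in_temp:
            return 1   # copying now avoids a future copy-before-write
        return 2       # already safe in the temp image, can wait

    # Within one priority level keep ascending cluster order.
    return sorted(clusters, key=lambda c: (priority(c), c))


order = copy_order(range(6), in_ram={4}, in_temp={1, 5})
```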


You mean block_status should give that recommendation?  Is that really 
necessary?  I think this is a rather special case, so block-copy could figure 
that out itself.  All it needs to do is for any dirty area determine how deep 
in the backing chain it is: Is it in the ram-cache, is it in temp image, or is 
it below both?  It should be able to figure that out with the *file information 
that block_status returns.


No, I don't propose to extend block_status(), it should be a separate interface.

Note also that there is another benefit of such a thing: we'll implement this 
callback in the qcow2 driver, so that backup will read clusters not in guest 
cluster order, but in host cluster order, to read more sequentially, which 
should bring better performance on rotating disks.


I’m not exactly sure how you envision this to work, but block_status also 
already gives you the host offset in *map.


But block-status doesn't give a possibility to read sequentially. For this, the 
user should call block-status several times until the whole disk is covered, 
then sort the segments by host offset. I wonder, could it be implemented as 
some iterator, like

read_iter = bdrv_get_sequential_read_iter(source)

while (extents = bdrv_read_next(read_iter)):
  for ext in extents:
    start_writing_task(target, ext.offset, ext.bytes, ext.qiov)

where bdrv_read_next will read guest data in host-cluster sequence.
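
For comparison, the block-status approach described above (collect all 
segments, then sort by host offset) amounts to something like the following 
toy sketch. Extent and host_order are invented for illustration; the extent 
list stands in for the results of repeated block-status calls.

```python
from collections import namedtuple

# One mapped segment: where the guest sees it and where it lives on the host.
Extent = namedtuple("Extent", "guest_offset host_offset length")


def host_order(extents):
    # Read in host-offset order so the reads are as sequential as possible.
    return sorted(extents, key=lambda e: e.host_offset)


extents = [
    Extent(guest_offset=0, host_offset=300, length=64),
    Extent(guest_offset=64, host_offset=100, length=64),
    Extent(guest_offset=128, host_offset=200, length=64),
]
guest_offsets_in_read_order = [e.guest_offset for e in host_order(extents)]
```

The whole extent list has to be collected before sorting, which is the cost an 
iterator provided by the driver could avoid.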


--
Best regards,
Vladimir
