13.08.2019 17:57, Max Reitz wrote:
> On 13.08.19 16:39, Vladimir Sementsov-Ogievskiy wrote:
>> 13.08.2019 17:23, Max Reitz wrote:
>>> On 13.08.19 16:14, Vladimir Sementsov-Ogievskiy wrote:
>>>> 12.08.2019 19:37, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 12.08.2019 19:11, Max Reitz wrote:
>>>>>> On 12.08.19 17:47, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 12.08.2019 18:10, Max Reitz wrote:
>>>>>>>> On 10.08.19 21:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>> backup_cow_with_offload can transfer more than one cluster. Let
>>>>>>>>> backup_cow_with_bounce_buffer behave similarly. It reduces the number
>>>>>>>>> of IO requests, since there is no need to copy cluster by cluster.
>>>>>>>>>
>>>>>>>>> Logic around bounce_buffer allocation changed: we can't just allocate
>>>>>>>>> one-cluster-sized buffer to share for all iterations. We can't also
>>>>>>>>> allocate buffer of full-request length it may be too large, so
>>>>>>>>> BACKUP_MAX_BOUNCE_BUFFER is introduced. And finally, allocation logic
>>>>>>>>> is to allocate a buffer sufficient to handle all remaining iterations
>>>>>>>>> at the point where we need the buffer for the first time.
>>>>>>>>>
>>>>>>>>> Bonus: get rid of pointer-to-pointer.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>>>>>>>>> ---
>>>>>>>>>  block/backup.c | 65 +++++++++++++++++++++++++++++++-------------------
>>>>>>>>>  1 file changed, 41 insertions(+), 24 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/block/backup.c b/block/backup.c
>>>>>>>>> index d482d93458..65f7212c85 100644
>>>>>>>>> --- a/block/backup.c
>>>>>>>>> +++ b/block/backup.c
>>>>>>>>> @@ -27,6 +27,7 @@
>>>>>>>>>  #include "qemu/error-report.h"
>>>>>>>>>  #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
>>>>>>>>> +#define BACKUP_MAX_BOUNCE_BUFFER (64 * 1024 * 1024)
>>>>>>>>>  typedef struct CowRequest {
>>>>>>>>>      int64_t start_byte;
>>>>>>>>> @@ -98,44 +99,55 @@ static void cow_request_end(CowRequest *req)
>>>>>>>>>      qemu_co_queue_restart_all(&req->wait_queue);
>>>>>>>>>  }
>>>>>>>>> -/* Copy range to target with a bounce buffer and return the bytes copied. If
>>>>>>>>> - * error occurred, return a negative error number */
>>>>>>>>> +/*
>>>>>>>>> + * Copy range to target with a bounce buffer and return the bytes copied. If
>>>>>>>>> + * error occurred, return a negative error number
>>>>>>>>> + *
>>>>>>>>> + * @bounce_buffer is assumed to enough to store
>>>>>>>>
>>>>>>>> s/to/to be/
>>>>>>>>
>>>>>>>>> + * MIN(BACKUP_MAX_BOUNCE_BUFFER, @end - @start) bytes
>>>>>>>>> + */
>>>>>>>>>  static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>>>>>>>>>                                                        int64_t start,
>>>>>>>>>                                                        int64_t end,
>>>>>>>>>                                                        bool is_write_notifier,
>>>>>>>>>                                                        bool *error_is_read,
>>>>>>>>> -                                                      void **bounce_buffer)
>>>>>>>>> +                                                      void *bounce_buffer)
>>>>>>>>>  {
>>>>>>>>>      int ret;
>>>>>>>>>      BlockBackend *blk = job->common.blk;
>>>>>>>>> -    int nbytes;
>>>>>>>>> +    int nbytes, remaining_bytes;
>>>>>>>>>      int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
>>>>>>>>>      assert(QEMU_IS_ALIGNED(start, job->cluster_size));
>>>>>>>>> -    bdrv_reset_dirty_bitmap(job->copy_bitmap, start, job->cluster_size);
>>>>>>>>> -    nbytes = MIN(job->cluster_size, job->len - start);
>>>>>>>>> -    if (!*bounce_buffer) {
>>>>>>>>> -        *bounce_buffer = blk_blockalign(blk, job->cluster_size);
>>>>>>>>> -    }
>>>>>>>>> +    bdrv_reset_dirty_bitmap(job->copy_bitmap, start, end - start);
>>>>>>>>> +    nbytes = MIN(end - start, job->len - start);
>>>>>>>>> -    ret = blk_co_pread(blk, start, nbytes, *bounce_buffer, read_flags);
>>>>>>>>> -    if (ret < 0) {
>>>>>>>>> -        trace_backup_do_cow_read_fail(job, start, ret);
>>>>>>>>> -        if (error_is_read) {
>>>>>>>>> -            *error_is_read = true;
>>>>>>>>> +
>>>>>>>>> +    remaining_bytes = nbytes;
>>>>>>>>> +    while (remaining_bytes) {
>>>>>>>>> +        int chunk = MIN(BACKUP_MAX_BOUNCE_BUFFER, remaining_bytes);
>>>>>>>>> +
>>>>>>>>> +        ret = blk_co_pread(blk, start, chunk, bounce_buffer, read_flags);
>>>>>>>>> +        if (ret < 0) {
>>>>>>>>> +            trace_backup_do_cow_read_fail(job, start, ret);
>>>>>>>>> +            if (error_is_read) {
>>>>>>>>> +                *error_is_read = true;
>>>>>>>>> +            }
>>>>>>>>> +            goto fail;
>>>>>>>>>          }
>>>>>>>>> -        goto fail;
>>>>>>>>> -    }
>>>>>>>>> -    ret = blk_co_pwrite(job->target, start, nbytes, *bounce_buffer,
>>>>>>>>> -                        job->write_flags);
>>>>>>>>> -    if (ret < 0) {
>>>>>>>>> -        trace_backup_do_cow_write_fail(job, start, ret);
>>>>>>>>> -        if (error_is_read) {
>>>>>>>>> -            *error_is_read = false;
>>>>>>>>> +        ret = blk_co_pwrite(job->target, start, chunk, bounce_buffer,
>>>>>>>>> +                            job->write_flags);
>>>>>>>>> +        if (ret < 0) {
>>>>>>>>> +            trace_backup_do_cow_write_fail(job, start, ret);
>>>>>>>>> +            if (error_is_read) {
>>>>>>>>> +                *error_is_read = false;
>>>>>>>>> +            }
>>>>>>>>> +            goto fail;
>>>>>>>>>          }
>>>>>>>>> -        goto fail;
>>>>>>>>> +
>>>>>>>>> +        start += chunk;
>>>>>>>>> +        remaining_bytes -= chunk;
>>>>>>>>>      }
>>>>>>>>>      return nbytes;
>>>>>>>>> @@ -301,9 +313,14 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
>>>>>>>>>          }
>>>>>>>>>      }
>>>>>>>>>      if (!job->use_copy_range) {
>>>>>>>>> +        if (!bounce_buffer) {
>>>>>>>>> +            size_t len = MIN(BACKUP_MAX_BOUNCE_BUFFER,
>>>>>>>>> +                             MAX(dirty_end - start, end - dirty_end));
>>>>>>>>> +            bounce_buffer = blk_try_blockalign(job->common.blk, len);
>>>>>>>>> +        }
>>>>>>>>
>>>>>>>> If you use _try_, you should probably also check whether it succeeded.
>>>>>>>
>>>>>>> Oops, you are right, of course.
>>>>>>>
>>>>>>>>
>>>>>>>> Anyway, I wonder whether it’d be better to just allocate this buffer
>>>>>>>> once per job (the first time we get here, probably) to be of size
>>>>>>>> BACKUP_MAX_BOUNCE_BUFFER and put it into BackupBlockJob. (And maybe add
>>>>>>>> a buf-size parameter similar to what the mirror jobs have.)
>>>>>>>>
>>>>>>>
>>>>>>> Once per job will not work, as we may have several guest writes in
>>>>>>> parallel and therefore several parallel copy-before-write operations.
>>>>>>
>>>>>> Hm. I’m not quite happy with that because if the guest just issues many
>>>>>> large discards in parallel, this means that qemu will allocate a large
>>>>>> amount of memory.
>>>>>>
>>>>>> It would be nice if there was a simple way to keep track of the total
>>>>>> memory usage and let requests yield if they would exceed it.
>>>>>
>>>>> Agree, it should be fixed anyway.
>>>>>
>>>>
>>>>
>>>> But still..
>>>>
>>>> Synchronous mirror allocates full-request buffers on guest write. Is it
>>>> correct?
>>>>
>>>> If we assume that it is correct to double memory usage of guest
>>>> operations, than for backup the problem is only in write_zero and
>>>> discard where guest-assumed memory usage should be zero.
>>>
>>> Well, but that is the problem. I didn’t say anything in v2, because I
>>> only thought of normal writes and I found it fine to double the memory
>>> usage there (a guest won’t issue huge write requests in parallel). But
>>> discard/write-zeroes are a different matter.
>>>
>>>> And if we should distinguish writes from write_zeroes and discard, it's
>>>> better to postpone this improvement to be after backup-top filter merged.
>>>
>>> But do you need to distinguish it? Why not just keep track of memory
>>> usage and put the current I/O coroutine to sleep in a CoQueue or
>>> something, and wake that up at the end of backup_do_cow()?
>>>
>>
>> 1. Because if we _can_ allow doubling of memory, it's more effective to not
>> restrict allocations on guest writes. It's just seems to be more effective
>> technique.
>
> But the problem with backup and zero writes/discards is that the memory
> is not doubled. The request doesn’t need any memory, but the CBW
> operation does, and maybe lots of it.
>
> So the guest may issue many zero writes/discards in parallel and thus
> exhaust memory on the host.

So this is the reason to separate writes from write-zeroes/discards. Then at
least writes will be happy. And I think that writes are a much more frequent
request than write-zeroes/discards.

>
>> 2. Anyway, I'd allow some always-available size to allocate - let it be one
>> cluster, which will correspond to current behavior and prevent guest io
>> hang in the worst case.
>
> The guest would only hang if it we have to copy more than e.g. 64 MB at
> a time. At which point I think it’s not unreasonable to sequentialize
> requests.
>

-- 
Best regards,
Vladimir
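
For illustration, below is a rough, untested sketch of the kind of accounting
Max suggests above: a global byte budget for copy-before-write bounce buffers,
where a request that would exceed the budget sleeps in a CoQueue and is woken
when another request releases its buffer at the end of backup_do_cow(). The
BounceBudget type, the helper names and the 128 MB cap are invented for the
example and are not part of the posted patch.

/* Hypothetical sketch, not part of the patch. */
#include "qemu/osdep.h"
#include "qemu/coroutine.h"

#define BACKUP_TOTAL_BOUNCE_BUDGET (128 * 1024 * 1024) /* arbitrary cap */

typedef struct BounceBudget {
    size_t in_use;      /* bytes currently held by in-flight CBW requests */
    CoQueue wait_queue; /* requests waiting for budget to become available */
} BounceBudget;

static void bounce_budget_init(BounceBudget *b)
{
    b->in_use = 0;
    qemu_co_queue_init(&b->wait_queue);
}

/* Call from a coroutine before allocating @bytes of bounce buffer. */
static void coroutine_fn bounce_budget_acquire(BounceBudget *b, size_t bytes)
{
    /*
     * Let at least one request proceed even if it alone exceeds the cap,
     * so a single large copy cannot wait forever.
     */
    while (b->in_use > 0 && b->in_use + bytes > BACKUP_TOTAL_BOUNCE_BUDGET) {
        qemu_co_queue_wait(&b->wait_queue, NULL);
    }
    b->in_use += bytes;
}

/* Call at the end of backup_do_cow() once the bounce buffer is freed. */
static void coroutine_fn bounce_budget_release(BounceBudget *b, size_t bytes)
{
    assert(b->in_use >= bytes);
    b->in_use -= bytes;
    qemu_co_queue_restart_all(&b->wait_queue);
}

With something like this, the copy-before-write path triggered by guest
write-zeroes/discards would serialize once the budget is used up instead of
allocating an unbounded amount of host memory, while plain guest writes could
still be allowed to proceed without throttling if we accept doubling their
memory usage.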