09.08.2019 18:59, Eric Blake wrote:
> On 8/9/19 10:32 AM, Vladimir Sementsov-Ogievskiy wrote:
>> backup_cow_with_offload can transfer more than on cluster. Let
>
> s/on/one/
>
>> backup_cow_with_bounce_buffer behave similarly. It reduces number
>> of IO and there are no needs to copy cluster by cluster.
>
> It reduces the number of IO requests, since there is no need to copy
> cluster by cluster.
>
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>> ---
>>  block/backup.c | 29 +++++++++++++++--------------
>>  1 file changed, 15 insertions(+), 14 deletions(-)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index d482d93458..155e21d0a3 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -104,22 +104,25 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>>                                                        int64_t start,
>>                                                        int64_t end,
>>                                                        bool is_write_notifier,
>> -                                                      bool *error_is_read,
>> -                                                      void **bounce_buffer)
>> +                                                      bool *error_is_read)
>
> Why is this signature changing?
>
>>  {
>>      int ret;
>>      BlockBackend *blk = job->common.blk;
>>      int nbytes;
>>      int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
>> +    void *bounce_buffer;
>>
>>      assert(QEMU_IS_ALIGNED(start, job->cluster_size));
>> -    bdrv_reset_dirty_bitmap(job->copy_bitmap, start, job->cluster_size);
>> -    nbytes = MIN(job->cluster_size, job->len - start);
>> -    if (!*bounce_buffer) {
>> -        *bounce_buffer = blk_blockalign(blk, job->cluster_size);
>
> Pre-patch, you allocate the bounce_buffer at most once (but limited to a
> cluster size), post-patch, you are now allocating and freeing a bounce
> buffer every iteration through. There may be fewer calls because you
> are allocating something bigger, but still, isn't it a good goal to try
> and allocate the bounce buffer as few times as possible and reuse it,
> rather than getting a fresh one each iteration?
Yes, it's a "degradation" of this patch, I was afraid of this question. However, I doubt that it should be optimized: 1. I'm going to run several copy requests in coroutines to parallelize copying loop, to improve performance (series for qcow2 are here https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg06654.html), so we'll need several buffers for parallel copying requests and it's extremely easier to allocate buffer when it's needed and free after it, than do some allocated memory sharing like in mirror. 2. Actually it is a question about memory allocator: is it fast and optimized enough to not care, or is it bad, and we should work-around it like in mirror? And in my opinion (proved by a bit of benchmarking) memalign is fast enough to don't care. I was answering similar question in more details and with some numbers here: https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg00087.html So, I'd prefere to not care and keep code simpler. But if you don't agree, I can leave shared buffer here, at least until introduction of parallel requests. Then, it will be something like qemu_try_blockalign(MIN(bytes, 64M)).. > >> + >> + nbytes = MIN(end - start, job->len - start); > > What is the largest this will be? Will it exhaust memory on a 32-bit or > otherwise memory-constrained system, where you should have some maximum > cap for how large the bounce buffer will be while still getting better > performance than one cluster at a time and not slowing things down by > over-sizing malloc? 64M, maybe? > Hmm, good point. I thought that it's safe to allocate buffer for a full request, as if such buffer is already allocated for the guest request itself, it should not be bad to allocate another one with same size. But I forgot about write-zeros and discard operations which may be large without any allocation and here we need to allocate anyway. Hmm2, but we have parallel guest writes(discards) and therefore parallel copy-before-write operations. So total allocation is not limited anyway and it should be fixed somehow too.. -- Best regards, Vladimir