Am 22.03.2013 18:25, schrieb Paolo Bonzini: > Il 22/03/2013 13:46, Peter Lieven ha scritto: >> this is v4 of my patch series with various optimizations in >> zero buffer checking and migration tweaks. >> >> thanks especially to Eric Blake for reviewing. >> >> v4: >> - do not inline buffer_find_nonzero_offset() >> - inline can_usebuffer_find_nonzero_offset() correctly >> - readd asserts in buffer_find_nonzero_offset() as profiling >> shows they do not hurt. >> - change last occurences of scalar 8 by >> BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR >> - avoid deferencing p already in patch 5 where we >> know that the page (p) is zero >> - explicitly set bytes_sent = 0 if we skip a zero page. >> bytes_sent was 0 before, but it was not obvious. >> - add accounting information for skipped zero pages >> - fix errors reported by checkpatch.pl >> >> v3: >> - remove asserts, inline functions and add a check >> function if buffer_find_nonzero_offset() can be used. >> - use above check function in buffer_is_zero() and >> find_next_bit(). >> - use buffer_is_nonzero_offset() directly to find >> zero pages. we know that all requirements are met >> for memory pages. >> - fix C89 violation in buffer_is_zero(). >> - avoid derefencing p in ram_save_block() if we already >> know the page is zero. >> - fix initialization of last_offset in reset_ram_globals(). >> - avoid skipping pages with offset == 0 in bulk stage in >> migration_bitmap_find_and_reset_dirty(). >> - compared to v1 check for zero pages also after bulk >> ram migration as there are guests (e.g. Windows) which >> zero out large amount of memory while running. >> >> v2: >> - fix description, add trivial zero check and add asserts >> to buffer_find_nonzero_offset. >> - add a constant for the unroll factor of buffer_find_nonzero_offset >> - replace is_dup_page() by buffer_is_zero() >> - added test results to xbzrle patch >> - optimize descriptions >> >> Peter Lieven (9): >> move vector definitions to qemu-common.h >> cutils: add a function to find non-zero content in a buffer >> buffer_is_zero: use vector optimizations if possible >> bitops: use vector algorithm to optimize find_next_bit() >> migration: search for zero instead of dup pages >> migration: add an indicator for bulk state of ram migration >> migration: do not sent zero pages in bulk stage >> migration: do not search dirty pages in bulk stage >> migration: use XBZRLE only after bulk stage >> >> arch_init.c | 74 >> +++++++++++++++++++---------------------- >> hmp.c | 2 ++ >> include/migration/migration.h | 2 ++ >> include/qemu-common.h | 37 +++++++++++++++++++++ >> migration.c | 3 +- >> qapi-schema.json | 6 ++-- >> qmp-commands.hx | 3 +- >> util/bitops.c | 24 +++++++++++-- >> util/cutils.c | 50 ++++++++++++++++++++++++++++ >> 9 files changed, 155 insertions(+), 46 deletions(-) >> > I think patch 4 is a bit overengineered. I would prefer the simple > patch you had using three/four non-vectorized accesses. The setup cost > of the vectorized buffer_is_zero is quite high, and 64 bits are just > 256k RAM; if the host doesn't touch 256k RAM, it will incur the overhead. I think you are right. I was a little to eager to utilize buffer_find_nonzero_offset() as much as possible. The performance gain by unrolling was impressive enough. The gain by the vector functions is not that big that it would justify a possible slow down by the high setup costs. My testings revealed that in most cases buffer_find_nonzero_offset() returns 0 or a big offset. All the 0 return values would have increased setup costs with the vectorized version of patch 4.
> > I would prefer some more benchmarking for patch 5, but it looks ok. What would you like to see? Statistics how many pages of a real system are not zero, but zero in the first sizeof(long) bytes? > > The rest are fine, thanks! Thank you for reviewing. If we are done with this patches I will continue with the block migration optimizations next week. Peter