> >>> I wonder if it is the scanning for zeros or sending the whiteout
> >>> which affects the total migration time more. If it is the former
> >>> (as I would expect) then a rather local change to is_zero_range()
> >>> to make use of the mapping information before scanning would get
> >>> you all the speedups without protocol changes, interfering with
> >>> postcopy etc.
> >>>
> >>> Roman.
> >>>
> >>
> >> Localizing the solution to the zero page scan check is a good idea.
> >> I too agree that most of the time is spent in scanning for zero
> >> pages, in which case we should be able to localize the solution to
> >> is_zero_range(). However, in the case of ballooned-out pages (which
> >> can be seen as a subset of guest zero pages) we also spend a very
> >> small portion of the total migration time in sending the control
> >> information, which can also be avoided.
> >> From my tests with a 16GB idle guest of which 12GB was ballooned
> >> out, the zero page scan time for the 12GB of ballooned-out pages
> >> was ~1789 ms, and save_page_header + qemu_put_byte(f, 0); for the
> >> same 12GB of ballooned-out pages was ~556 ms. Total migration time
> >> was ~8000 ms.
> >
> > How did you do the tests? ~556 ms seems too long for putting several
> > bytes into the buffer.
> > It's likely the time you measured contains the portion spent
> > processing the other 4GB of guest memory pages.
> >
> > Liang
> >
>
> I modified save_zero_page() as below and updated the timers only for
> ballooned-out pages, so is_zero_range() should return true (and
> qemu_balloon_bitmap_test() from my patchset returned 1). With the below
> instrumentation, I got t1 = ~1789 ms and t2 = ~556 ms. Also, the total
> migration time noted (~8000 ms) is for the unmodified qemu source.
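For reference, a minimal sketch of the localized check Roman suggests
might look like the variant below. It is only an illustration, not part
of any posted patch: it reuses the qemu_balloon_bitmap_test() helper
referred to above (exact signature assumed from the snippet later in
this mail) and puts the bitmap check at the save_zero_page() call site
rather than inside is_zero_range() itself, since the bitmap is indexed
by block/offset while is_zero_range() only sees a pointer and a length:

    static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                              uint8_t *p, uint64_t *bytes_transferred)
    {
        int pages = -1;

        /* Ballooned-out pages are known to read back as zero, so consult
         * the balloon bitmap first and skip the page scan for them. */
        if (qemu_balloon_bitmap_test(block, offset) == 1 ||
            is_zero_range(p, TARGET_PAGE_SIZE)) {
            acct_info.dup_pages++;
            *bytes_transferred += save_page_header(f, block,
                                                   offset | RAM_SAVE_FLAG_COMPRESS);
            qemu_put_byte(f, 0);
            *bytes_transferred += 1;
            pages = 1;
        }
        return pages;
    }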
You mean the total live migration time for the unmodified qemu and the
'you modified for test' qemu are almost the same?

> It seems to add up to the final migration time with the proposed
> patchset.
>
> Here is the last entry from "another round" of the test; this time it
> is ~547 ms:
> JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 us,
> save_page_header_time=184 us, total_save_zero_page_time=1453 us
> cumulated vals: zero_page_scan_time=1723920378 us,
> save_page_header_time=547514618 us,
> total_save_zero_page_time=2371059239 us
>
> static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>                           uint8_t *p, uint64_t *bytes_transferred)
> {
>     int pages = -1;
>     int64_t time1, time2, time3, time4;
>     static int64_t t1 = 0, t2 = 0, t3 = 0;
>
>     time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>     if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>         time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>         acct_info.dup_pages++;
>         *bytes_transferred += save_page_header(f, block,
>                                                offset | RAM_SAVE_FLAG_COMPRESS);
>         qemu_put_byte(f, 0);
>         time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>         *bytes_transferred += 1;
>         pages = 1;
>     }
>     time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>
>     if (qemu_balloon_bitmap_test(block, offset) == 1) {
>         t1 += (time2 - time1);
>         t2 += (time3 - time2);
>         t3 += (time4 - time1);
>         fprintf(stderr, "block=%lx, offset=%lx, zero_page_scan_time=%ld us, "
>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n"
>                 "cumulated vals: zero_page_scan_time=%ld us, "
>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n",
>                 (unsigned long)block, (unsigned long)offset,
>                 (time2 - time1), (time3 - time2), (time4 - time1),
>                 t1, t2, t3);
>     }
>     return pages;
> }
>

Thanks for your description. The issue here is that there are too many
qemu_clock_get_ns() calls; the cost of the function itself may become
the main time-consuming operation. You can measure the time consumed by
the qemu_clock_get_ns() calls you added for the test by comparing the
result with a version that does not add them.

Liang
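To get a feel for that per-call cost, one could time a tight loop of
qemu_clock_get_ns() calls and divide by the iteration count. The sketch
below is illustrative only (not from this thread); the volatile sink is
just there to keep the compiler from dropping the calls:

    /* Illustrative only: estimate the average cost of one
     * qemu_clock_get_ns() call, to weigh against the per-page timings. */
    static void measure_clock_get_ns_overhead(void)
    {
        const int iters = 1000000;
        volatile int64_t sink = 0;
        int64_t start, end;
        int i;

        start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
        for (i = 0; i < iters; i++) {
            sink += qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
        }
        end = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

        fprintf(stderr, "qemu_clock_get_ns(): ~%ld ns per call\n",
                (long)((end - start) / iters));
    }

Subtracting this overhead from the cumulated numbers, or simply rerunning
without the added qemu_clock_get_ns() calls as suggested above, should
show how much of the ~556 ms is measurement cost.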