On 3/11/2016 4:24 PM, Li, Liang Z wrote:
>>>>> I wonder if it is the scanning for zeros or sending the whiteout which
>>>>> affects the total migration time more. If it is the former (as I would
>>>>> expect) then a rather local change to is_zero_range() to make use of the
>>>>> mapping information before scanning would get you all the speedups
>>>>> without protocol changes, interfering with postcopy etc.
>>>>>
>>>>> Roman.
>>>>
>>>> Localizing the solution to the zero page scan check is a good idea. I too
>>>> agree that most of the time is spent in scanning for zero pages, in which
>>>> case we should be able to localize the solution to is_zero_range().
>>>> However, in the case of ballooned-out pages (which can be seen as a subset
>>>> of guest zero pages) we also spend a very small portion of the total
>>>> migration time in sending the control information, which can also be
>>>> avoided.
>>>>
>>>> From my tests for a 16GB idle guest of which 12GB was ballooned out, the
>>>> zero page scan time for the 12GB of ballooned-out pages was ~1789 ms, and
>>>> save_page_header + qemu_put_byte(f, 0); for the same 12GB of ballooned-out
>>>> pages was ~556 ms. Total migration time was ~8000 ms.
>>>
>>> How did you do the tests? ~556 ms seems too long for putting several bytes
>>> into the buffer. It's likely the time you measured contains the portion
>>> spent processing the other 4GB of guest memory pages.
>>>
>>> Liang
>>
>> I modified save_zero_page() as below and updated the timers only for
>> ballooned-out pages, so is_zero_range() should return true (also
>> qemu_balloon_bitmap_test() from my patchset returned 1). With the below
>> instrumentation, I got t1 = ~1789 ms and t2 = ~556 ms. Also, the total
>> migration time noted (~8000 ms) is for the unmodified qemu source.
>
> You mean the total live migration time for the unmodified qemu and the
> 'you modified for test' qemu are almost the same?
Not sure I understand the question, but if 'you modified for test' means the below modifications to save_zero_page(), then the answer is no. Here is what I tried; let's say we have 3 versions of qemu (the timings below are for a 16GB idle guest with 12GB ballooned out):
v1. Unmodified qemu - absolutely no code change. Total migration time = ~7600 ms
    (I rounded this one to ~8000 ms).
v2. Modified qemu 1 - with the proposed patch set, which skips both the zero
    page scan and migrating the control information for ballooned-out pages
    (see the rough sketch after this list). Total migration time = ~5700 ms.
v3. Modified qemu 2 - only with the changes to save_zero_page() as discussed in
    the previous mail (and of course using the proposed patch set only to
    maintain the bitmap for ballooned-out pages). Total migration time is
    irrelevant in this case.
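To make the v2 comparison concrete, here is a rough sketch of the idea - not
the actual patch set code - of what skipping both the scan and the control
information could look like in save_zero_page(). It assumes the
qemu_balloon_bitmap_test() helper from the proposed patch set and that the
destination recreates ballooned-out pages from the migrated balloon bitmap
(otherwise nothing on the wire tells it the page is absent):

/*
 * Sketch only: consult the balloon bitmap first, so a ballooned-out page is
 * neither scanned for zeros nor described by a RAM_SAVE_FLAG_COMPRESS header.
 */
static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                          uint8_t *p, uint64_t *bytes_transferred)
{
    int pages = -1;

    if (qemu_balloon_bitmap_test(block, offset) == 1) {
        /* Ballooned out: no is_zero_range() scan, no save_page_header(),
         * no qemu_put_byte() - report the page as handled.  The real patch
         * may account for such pages differently. */
        return 1;
    }

    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
        acct_info.dup_pages++;
        *bytes_transferred += save_page_header(f, block,
                                               offset | RAM_SAVE_FLAG_COMPRESS);
        qemu_put_byte(f, 0);
        *bytes_transferred += 1;
        pages = 1;
    }
    return pages;
}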
From the instrumentation (v3):
Total zero page scan time = ~1789 ms
Total (save_page_header + qemu_put_byte(f, 0)) time = ~556 ms

Everything seems to add up here (it may not be exact): 5700 + 1789 + 556 = ~8000 ms.
I see 2 factors that we have not considered in this sum:
a. the overhead of migrating the balloon bitmap to the target, and
b. as you mentioned below, the overhead of qemu_clock_get_ns() itself.
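For a rough feel of factor (a), assuming 4 KiB target pages and one bit per
guest page in the balloon bitmap (both assumptions on my side, not numbers
taken from the patch set):

    16 GiB / 4 KiB = 4,194,304 pages -> 4,194,304 bits ~= 512 KiB of bitmap

so transferring the bitmap once should cost far less than the ~2.3 s
(1789 ms + 556 ms) of scan and header time accounted for above.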
It seems to add up to the final migration time with the proposed patchset.
Here is the last entry from "another round" of the test; this time the
cumulated (save_page_header + qemu_put_byte) time is ~547 ms (note that
qemu_clock_get_ns() returns nanoseconds, so the "us" labels in this output
are really ns):

JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 us, save_page_header_time=184 us, total_save_zero_page_time=1453 us
cumulated vals: zero_page_scan_time=1723920378 us, save_page_header_time=547514618 us, total_save_zero_page_time=2371059239 us

>> static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>>                           uint8_t *p, uint64_t *bytes_transferred)
>> {
>>     int pages = -1;
>>     int64_t time1, time2, time3, time4;
>>     static int64_t t1 = 0, t2 = 0, t3 = 0;
>>
>>     time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>     if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>         time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>         acct_info.dup_pages++;
>>         *bytes_transferred += save_page_header(f, block,
>>                                                offset | RAM_SAVE_FLAG_COMPRESS);
>>         qemu_put_byte(f, 0);
>>         time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>         *bytes_transferred += 1;
>>         pages = 1;
>>     }
>>     time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>
>>     if (qemu_balloon_bitmap_test(block, offset) == 1) {
>>         t1 += (time2 - time1);
>>         t2 += (time3 - time2);
>>         t3 += (time4 - time1);
>>         fprintf(stderr, "block=%lx, offset=%lx, zero_page_scan_time=%ld us, "
>>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n"
>>                 "cumulated vals: zero_page_scan_time=%ld us, "
>>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n",
>>                 (unsigned long)block, (unsigned long)offset,
>>                 (time2 - time1), (time3 - time2), (time4 - time1), t1, t2, t3);
>>     }
>>     return pages;
>> }
>
> Thanks for your description. The issue here is that there are too many
> qemu_clock_get_ns() calls; the cost of the function itself may become the
> main time-consuming operation. You can measure the time consumed by the
> qemu_clock_get_ns() calls you added for the test by comparing the result
> with a version that does not add the qemu_clock_get_ns() calls.
>
> Liang
Yes, we can try to measure the overhead of the qemu_clock_get_ns() calls and see if things add up perfectly.
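One simple way to estimate that overhead is a minimal calibration sketch along
these lines, assuming it is compiled inside QEMU where qemu/timer.h and
qemu_clock_get_ns() are available and that the result is printed from some
one-off debug hook (clock_call_overhead_ns() is just a name used here, not an
existing function):

static int64_t clock_call_overhead_ns(void)
{
    /* Estimate the average cost of a single qemu_clock_get_ns() call by
     * timing a large batch of back-to-back calls. */
    const int iterations = 1000000;
    int64_t start, end;
    int i;

    start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    for (i = 0; i < iterations; i++) {
        (void)qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    }
    end = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    return (end - start) / iterations;   /* average ns per call */
}

With 4 calls per save_zero_page() invocation and 12 GiB / 4 KiB = 3,145,728
ballooned-out pages, even a hypothetical 25 ns per call would add up to about
3,145,728 * 4 * 25 ns ~= 0.3 s of pure timer overhead, which is in the right
ballpark to matter for the add-up above.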
Thanks,
- Jitendra