Hi, I filed a bug report (626781) months ago. Is there any news on it? By the way, I think it really is a bug rather than a question.
Regards,
eslay

** Changed in: qemu
   Status: New => Invalid

** Converted to question:
   https://answers.launchpad.net/qemu/+question/132364

--
Live migration: bandwidth calculation and rate limiting not working
https://bugs.launchpad.net/bugs/626781
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.

Status in QEMU:
  Invalid

Bug description:

I am using QEMU 0.12.5 to perform live migration between two Linux hosts. One host has 6 cores and 24G RAM, the other has 2 cores and 16G RAM. On each host, one Ethernet interface is used for NFS storage, another for live migration, and a third for the VM to communicate with the outside network. Each interface has 1G of bandwidth.

It is observed that a program like the one below, which generates dirty pages very quickly, will hang the live migration:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        unsigned char *array;
        long int i, j, k;
        unsigned char c;
        long int loop = 0;

        array = malloc(1024 * 1024 * 1024);
        while (1) {
            /* repeatedly rewrite the whole 1G buffer to keep dirtying pages */
            for (i = 0; i < 1024; i++) {
                c = 0;
                for (j = 0; j < 1024; j++) {
                    c++;
                    for (k = 0; k < 1024; k++) {
                        array[i * 1024 * 1024 + j * 1024 + k] = c;
                    }
                }
            }
            loop++;
            if (loop % 256 == 0)
                printf("%ld\n", loop);
        }
    }

It is also observed that the traffic down time (measured with "ping -f" from a third host) depends on the RAM size of the virtual machine:

    RAM size    Traffic down time    Total migration time
    1024M       0.5s                 33s
    2048M       0.7s                 34s
    4096M       2.7s                 39s
    8192M       5.3s                 45s
    16384M      7.2s                 61s

Using the command "migrate_set_downtime" in the QEMU monitor does not improve the problem.

Function ram_save_live() in "vl.c" shows that live migration has three stages. Stage 1 is preparation work. Stage 2 transfers the VM RAM to the target host while keeping the VM alive on the source host; during Stage 2 the realtime migration bandwidth is calculated (lines 3099~3117 in vl.c). At the end of Stage 2 (line 3130), the expected remaining time of the RAM transfer is calculated (remaining RAM size / calculated bandwidth). If the expected remaining time is less than the maximum allowed migration down time, Stage 2 ends and Stage 3 starts. Stage 3 stops the VM on the source host, transfers the remaining RAM at full speed, and then starts the VM on the target host. The period of Stage 3 is believed to be the period during which the outside world loses its connection to the VM. This is how live migration is supposed to work.
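To restate that exit condition in code form, the decision at the end of each Stage 2 iteration is essentially the following test (an illustrative sketch only, with my own function name; the real code is at the end of ram_save_live(), part of which appears in my fix below):

    #include <stdint.h>

    /* Illustration of the Stage 2 exit test described above: stay in Stage 2
     * until the remaining dirty RAM could be transferred within the allowed
     * down time at the estimated bandwidth. bwidth_bytes_per_ns is assumed
     * to be > 0 (the real code forces a small minimum). */
    static int ready_for_stage3(uint64_t remaining_bytes,
                                double bwidth_bytes_per_ns,
                                uint64_t max_downtime_ns)
    {
        uint64_t expected_time_ns = remaining_bytes / bwidth_bytes_per_ns;
        return expected_time_ns <= max_downtime_ns;
    }

Everything therefore hinges on the bandwidth estimate being realistic, which is exactly where things go wrong.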
There is a parameter max_throttle in "migration.c", which sets the maximum allowed bandwidth for rate limiting. Its default value is 32Mb/s (if the command "migrate_set_speed" is not used to change it). But this does not matter, because the rate limiting function does not work anyway.

There is another parameter max_downtime in "migration.c", which sets the maximum allowed traffic down time for live migration. By default it is set to 30ms (if the command "migrate_set_downtime" is not used to change it). This value is way too small, so with the test program above running, live migration hangs: Stage 2 never ends because the expected remaining time never drops below 30ms. Changing the parameter to something like 1000ms solves the hanging problem.

After changing the default value of max_downtime, the long traffic down time problem still exists. The following faults are found:

a) The bandwidth calculation in ram_save_live() (see the first attachment) is wrong. The bandwidth should equal the amount of data transferred divided by the transmission time, and the transmission time should be the interval between two consecutive calls of ram_save_live(), which is usually 100ms (a timer interrupt should control this). However, what the code actually uses is the execution time of the while loop between lines 3102 and 3109, which is usually only 2~5ms. This yields an unreasonably large bandwidth (6~12Gb/s), which in turn makes the estimated execution time of Stage 3 inaccurate: if the estimated execution time of Stage 3 is 900ms, the actual execution time can be something like 10s! (A sketch of the calculation I have in mind is given further below.)

b) The rate limiting function (qemu_file_rate_limit(), which calls buffered_rate_limit() in "buffered_file.c") does not work at all. No matter what parameters are set, it behaves the same way: during most of Stage 2 the migration bandwidth is ~400Mb/s, and when a certain condition is fulfilled (I don't know exactly which condition, but it is definitely not the number of iterations), QEMU reads the VM RAM at full speed and throws everything onto the Ethernet link. This stalls the CPU and extends the execution time of ram_save_live() to up to 6 seconds; correspondingly, 6 seconds of traffic down time is seen during Stage 2. (The algorithm actually assumes there is no traffic down time during Stage 2.)

So the fundamental functions do not work when it comes to the traffic down time of live migration. However, faults a) and b) make the algorithm enter Stage 3 very easily: the calculated bandwidth is ridiculously large, so as long as the maximum down time is not set to a value as small as 30ms and there is no extensive memory modification during migration, the migration finishes.
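To make a) concrete, this is roughly the calculation I believe should be used (a sketch only; the static variables and the function name are mine, not QEMU's, and get_clock() is assumed to return nanoseconds, as it does in the code below):

    #include <stdint.h>

    /* Sketch: estimate bandwidth over the interval between two consecutive
     * calls of ram_save_live() (~100ms), not over the 2~5ms spent inside the
     * copy loop. Returns bytes per nanosecond. The very first call will
     * under-estimate; the real code would seed these variables in Stage 1. */
    static int64_t  prev_call_time_ns;
    static uint64_t prev_bytes_transferred;

    static double estimate_bandwidth(int64_t now_ns, uint64_t bytes_transferred)
    {
        double interval = now_ns - prev_call_time_ns;   /* full call interval */
        double bwidth = (bytes_transferred - prev_bytes_transferred) / interval;

        prev_call_time_ns = now_ns;
        prev_bytes_transferred = bytes_transferred;
        return bwidth;
    }

With the divisor being the full ~100ms interval instead of the loop time, the estimate stays near the real link speed rather than 6~12Gb/s, and the Stage 3 prediction becomes meaningful.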
I have a dirty fix for the problem. I assume that in Stage 2, ram_save_live() is called every 100ms, and that each time it is called no more than 100Mb of data may be transferred:

    static int count = 0;

    static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
    {
        ram_addr_t addr;
        uint64_t bytes_transferred_last;
        double bwidth = 0;
        uint64_t expected_time = 0;
        // int64_t interval = 0;
        bool flag = true;
        uint64_t bytes_transferred2 = 0;

        if (stage < 0) {
            cpu_physical_memory_set_dirty_tracking(0);
            return 0;
        }

        // printf("ram_save_live: stage= %d\n", stage);
        if (cpu_physical_sync_dirty_bitmap(0, TARGET_PHYS_ADDR_MAX) != 0) {
            qemu_file_set_error(f);
            return 0;
        }

        if (stage == 1) {
            bytes_transferred = 0;

            /* Make sure all dirty bits are set */
            for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
                if (!cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
                    cpu_physical_memory_set_dirty(addr);
            }

            /* Enable dirty memory tracking */
            cpu_physical_memory_set_dirty_tracking(1);

            qemu_put_be64(f, last_ram_offset | RAM_SAVE_FLAG_MEM_SIZE);
        }

        bytes_transferred_last = bytes_transferred;
        bwidth = get_clock();

        /* send at most ~100Mb (1Gb/s worth of data for 1/10 s) per call */
        while (bytes_transferred2 <= 1024 * 1024 * 1024 / 8 / 10) {
        // while ((!qemu_file_rate_limit(f)) && (bytes_transferred2 <= 1024 * 1024 * 1024 / 8 / 10)) {
            int ret;

            ret = ram_save_block(f);
            bytes_transferred += ret * TARGET_PAGE_SIZE;
            bytes_transferred2 += ret * TARGET_PAGE_SIZE;
            if (ret == 0) /* no more blocks */
                break;
        }
        count++;

        bwidth = get_clock() - bwidth;
        /* treat the measurement period as at least 100ms (the assumed call interval) */
        if (bwidth < 100000000) {
            bwidth = 100000000;
            flag = false;
        }
        if (flag)
            printf("ram_save_live: interval = %ld ms, count= %d\n",
                   (int64_t)bwidth / 1000000, count);
        bwidth = (bytes_transferred - bytes_transferred_last) / bwidth;

        /* if we haven't transferred anything this round, force expected_time to
         * a very high value, but without crashing */
        if (bwidth == 0)
            bwidth = 0.000001;
        /* cap the estimate at ~1Gb/s, expressed in bytes per nanosecond */
        if (bwidth > 1.024 / 8)
            bwidth = 1.024 / 8;

        /* try transferring iterative blocks of memory */
        if (stage == 3) {
            /* flush all remaining blocks regardless of rate limiting */
            while (ram_save_block(f) != 0) {
                bytes_transferred += TARGET_PAGE_SIZE;
            }
            cpu_physical_memory_set_dirty_tracking(0);
        }

        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);

        expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
        printf("ram_save_live: stage = %d, bwidth = %lf Mb/s, expected_time = %ld ms, migrate_max_downtime = %ld ms\n",
               stage, bwidth * 1000 * 8, expected_time / 1000000,
               migrate_max_downtime() / 1000000);

        return (stage == 2) && (expected_time <= migrate_max_downtime());
    }

For an empty VM with 15G RAM, the dirty fix extends the total migration time to ~2 minutes, but the traffic down time can be kept to ~1 second. The actual migration bandwidth stays at ~700Mb/s the whole time. This fix is very environment specific (it will not work with, say, a 10G link). A thorough fix is needed for this problem.
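A less environment-specific variant of the same idea would derive the per-call byte budget from the configured maximum migration speed (max_throttle) and the measured call interval, instead of hard-coding 1Gb/s and 100ms. A sketch of what I mean (illustrative only; the function name is mine):

    #include <stdint.h>

    /* Sketch: per-call byte budget derived from the configured migration
     * speed and the actual call interval, replacing the hard-coded
     * 1Gb/s / 100ms assumption in the dirty fix above. */
    static uint64_t bytes_budget(uint64_t max_rate_bytes_per_sec,
                                 int64_t call_interval_ns)
    {
        return max_rate_bytes_per_sec * (uint64_t)call_interval_ns / 1000000000ULL;
    }

    /*
     * The Stage 2 loop would then stop once the budget for this call is spent:
     *
     *     while (bytes_transferred2 <= bytes_budget(max_throttle, interval)) {
     *         ...
     *     }
     */

That way the same code should adapt to a 10G link simply by raising the speed with migrate_set_speed, rather than needing the constants re-tuned.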