On Fri, Jul 31, 2020 at 3:52 PM Lukas Straub <lukasstra...@web.de> wrote:
>
> On Sun, 21 Jun 2020 10:10:03 +0800
> Derek Su <dere...@qnap.com> wrote:
>
> > This series is to reduce the guest's downtime during colo checkpoint
> > by migrating dirty ram pages as many as possible before colo checkpoint.
> >
> > If the iteration count reaches COLO_RAM_MIGRATE_ITERATION_MAX or
> > ram pending size is lower than 'x-colo-migrate-ram-threshold',
> > stop the ram migration and do colo checkpoint.
> >
> > Test environment:
> > The both primary VM and secondary VM has 1GiB ram and 10GbE NIC
> > for FT traffic.
> > One fio buffer write job runs on the guest.
> > The result shows the total primary VM downtime is decreased by ~40%.
> >
> > Please help to review it and suggestions are welcomed.
> > Thanks.
>
> Hello Derek,
> Sorry for the late reply.
> I think this is not a good idea, because it unnecessarily introduces a delay
> between checkpoint request and the checkpoint itself and thus impairs network
> bound workloads due to increased network latency. Workloads that are
> independent from network don't cause many checkpoints anyway, so it doesn't
> help there either.
>
Hello, Lukas & Zhanghailiang,

Thanks for your opinions.

I went through my patch again, but I'm a little confused and would like to
dig into this more.

In this patch, colo_migrate_ram_before_checkpoint() runs before
COLO_MESSAGE_CHECKPOINT_REQUEST, so the SVM and PVM should not enter the
paused state yet. Meanwhile, the packets to the PVM/SVM can still be
compared, and an inconsistency can still be reported if they mismatch,
right? Is it really possible for this to introduce extra network latency?

In my test (fio randwrite to disk with direct=0), the ping results from
another client to the PVM, with generic COLO and with COLO using this
patch, are shown below. The network latency does not increase, as I
expected.

generic colo
```
64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=28.109 ms
64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=16.747 ms
64 bytes from 192.168.80.18: icmp_seq=89 ttl=64 time=2388.779 ms  <----checkpoint start
64 bytes from 192.168.80.18: icmp_seq=90 ttl=64 time=1385.792 ms
64 bytes from 192.168.80.18: icmp_seq=91 ttl=64 time=384.896 ms   <----checkpoint end
64 bytes from 192.168.80.18: icmp_seq=92 ttl=64 time=3.895 ms
64 bytes from 192.168.80.18: icmp_seq=93 ttl=64 time=1.020 ms
64 bytes from 192.168.80.18: icmp_seq=94 ttl=64 time=0.865 ms
64 bytes from 192.168.80.18: icmp_seq=95 ttl=64 time=0.854 ms
64 bytes from 192.168.80.18: icmp_seq=96 ttl=64 time=28.359 ms
64 bytes from 192.168.80.18: icmp_seq=97 ttl=64 time=12.309 ms
64 bytes from 192.168.80.18: icmp_seq=98 ttl=64 time=0.870 ms
64 bytes from 192.168.80.18: icmp_seq=99 ttl=64 time=2371.733 ms
64 bytes from 192.168.80.18: icmp_seq=100 ttl=64 time=1371.440 ms
64 bytes from 192.168.80.18: icmp_seq=101 ttl=64 time=366.414 ms
64 bytes from 192.168.80.18: icmp_seq=102 ttl=64 time=0.818 ms
64 bytes from 192.168.80.18: icmp_seq=103 ttl=64 time=0.997 ms
```

colo with this patch
```
64 bytes from 192.168.80.18: icmp_seq=72 ttl=64 time=1.417 ms
64 bytes from 192.168.80.18: icmp_seq=73 ttl=64 time=0.931 ms
64 bytes from 192.168.80.18: icmp_seq=74 ttl=64 time=0.876 ms
64 bytes from 192.168.80.18: icmp_seq=75 ttl=64 time=1184.034 ms  <----checkpoint start
64 bytes from 192.168.80.18: icmp_seq=76 ttl=64 time=181.297 ms   <----checkpoint end
64 bytes from 192.168.80.18: icmp_seq=77 ttl=64 time=0.865 ms
64 bytes from 192.168.80.18: icmp_seq=78 ttl=64 time=0.858 ms
64 bytes from 192.168.80.18: icmp_seq=79 ttl=64 time=1.247 ms
64 bytes from 192.168.80.18: icmp_seq=80 ttl=64 time=0.946 ms
64 bytes from 192.168.80.18: icmp_seq=81 ttl=64 time=0.855 ms
64 bytes from 192.168.80.18: icmp_seq=82 ttl=64 time=0.868 ms
64 bytes from 192.168.80.18: icmp_seq=83 ttl=64 time=0.749 ms
64 bytes from 192.168.80.18: icmp_seq=84 ttl=64 time=2.154 ms
64 bytes from 192.168.80.18: icmp_seq=85 ttl=64 time=1499.186 ms
64 bytes from 192.168.80.18: icmp_seq=86 ttl=64 time=496.173 ms
64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=0.854 ms
64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=0.774 ms
```

Thank you.

Regards,
Derek

> Hailang did have a patch to migrate ram between checkpoints, which should
> help all workloads, but it wasn't merged back then. I think you can pick it
> up again, rebase and address David's and Eric's comments:
> https://lore.kernel.org/qemu-devel/20200217012049.22988-3-zhang.zhanghaili...@huawei.com/T/#u
>
> Hailang, are you ok with that?
>
> Regards,
> Lukas Straub
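
P.S. To make the ordering I'm describing concrete, below is a rough,
self-contained sketch of the pre-checkpoint loop. It is not the actual patch
code: only colo_migrate_ram_before_checkpoint(), COLO_RAM_MIGRATE_ITERATION_MAX,
COLO_MESSAGE_CHECKPOINT_REQUEST and 'x-colo-migrate-ram-threshold' come from the
series; the constant value, the stub helpers and main() are placeholders for
illustration.

```
/* build: gcc -std=c11 -Wall colo_sketch.c -o colo_sketch */
#include <stdint.h>
#include <stdio.h>

#define COLO_RAM_MIGRATE_ITERATION_MAX 10           /* value assumed for the sketch */

static uint64_t pending = 512ull << 20;             /* fake dirty-RAM counter */

/* Stubs standing in for the real migration helpers. */
static uint64_t ram_pending_size(void) { return pending; }
static void ram_migrate_one_iteration(void) { pending /= 2; }  /* pretend half gets sent */
static void colo_send_checkpoint_request(void)
{
    /* In real COLO this is where COLO_MESSAGE_CHECKPOINT_REQUEST is sent
     * and the VMs start to pause. */
    puts("send COLO_MESSAGE_CHECKPOINT_REQUEST");
}

/* Migrate dirty RAM while both VMs keep running: stop after
 * COLO_RAM_MIGRATE_ITERATION_MAX rounds or once the pending size drops
 * below 'x-colo-migrate-ram-threshold'. */
static void colo_migrate_ram_before_checkpoint(uint64_t ram_threshold)
{
    for (int i = 0; i < COLO_RAM_MIGRATE_ITERATION_MAX &&
                    ram_pending_size() >= ram_threshold; i++) {
        ram_migrate_one_iteration();
    }
}

int main(void)
{
    uint64_t threshold = 64ull << 20;               /* x-colo-migrate-ram-threshold */
    colo_migrate_ram_before_checkpoint(threshold);  /* PVM/SVM still running here */
    colo_send_checkpoint_request();                 /* checkpoint proper starts */
    return 0;
}
```

The point of the loop is that it runs while both VMs are still running and
packet comparison is still active; only the checkpoint request afterwards
pauses them.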