Hi, Andrei,

On Mon, Jul 31, 2023 at 05:51:49PM +0300, gudkov.and...@huawei.com wrote:
> On Mon, Jul 17, 2023 at 03:08:37PM -0400, Peter Xu wrote:
> > On Tue, Jul 11, 2023 at 03:38:18PM +0300, gudkov.and...@huawei.com wrote:
> > > On Thu, Jul 06, 2023 at 03:23:43PM -0400, Peter Xu wrote:
> > > > On Thu, Jun 29, 2023 at 11:59:03AM +0300, Andrei Gudkov wrote:
> > > > > Introduces an alternative argument, calc-time-ms, which is the
> > > > > same as calc-time but accepts a value in milliseconds.
> > > > > Millisecond precision makes it possible to predict whether
> > > > > migration will succeed or not. To do this, calculate the dirty
> > > > > rate with calc-time-ms set to the max allowed downtime, convert
> > > > > the measured rate into the volume of dirtied memory, and divide
> > > > > it by the network throughput. If the resulting transfer time is
> > > > > lower than the max allowed downtime, then migration will converge.
> > > > >
> > > > > Measurement results for a single thread randomly writing to
> > > > > a 24GiB region:
> > > > >
> > > > > +--------------+--------------------+
> > > > > | calc-time-ms | dirty-rate (MiB/s) |
> > > > > +--------------+--------------------+
> > > > > |          100 |               1880 |
> > > > > |          200 |               1340 |
> > > > > |          300 |               1120 |
> > > > > |          400 |               1030 |
> > > > > |          500 |                868 |
> > > > > |          750 |                720 |
> > > > > |         1000 |                636 |
> > > > > |         1500 |                498 |
> > > > > |         2000 |                423 |
> > > > > +--------------+--------------------+
> > > >
> > > > Do you mean the dirty workload is constant? Why does it differ so
> > > > much with different calc-time-ms?
> > >
> > > The workload is as constant as it could be. But the naming is
> > > misleading. What is named "dirty-rate" is in fact not a "rate" at all.
> > > calc-dirty-rate measures the number of *uniquely* dirtied pages, i.e.
> > > each page can contribute to the counter only once during the
> > > measurement period. That's why the values are decreasing.
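The unique-page counting described above can be reproduced with a toy simulation. The sketch below is illustrative Python, not QEMU code; the page count and write rate are made-up workload numbers:

```python
import random

# Toy model of what calc-dirty-rate counts: each page contributes to
# the counter only once per measurement window, so the reported
# "rate" (unique pages / window length) drops as the window grows.

def unique_dirty_rate(num_pages: int, writes_per_sec: int,
                      window_s: float, rng: random.Random) -> float:
    """Unique pages dirtied per second over one measurement window."""
    writes = int(writes_per_sec * window_s)
    # A set naturally deduplicates repeated writes to the same page.
    dirtied = {rng.randrange(num_pages) for _ in range(writes)}
    return len(dirtied) / window_s

rng = random.Random(42)
short = unique_dirty_rate(65536, 300_000, 0.1, rng)
long_ = unique_dirty_rate(65536, 300_000, 2.0, rng)
print(short > long_)  # longer window => lower apparent "rate"
```

With a constant write workload, the longer window saturates the pool of unique pages, which is exactly why the measured values in the table decrease as calc-time-ms grows.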
> > > Consider also the ad infinitum argument:
> > > since the VM has a fixed number of pages and each page can be dirtied
> > > only once, dirty-rate = number-of-dirtied-pages/calc-time -> 0 as
> > > calc-time -> inf. It would make more sense to report the number as
> > > "dirty-volume" -- without dividing it by calc-time.
> > >
> > > Note that the number of *uniquely* dirtied pages in a given amount of
> > > time is exactly what we need for making migration-related predictions.
> > > There is no error here.
> >
> > Is calc-time-ms the duration of the measurement?
> >
> > Taking the 1st line as an example, 1880MB/s * 0.1s = 188MB.
> > For the 2nd line, 1340MB/s * 0.2s = 268MB.
> > Even for the longest duration of 2s, that's 846MB in total.
> >
> > The range is 24GB. In this case, most of the pages should only be
> > written once, even with random access, for all these test durations,
> > right?
>
> Yes, I messed up the load generator.
> The effective memory region was much smaller than 24GiB.
> I performed more testing (after fixing the load generator),
> now with different memory sizes and different modes.
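The prediction recipe that started this thread (sample the dirty rate over a window equal to the max allowed downtime, then compare the transfer time of that volume against the downtime budget) can be sketched as below. The function name and the link speed are illustrative assumptions, not QEMU API:

```python
# Sketch of the convergence check described in the commit message.
# All names and numbers here are illustrative; this is not QEMU code.

def migration_will_converge(dirty_rate_mib_s: float,
                            calc_time_ms: int,
                            bandwidth_mib_s: float,
                            max_downtime_ms: int) -> bool:
    """Predict convergence from a dirty-rate sample taken over a
    window equal to the max allowed downtime."""
    # Volume of uniquely dirtied memory during the measurement window.
    dirty_volume_mib = dirty_rate_mib_s * (calc_time_ms / 1000.0)
    # Time needed to push that volume over the migration link.
    transfer_time_ms = dirty_volume_mib / bandwidth_mib_s * 1000.0
    return transfer_time_ms <= max_downtime_ms

# Example with the first measurement row: 1880 MiB/s sampled over a
# 100 ms window, on a hypothetical ~1192 MiB/s (10 Gbps) link.
print(migration_will_converge(1880, 100, 1192, 100))
```

In that example the 188 MiB dirtied during the window would need roughly 158 ms to transfer, which exceeds the 100 ms downtime budget, so the check predicts non-convergence.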
>
> +--------------+-----------------------------------------------+
> | calc-time-ms |               dirty rate MiB/s                |
> |              +----------------+---------------+--------------+
> |              |  theoretical   | page-sampling | dirty-bitmap |
> |              | (at 3M wr/sec) |               |              |
> +--------------+----------------+---------------+--------------+
> |                             1GiB                             |
> +--------------+----------------+---------------+--------------+
> |          100 |           6996 |          7100 |         3192 |
> |          200 |           4606 |          4660 |         2655 |
> |          300 |           3305 |          3280 |         2371 |
> |          400 |           2534 |          2525 |         2154 |
> |          500 |           2041 |          2044 |         1871 |
> |          750 |           1365 |          1341 |         1358 |
> |         1000 |           1024 |          1052 |         1025 |
> |         1500 |            683 |           678 |          684 |
> |         2000 |            512 |           507 |          513 |
> +--------------+----------------+---------------+--------------+
> |                             4GiB                             |
> +--------------+----------------+---------------+--------------+
> |          100 |          10232 |          8880 |         4070 |
> |          200 |           8954 |          8049 |         3195 |
> |          300 |           7889 |          7193 |         2881 |
> |          400 |           6996 |          6530 |         2700 |
> |          500 |           6245 |          5772 |         2312 |
> |          750 |           4829 |          4586 |         2465 |
> |         1000 |           3865 |          3780 |         2178 |
> |         1500 |           2694 |          2633 |         2004 |
> |         2000 |           2041 |          2031 |         1789 |
> +--------------+----------------+---------------+--------------+
> |                            24GiB                             |
> +--------------+----------------+---------------+--------------+
> |          100 |          11495 |          8640 |         5597 |
> |          200 |          11226 |          8616 |         3527 |
> |          300 |          10965 |          8386 |         2355 |
> |          400 |          10713 |          8370 |         2179 |
> |          500 |          10469 |          8196 |         2098 |
> |          750 |           9890 |          7885 |         2556 |
> |         1000 |           9354 |          7506 |         2084 |
> |         1500 |           8397 |          6944 |         2075 |
> |         2000 |           7574 |          6402 |         2062 |
> +--------------+----------------+---------------+--------------+
>
> Theoretical values are computed according to the following formula:
>
>     size * (1 - (1-(4096/size))^(time*wps)) / (time * 2^20),
Thanks for the additional tests and the statistics.

I had a feeling that this formula may or may not be accurate, but that's
less of an issue here.

> where size is in bytes, time is in seconds, and wps is the number of
> writes per second (I measured approximately 3000000 on my system).
>
> The theoretical values and the values obtained with page-sampling are
> reasonably close (within 25%). The dirty-bitmap values are much lower,
> likely because the majority of writes cause page faults. Even though
> the dirty-bitmap logic is closer to what happens during live migration,
> I still favor page sampling because the latter doesn't impact the
> performance of the VM too much.

Do you really use page sampling in production?

I don't remember mentioning it anywhere before, but it will provide a very
wrong number when the memory updates have locality, afaik. For example,
when a 4G VM has only 1G actively updated, the result can be 25% of
reality iiuc, seeing that the remaining 3G didn't change at all. It only
works well with very evenly distributed memory updates.

> Whether calc-time < 1 sec is meaningful or not depends on the size of
> the memory region with active writes.
>
> 1. If we have a big VM and writes are evenly spread over the whole
>    address space, then almost all writes will go to unique pages.
>    In this case the number of dirty pages will grow approximately
>    linearly with time for small calc-time values.
>
> 2. But if the memory region with active writes is small enough, then
>    many writes will go to the same page, and the number of dirty pages
>    will grow sublinearly even for small calc-time values. Note that the
>    second scenario can happen even if VM RAM is big. For example,
>    imagine a 128GiB VM with an in-memory database that is used for
>    reading. Although the VM size is big, the memory region with active
>    writes is just the application stack.

No issue here with supporting small calc-time. I think as long as it's
worthwhile in any single use case I'd be fine with it (rather than
requiring it to work for all use cases).
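As a quick cross-check, the theoretical formula quoted above is easy to evaluate directly. This is just a back-of-envelope sketch using the stated wps of about 3000000; it is not part of any QEMU tooling:

```python
# Evaluate the theoretical dirty-rate formula from the thread:
#   size * (1 - (1 - 4096/size)^(time*wps)) / (time * 2^20)
# size in bytes, time in seconds, wps = random writes per second.

def theoretical_dirty_rate(size_bytes: float, time_s: float,
                           wps: float = 3_000_000) -> float:
    """Expected unique-dirty volume per second, in MiB/s."""
    # Probability that a given page is untouched after time_s * wps
    # uniformly random writes.
    p_untouched = (1 - 4096 / size_bytes) ** (time_s * wps)
    return size_bytes * (1 - p_untouched) / (time_s * 2**20)

print(round(theoretical_dirty_rate(1 << 30, 0.1)))  # table above says 6996
print(round(theoretical_dirty_rate(1 << 30, 2.0)))  # table above says 512
```

The computed values land within about 1% of the "theoretical" column for the 1GiB case, which suggests the table was generated with roughly these parameters.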
Not a super high bar to maintain the change.

I copied Yong too; he just volunteered to look after the dirtyrate stuff.

Thanks,

--
Peter Xu