Hi, Andrei,

On Mon, Jul 31, 2023 at 05:51:49PM +0300, gudkov.and...@huawei.com wrote:
> On Mon, Jul 17, 2023 at 03:08:37PM -0400, Peter Xu wrote:
> > On Tue, Jul 11, 2023 at 03:38:18PM +0300, gudkov.and...@huawei.com wrote:
> > > On Thu, Jul 06, 2023 at 03:23:43PM -0400, Peter Xu wrote:
> > > > On Thu, Jun 29, 2023 at 11:59:03AM +0300, Andrei Gudkov wrote:
> > > > > Introduce the alternative argument calc-time-ms, which is the
> > > > > same as calc-time but accepts a value in milliseconds.
> > > > > Millisecond precision makes it possible to predict whether
> > > > > migration will succeed or not: calculate the dirty rate with
> > > > > calc-time-ms set to the maximum allowed downtime, convert the
> > > > > measured rate into a volume of dirtied memory, and divide it by
> > > > > the network throughput. If the resulting time is lower than the
> > > > > maximum allowed downtime, then migration will converge.
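
A minimal sketch of that convergence check (the 300 ms downtime limit,
10 Gbit/s link, and reported rate below are made-up values, not taken
from the patch):

    # Hypothetical inputs: none of these values come from the patch.
    downtime_s = 0.300                  # assumed max allowed downtime
    link_bytes_per_s = 10e9 / 8         # assumed 10 Gbit/s migration link
    dirty_rate_mib_s = 800              # assumed calc-dirty-rate result
                                        # with calc-time-ms = 300

    # Volume dirtied in one downtime window and the time to resend it.
    dirty_volume = dirty_rate_mib_s * 2**20 * downtime_s
    transfer_s = dirty_volume / link_bytes_per_s

    print("converges" if transfer_s < downtime_s else "does not converge")
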
> > > > > 
> > > > > Measurement results for a single thread randomly writing to
> > > > > a 24GiB region:
> > > > > +--------------+--------------------+
> > > > > | calc-time-ms | dirty-rate (MiB/s) |
> > > > > +--------------+--------------------+
> > > > > |          100 |               1880 |
> > > > > |          200 |               1340 |
> > > > > |          300 |               1120 |
> > > > > |          400 |               1030 |
> > > > > |          500 |                868 |
> > > > > |          750 |                720 |
> > > > > |         1000 |                636 |
> > > > > |         1500 |                498 |
> > > > > |         2000 |                423 |
> > > > > +--------------+--------------------+
> > > > 
> > > > Do you mean the dirty workload is constant?  Why does it differ so much
> > > > with different calc-time-ms?
> > > 
> > > The workload is as constant as it could be, but the naming is misleading.
> > > What is called "dirty-rate" is in fact not a "rate" at all.
> > > calc-dirty-rate measures the number of *uniquely* dirtied pages, i.e. each
> > > page can contribute to the counter only once during the measurement period.
> > > That's why the values are decreasing. Consider also the limiting case:
> > > since the VM has a fixed number of pages and each page is counted only once,
> > > dirty-rate = number-of-dirtied-pages / calc-time -> 0 as calc-time -> inf.
> > > It would make more sense to report the number as "dirty-volume",
> > > without dividing it by calc-time.
> > > 
> > > Note that the number of *uniquely* dirtied pages in a given amount of time
> > > is exactly what we need for migration-related predictions. There is
> > > no error here.
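
As a toy illustration (numbers made up): a constant workload that keeps
rewriting the same 1 GiB saturates the uniquely-dirtied counter, so
dividing by calc-time makes the reported "rate" shrink even though the
workload never changes:

    # Toy numbers: the hot set is 1 GiB and is covered quickly, so the
    # count of uniquely dirtied bytes stops growing while calc-time keeps
    # increasing, and unique-bytes / calc-time falls toward zero.
    unique_bytes = 1 << 30              # saturated uniquely-dirtied volume
    for calc_time_s in (1, 2, 4, 8):
        rate_mib_s = unique_bytes / calc_time_s / 2**20
        print(f"calc-time = {calc_time_s} s  ->  {rate_mib_s:.0f} MiB/s")
    # prints 1024, 512, 256, 128 MiB/s for the exact same workload
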
> > 
> > Is calc-time-ms the duration of the measurement?
> > 
> > Taking the 1st line as an example, 1880MB/s * 0.1s = 188MB.
> > For the 2nd line, 1340MB/s * 0.2s = 268MB.
> > Even for the longest duration of 2s, that's 846MB in total.
> > 
> > The range is 24GB.  In this case, most of the pages should only be written
> > once, even with random writes, for all these test durations, right?
> > 
> 
> Yes, I messed up the load generator:
> the effective memory region was much smaller than 24GiB.
> I performed more testing (after fixing the load generator),
> now with different memory sizes and different modes.
> 
> +--------------+-----------------------------------------------+
> | calc-time-ms |                dirty rate MiB/s               |
> |              +----------------+---------------+--------------+
> |              | theoretical    | page-sampling | dirty-bitmap |
> |              | (at 3M wr/sec) |               |              |
> +--------------+----------------+---------------+--------------+
> |                             1GiB                             |
> +--------------+----------------+---------------+--------------+
> |          100 |           6996 |          7100 |         3192 |
> |          200 |           4606 |          4660 |         2655 |
> |          300 |           3305 |          3280 |         2371 |
> |          400 |           2534 |          2525 |         2154 |
> |          500 |           2041 |          2044 |         1871 |
> |          750 |           1365 |          1341 |         1358 |
> |         1000 |           1024 |          1052 |         1025 |
> |         1500 |            683 |           678 |          684 |
> |         2000 |            512 |           507 |          513 |
> +--------------+----------------+---------------+--------------+
> |                             4GiB                             |
> +--------------+----------------+---------------+--------------+
> |          100 |          10232 |          8880 |         4070 |
> |          200 |           8954 |          8049 |         3195 |
> |          300 |           7889 |          7193 |         2881 |
> |          400 |           6996 |          6530 |         2700 |
> |          500 |           6245 |          5772 |         2312 |
> |          750 |           4829 |          4586 |         2465 |
> |         1000 |           3865 |          3780 |         2178 |
> |         1500 |           2694 |          2633 |         2004 |
> |         2000 |           2041 |          2031 |         1789 |
> +--------------+----------------+---------------+--------------+
> |                             24GiB                            |
> +--------------+----------------+---------------+--------------+
> |          100 |          11495 |          8640 |         5597 |
> |          200 |          11226 |          8616 |         3527 |
> |          300 |          10965 |          8386 |         2355 |
> |          400 |          10713 |          8370 |         2179 |
> |          500 |          10469 |          8196 |         2098 |
> |          750 |           9890 |          7885 |         2556 |
> |         1000 |           9354 |          7506 |         2084 |
> |         1500 |           8397 |          6944 |         2075 |
> |         2000 |           7574 |          6402 |         2062 |
> +--------------+----------------+---------------+--------------+
> 
> Theoretical values are computed according to the following formula:
> size * (1 - (1-(4096/size))^(time*wps)) / (time * 2^20),

Thanks for the additional testing and the statistics.

I have a feeling that this formula may not be entirely accurate, but that's
less of an issue here.

> where size is in bytes, time is in seconds, and wps is number of
> writes per second (I measured approximately 3000000 on my system).
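
For reference, a small sketch of that formula in Python (wps = 3000000 is
the value measured above; the derivation assumes uniformly random 4 KiB
page writes):

    # Expected uniquely dirtied volume after time*wps uniform random writes:
    # a given page is missed by one write with probability 1 - 4096/size,
    # so size * (1 - (1 - 4096/size)**(time*wps)) bytes get dirtied.
    def theoretical_rate_mib_s(size_bytes, time_s, wps=3_000_000):
        dirtied = size_bytes * (1 - (1 - 4096 / size_bytes) ** (time_s * wps))
        return dirtied / (time_s * 2**20)

    for ms in (100, 500, 2000):
        print(ms, "ms ->",
              round(theoretical_rate_mib_s(24 << 30, ms / 1000)), "MiB/s")
    # roughly matches the "theoretical" column of the 24 GiB table above
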
> 
> Theoretical values and the values obtained with page-sampling agree
> reasonably well (within 25%). Dirty-bitmap values are much lower,
> likely because the majority of writes cause page faults. Even though
> the dirty-bitmap logic is closer to what happens during live
> migration, I still favor page sampling because it doesn't impact the
> performance of the VM as much.

Do you really use page sampling in production?  I don't remember whether I
mentioned it anywhere before, but it will provide a very wrong number when
the memory updates have locality, afaik.  For example, when a 4G VM only has
1G actively updated, the result can be 25% of reality iiuc, seeing that the
rest 3G didn't even change.  It only works well with very distributed
memory updates.

> 
> Whether calc-time < 1 sec is meaningful or not depends on the size
> of the memory region with active writes.
> 1. If we have a big VM and writes are evenly spread over the whole
>    address space, then almost all writes will go to unique pages.
>    In this case the number of dirty pages will grow approximately
>    linearly with time for small calc-time values.
> 2. But if the memory region with active writes is small enough, then many
>    writes will go to the same page, and the number of dirty pages
>    will grow sublinearly even for small calc-time values. Note that
>    the second scenario can happen even when VM RAM is big. For example,
>    imagine a 128GiB VM with an in-memory database that is only read from.
>    Although the VM size is big, the memory region with active writes is
>    just the application stack (see the sketch below).
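
To illustrate the two scenarios with the same uniform-random-write model
as the formula above (the 128 GiB and 64 MiB region sizes here are just
illustrative assumptions):

    # Uniquely dirtied volume (MiB) after calc-time of uniform random writes.
    def unique_mib(size_bytes, time_s, wps=3_000_000):
        hit_all = (1 - 4096 / size_bytes) ** (time_s * wps)
        return size_bytes * (1 - hit_all) / 2**20

    for ms in (100, 200, 400, 800):
        spread = unique_mib(128 << 30, ms / 1000)  # writes over all 128 GiB
        local = unique_mib(64 << 20, ms / 1000)    # writes confined to 64 MiB
        print(f"{ms:4} ms   spread: {spread:7.0f} MiB   localized: {local:5.0f} MiB")
    # the spread case grows roughly linearly with calc-time, while the
    # localized case saturates at 64 MiB almost immediately
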

No issue here with supporting small calc-time.  As long as it's worthwhile in
some use case I'd be fine with it (rather than requiring it to work for all
use cases).  It's not a super high bar to maintain the change.

I copied Yong too; he just volunteered to look after the dirtyrate stuff.

Thanks,

-- 
Peter Xu

