On Tue, Oct 1, 2024 at 11:37 PM Peter Xu <pet...@redhat.com> wrote:

> On Tue, Oct 01, 2024 at 10:18:54AM +0800, Yong Huang wrote:
> > On Tue, Oct 1, 2024 at 4:47 AM Peter Xu <pet...@redhat.com> wrote:
> >
> > > On Mon, Sep 30, 2024 at 01:14:28AM +0800, yong.hu...@smartx.com wrote:
> > > > From: Hyman Huang <yong.hu...@smartx.com>
> > > >
> > > > Currently, the convergence algorithm determines that the migration
> > > > cannot converge according to the following principle:
> > > > the dirty pages generated in the current iteration exceed a specific
> > > > percentage (throttle-trigger-threshold, 50 by default) of the bytes
> > > > transferred. Let's refer to this criterion as the "dirty rate".
> > > > If this criterion is met at least twice
> > > > (dirty_rate_high_cnt >= 2), the throttle percentage is increased.
> > > >
> > > > In most cases, the above implementation is appropriate. However, for a
> > > > VM under high memory load, each iteration is time-consuming.
> > > > The VM's computing performance may be throttled at a high percentage
> > > > for a long time due to the repeated confirmation behavior,
> > > > which may be intolerable for some computationally sensitive software
> > > > in the VM.
> > > >
> > > > As the comment in the migration_trigger_throttle function notes,
> > > > the original algorithm confirms the criterion repeatedly in order
> > > > to avoid erroneous detection. Put differently, the criterion does not
> > > > need to be validated again once the detection is more reliable.
> > > >
> > > > In the refinement, in order to make the detection more accurate, we
> > > > introduce another criterion, called the "dirty ratio", to determine
> > > > migration convergence. The "dirty ratio" is the ratio of
> > > > bytes_xfer_period to bytes_dirty_period. When the algorithm
> > > > repeatedly detects that the "dirty ratio" of the current sync is lower
> > > > than that of the previous one, it determines that the migration cannot
> > > > converge. If either of the two criteria ("dirty rate" or "dirty
> > > > ratio") is met, the penalty percentage is increased. This
> > > > makes CPU throttling more responsive, which shortens the
> > > > entire iteration and therefore reduces the duration of VM performance
> > > > degradation.
> > > >
> > > > In conclusion, this refinement significantly reduces the time
> > > > required for the throttle percentage to step up to its maximum while
> > > > the VM is under a high memory load.
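
The two criteria described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the code from the patch or
from QEMU: the struct, the function name should_raise_throttle, and the limit
of two consecutive ratio drops are all hypothetical. Only bytes_xfer_period,
bytes_dirty_period, dirty_rate_high_cnt and the threshold percentage come from
the description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: "repeatedly" is taken to mean two consecutive drops. */
#define RATIO_DROP_LIMIT 2
/* From the description: the "dirty rate" criterion must be met twice. */
#define DIRTY_RATE_HIGH_LIMIT 2

/* Hypothetical per-migration state; not QEMU's actual data structure. */
typedef struct {
    unsigned dirty_rate_high_cnt; /* times the "dirty rate" criterion was met */
    unsigned ratio_drop_cnt;      /* consecutive drops of the "dirty ratio" */
    double prev_dirty_ratio;      /* ratio observed at the previous sync */
    bool have_prev;               /* false until the first ratio is recorded */
} ConvergeState;

/* Returns true when the throttle percentage should be raised after a sync. */
static bool should_raise_throttle(ConvergeState *s,
                                  uint64_t bytes_xfer_period,
                                  uint64_t bytes_dirty_period,
                                  unsigned threshold_pct)
{
    bool raise = false;

    /* "Dirty rate": dirtied bytes exceed threshold_pct% of transferred bytes. */
    uint64_t bytes_dirty_threshold = bytes_xfer_period * threshold_pct / 100;
    if (bytes_dirty_period > bytes_dirty_threshold &&
        ++s->dirty_rate_high_cnt >= DIRTY_RATE_HIGH_LIMIT) {
        s->dirty_rate_high_cnt = 0;
        raise = true;
    }

    /* "Dirty ratio": transferred/dirtied keeps falling from sync to sync. */
    if (bytes_dirty_period > 0) {
        double ratio = (double)bytes_xfer_period / (double)bytes_dirty_period;

        if (s->have_prev && ratio < s->prev_dirty_ratio) {
            if (++s->ratio_drop_cnt >= RATIO_DROP_LIMIT) {
                s->ratio_drop_cnt = 0;
                raise = true;
            }
        } else {
            /* The ratio did not drop, so the streak is broken. */
            s->ratio_drop_cnt = 0;
        }
        s->prev_dirty_ratio = ratio;
        s->have_prev = true;
    }

    return raise;
}
```

In this sketch, a guest that dirties 400, 450 and then 480 bytes per 1000
bytes transferred never crosses the 50% "dirty rate" threshold, yet the
falling ratio alone triggers a throttle increase on the third sync. That is
the kind of earlier reaction the "dirty ratio" criterion is meant to give.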
> > >
> > > I'm a bit lost on why patches 2-3 are still needed if patch 1 works.
> > > Wouldn't that already greatly increase the chance of the throttle code
> > > being invoked?  Why do we still need this?
> > >
> >
> > Indeed, if we are considering how to increase the chance of throttling,
> > patch 1 is sufficient, and I'm not insisting.
> >
> > If we are talking about how to detect the migration convergence, this
> > patch, IMHO, is still helpful. Anyway, it depends on your judgment. :)
>
> Thanks.  I really hope we can stick with patch 1 only for now, and leave
> patches like 2-3 for the future, or probably never.
>
> I want to avoid more magical tunables, and I want to avoid making the code
> harder to read.  Unlike most other migration features, auto converge so far
> is already pretty heavy on the "engineering" aspect of things.  More people
> care about downtime of 100ms or even less, so it makes zero sense that a
> throttle feature can easily stop a group of vCPUs for more than that.
>
> I hope we can unite more dev/qe resources on postcopy across the QEMU
> community for enterprise users.  PoCs are always good stuff for QEMU as it's
> a community project and people experiment with things on it, but I hope they
> are at least at the design level, not small tunables like this one.  We could
> have introduced 10 more tunables all over, fed them to AI and trained some
> numbers so that migration improves 10%, but IMHO that doesn't hugely help.
>
> If you really care about convergence issues, I want to know whether you
> agree on postcopy being a better way to go.  There're still plenty of
>

Agreed, postcopy deserves more attention with respect to improving
huge VM migration.


> things we can do better in that area, on either postcopy in general or
> downtime optimizations that lots of people are working on (e.g. VFIO's), so
> again IMHO it'll be good if we keep focused there.
>
> Thanks,
>
> --
> Peter Xu
>
>
Thanks for sharing your thoughts. I'll drop these 2 patches in the next version.

Yong

-- 
Best regards
