Well, the issue is not new. Anyhow, following a conversation with Orit ...
Since we want the migration to finish, I believe that the "migration
speed" parameter alone cannot do the job.
I suggest using two distinct parameters:
1. Migration speed - will be used to limit network resource utilization
2. aggressionLevel - A number between 0.0 and 1.0, where low values
imply minimal interruption to the guest, and 1.0 means that the guest
will be completely stalled.
In any case the migration will have to do its work and finish given any
actual migration-speed, so even low aggressionLevel values will
sometimes imply that the guest will be throttled substantially.
The algorithm:
The aggressionLevel should determine the targetGuest%CPU (how much CPU
time we want to allocate to the guest).
With aggressionLevel = 1.0, the guest gets no CPU-resources (stalled).
With aggressionLevel = 0.0, the guest gets minGuest%CPU, such that
migrationRate == dirtyPagesRate. This minGuest%CPU is continuously
updated based on the running average of the recent samples (more below).
Note that the targetGuest%CPU allocation is continuously updated due to
changes in guest behavior, network congestion, and the like.
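
To make the two endpoints concrete, here is a minimal Python sketch of the
mapping from aggressionLevel to targetGuest%CPU. Only the endpoints come from
the description above (0.0 gives minGuest%CPU, 1.0 stalls the guest); the
linear interpolation in between and the name target_guest_cpu are assumptions
made for illustration, not a settled design.

```python
def target_guest_cpu(aggression_level, min_guest_cpu):
    """Return the guest CPU share (in percent) for a given aggression level.

    aggression_level: 0.0 (gentle) .. 1.0 (guest fully stalled).
    min_guest_cpu:    current estimate of the CPU share at which
                      migrationRate == dirtyPagesRate.

    ASSUMPTION: values between the two endpoints are linearly interpolated.
    """
    if not 0.0 <= aggression_level <= 1.0:
        raise ValueError("aggression_level must be between 0.0 and 1.0")
    return (1.0 - aggression_level) * min_guest_cpu


# With an estimated minGuest%CPU of 40%:
for level in (0.0, 0.5, 1.0):
    print(level, target_guest_cpu(level, 40.0))
```

Since minGuest%CPU is itself re-estimated all the time, the function would
have to be re-evaluated on every sample rather than once at migration start.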
Some more details:
- minGuest%CPU (i.e., for dirtyPagesRate == migrationRate) is easy to
calculate as a running average of
(migrationRate / dirtyPagesRate * guest%CPU)
- There are several methods to calculate the running average; my
favorite is an IIR filter, where, roughly speaking,
newVal = 0.99 * oldVal + 0.01 * newSample
- I would use two measures to ensure that there are more migrated pages
than "dirty" pages.
1. The running average (based on recent samples) of the migrated
pages is larger than that of the new dirty pages
2. The total number of migrated pages so far is larger than the total
number of new dirty pages.
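
The details above can be sketched in Python roughly as follows. The 0.99/0.01
weights and the two "ahead" conditions come from the text; all class and
function names, seeding the average with the first sample, the cap at 100%,
and the handling of a zero dirty rate are assumptions made for illustration.

```python
class RunningAverage:
    """First-order IIR average: newVal = 0.99 * oldVal + 0.01 * newSample."""

    def __init__(self, weight=0.01):
        self.weight = weight
        self.value = None  # no samples seen yet

    def update(self, sample):
        if self.value is None:
            # ASSUMPTION: seed the filter with the first sample.
            self.value = float(sample)
        else:
            self.value = (1.0 - self.weight) * self.value + self.weight * sample
        return self.value


def min_guest_cpu_sample(migration_rate, dirty_pages_rate, guest_cpu):
    """One sample of (migrationRate / dirtyPagesRate * guest%CPU)."""
    if dirty_pages_rate <= 0:
        # ASSUMPTION: a guest that dirties nothing needs no throttling.
        return 100.0
    # ASSUMPTION: cap at 100%, since the guest cannot get more than that.
    return min(100.0, migration_rate / dirty_pages_rate * guest_cpu)


class MigrationProgress:
    """Tracks both measures: recent running averages and totals so far."""

    def __init__(self, weight=0.01):
        self.avg_migrated = RunningAverage(weight)
        self.avg_dirtied = RunningAverage(weight)
        self.total_migrated = 0
        self.total_dirtied = 0

    def record(self, migrated_pages, dirtied_pages):
        self.avg_migrated.update(migrated_pages)
        self.avg_dirtied.update(dirtied_pages)
        self.total_migrated += migrated_pages
        self.total_dirtied += dirtied_pages

    def is_ahead(self):
        """True only if both measure 1 and measure 2 hold."""
        if self.avg_migrated.value is None:
            return False  # nothing recorded yet
        return (self.avg_migrated.value > self.avg_dirtied.value
                and self.total_migrated > self.total_dirtied)
```

Each min_guest_cpu_sample() result would be fed into its own RunningAverage
to obtain the continuously updated minGuest%CPU. Note that with a weight of
0.01 the average reacts slowly, so a burst of dirty pages can fail the totals
check well before it shows up in the running averages.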
And yes, many details are still missing.
Ronen.