On Fri, Dec 03, 2021 at 09:39:46AM +0800, huang...@chinatelecom.cn wrote:
+static uint64_t dirtylimit_pct(unsigned int last_pct,
+ uint64_t quota,
+ uint64_t current)
+{
+ uint64_t limit_pct = 0;
+ RestrainPolicy policy;
+ bool mitigate = (quota > current) ? true : false;
+
+ if (mitigate && ((current == 0) ||
+ (last_pct <= DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE))) {
+ return 0;
+ }
+
+ policy = dirtylimit_policy(last_pct, quota, current);
+ switch (policy) {
+ case RESTRAIN_SLIGHT:
+ /* [90, 99] */
+ if (mitigate) {
+ limit_pct =
+ last_pct - DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE;
+ } else {
+ limit_pct =
+ last_pct + DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE;
+
+ limit_pct = MIN(limit_pct, CPU_THROTTLE_PCT_MAX);
+ }
+ break;
+ case RESTRAIN_HEAVY:
+ /* [75, 90) */
+ if (mitigate) {
+ limit_pct =
+ last_pct - DIRTYLIMIT_THROTTLE_HEAVY_STEP_SIZE;
+ } else {
+ limit_pct =
+ last_pct + DIRTYLIMIT_THROTTLE_HEAVY_STEP_SIZE;
+
+ limit_pct = MIN(limit_pct,
+ DIRTYLIMIT_THROTTLE_SLIGHT_WATERMARK);
+ }
+ break;
+ case RESTRAIN_RATIO:
+ /* [0, 75) */
+ if (mitigate) {
+ if (last_pct <= (((quota - current) * 100 / quota))) {
+ limit_pct = 0;
+ } else {
+ limit_pct = last_pct -
+ ((quota - current) * 100 / quota);
+ limit_pct = MAX(limit_pct, CPU_THROTTLE_PCT_MIN);
+ }
+ } else {
+ limit_pct = last_pct +
+ ((current - quota) * 100 / current);
+
+ limit_pct = MIN(limit_pct,
+ DIRTYLIMIT_THROTTLE_HEAVY_WATERMARK);
+ }
+ break;
+ case RESTRAIN_KEEP:
+ default:
+ limit_pct = last_pct;
+ break;
+ }
+
+ return limit_pct;
+}
+
+static void *dirtylimit_thread(void *opaque)
+{
+ int cpu_index = *(int *)opaque;
+ uint64_t quota_dirtyrate, current_dirtyrate;
+ unsigned int last_pct = 0;
+ unsigned int pct = 0;
+
+ rcu_register_thread();
+
+ quota_dirtyrate = dirtylimit_quota(cpu_index);
+ current_dirtyrate = dirtylimit_current(cpu_index);
+
+ pct = dirtylimit_init_pct(quota_dirtyrate, current_dirtyrate);
+
+ do {
+ trace_dirtylimit_impose(cpu_index,
+ quota_dirtyrate, current_dirtyrate, pct);
+
+ last_pct = pct;
+ if (pct == 0) {
+ sleep(DIRTYLIMIT_CALC_PERIOD_TIME_S);
+ } else {
+ dirtylimit_check(cpu_index, pct);
+ }
+
+ quota_dirtyrate = dirtylimit_quota(cpu_index);
+ current_dirtyrate = dirtylimit_current(cpu_index);
+
+ pct = dirtylimit_pct(last_pct, quota_dirtyrate,
+ current_dirtyrate);
So what I had in mind is that we can start with an extremely simple version
of a negative feedback system. Say, first, each vcpu has a single number
telling it how long to sleep each time its dirty ring gets full (this is
ugly code, but it just shows what I meant):
===============
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index eecd8031cf..c320fd190f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2932,6 +2932,8 @@ int kvm_cpu_exec(CPUState *cpu)
trace_kvm_dirty_ring_full(cpu->cpu_index);
qemu_mutex_lock_iothread();
kvm_dirty_ring_reap(kvm_state);
+ if (dirtylimit_enabled(cpu->cpu_index) && cpu->throttle_us_per_full)
+ usleep(cpu->throttle_us_per_full);
qemu_mutex_unlock_iothread();
ret = 0;
break;
===============
I think this will give finer granularity when throttling (for a 4096 ring
size, that's one throttle operation per 16MB dirtied) than the current way,
where we inject a per-vcpu async task to sleep, like auto-converge.
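To spell out the 16MB figure, here is a small sketch of the arithmetic. It
assumes each dirty ring entry records one dirty page and that the page size
is 4KB (both are assumptions for illustration), and the helper names are
made up. The second helper gives a rough upper bound on the dirty rate for a
given sleep value, ignoring the time the vcpu takes to fill the ring:

```c
#include <stdint.h>

/*
 * Bytes of guest memory dirtied between two ring-full exits,
 * assuming one ring entry per dirty page (illustrative helper).
 */
static uint64_t bytes_per_ring_full(uint64_t ring_entries, uint64_t page_size)
{
    return ring_entries * page_size;
}

/*
 * Rough upper bound (in MB/s) on the dirty rate when the vcpu sleeps
 * throttle_us after every full ring.  The time spent filling the ring
 * is ignored, so the real rate is lower.  A sleep of 0 means no cap.
 */
static uint64_t dirty_rate_cap_mbps(uint64_t ring_bytes, uint64_t throttle_us)
{
    if (throttle_us == 0) {
        return UINT64_MAX;
    }
    return ring_bytes * 1000000ULL / throttle_us / (1024 * 1024);
}
```

With these assumptions a 4096-entry ring covers 16MB, and a sleep of 80000us
per full ring bounds the vcpu at roughly 200MB/s.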
Then we have the "black box" to tune this value, with the input/output below:
- Input: dirty rate information, same as the current algo
- Output: increase/decrease of the per-vcpu throttle_us_per_full above,
  and that's all
We can do the sampling once per second and keep doing it: we can have one
thread doing a per-second task that collects dirty rate information for all
the vcpus, then tunes throttle_us_per_full for each of them.
The simplest linear algorithm would be as simple as (for each vcpu):
    if (quota < current) {
        throttle_us_per_full += SOMETHING;
        if (throttle_us_per_full > MAX)
            throttle_us_per_full = MAX;
    } else {
        throttle_us_per_full -= SOMETHING;
        if (throttle_us_per_full < 0)
            throttle_us_per_full = 0;
    }
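
As a compilable version of the linear step, a minimal sketch could look like
the following. The step and maximum values are placeholders, not tuned
numbers, and the function name is made up. The value is kept signed so the
"< 0" clamp works:

```c
#include <stdint.h>

#define THROTTLE_STEP_US 1000      /* placeholder for SOMETHING */
#define THROTTLE_MAX_US  1000000   /* placeholder for MAX */

/*
 * One per-second tuning step for a single vcpu.
 * quota and current are dirty rates in the same unit (e.g. MB/s).
 * Returns the new sleep time per dirty-ring-full event, in microseconds.
 */
static int64_t throttle_adjust_linear(int64_t throttle_us,
                                      uint64_t quota, uint64_t current)
{
    if (quota < current) {
        /* Dirtying too fast: sleep longer, up to the maximum. */
        throttle_us += THROTTLE_STEP_US;
        if (throttle_us > THROTTLE_MAX_US) {
            throttle_us = THROTTLE_MAX_US;
        }
    } else {
        /* At or under quota: back off, never below zero. */
        throttle_us -= THROTTLE_STEP_US;
        if (throttle_us < 0) {
            throttle_us = 0;
        }
    }
    return throttle_us;
}
```

The sampling thread would call this once per second for each vcpu and store
the result back into that vcpu's throttle_us_per_full.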
I think your algorithm is fine, but thoroughly reviewing every single bit
of it in one shot will be challenging, and it's also hard to prove that
every bit of the algorithm is helpful, as there are a lot of hand-made
macros and state changes.
I actually tested your current algorithm: the dirty rate fluctuates a bit
(when I specified 200MB/s, it can go to either a few tens of MB/s or
300MB/s, normally less), and it doesn't respond fast either (the initial
throttle from 500MB/s -> 200MB/s needs 1 minute or so), so it doesn't seem
ideal anyway. In that case I'd prefer we start simple.
So IMHO we can start with this simple scheme first; it'll start working
with far fewer lines of code, afaict. With that scheme ready in the 1st or
initial patches, it'll be easier to apply any better algorithm on top
(e.g. your current one, if you're confident in it) or other things, and
it'll be much easier to review too if you could consider splitting your
patches like that.
Normally, per my knowledge of what migration needs, we could consider
adding an integral term to the linear algorithm I described above, and
that should already help us reach a very stable and constant state of
throttling. But we'll need to try it out, as I never tried.
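
To make the integral idea concrete, here is an untested sketch of one way
it could look. The sleep time is driven by the accumulated error between
the measured dirty rate and the quota instead of a fixed step. The gain,
the maximum, the struct and the function name are all made up for
illustration:

```c
#include <stdint.h>

#define THROTTLE_MAX_US 1000000   /* placeholder maximum sleep */
#define KI_US_PER_MB    10        /* placeholder integral gain */

/* Per-vcpu controller state: running sum of (current - quota) in MB. */
typedef struct {
    int64_t integral_mb;
} ThrottleState;

/*
 * One per-second step of a pure integral controller.
 * Rates are in MB/s; returns the sleep per dirty-ring-full event in us.
 */
static int64_t throttle_adjust_integral(ThrottleState *s,
                                        int64_t quota, int64_t current)
{
    int64_t throttle_us;

    /* Accumulate the error; positive while the vcpu is over quota. */
    s->integral_mb += current - quota;
    if (s->integral_mb < 0) {
        s->integral_mb = 0;               /* no "credit" below zero */
    }

    throttle_us = s->integral_mb * KI_US_PER_MB;
    if (throttle_us > THROTTLE_MAX_US) {
        /* Clamp the sum as well, so it cannot wind up past the maximum. */
        throttle_us = THROTTLE_MAX_US;
        s->integral_mb = THROTTLE_MAX_US / KI_US_PER_MB;
    }
    return throttle_us;
}
```

Once the measured rate equals the quota the error is zero and the sleep
value stops moving, which is the stable state we'd be aiming for. Whether
this behaves well in practice still needs to be tried.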
What do you think?