>From: Darrick J. Wong [mailto:[EMAIL PROTECTED] 
>Sent: July 18, 2008 3:05
>
>If there are multiple VMs that are busy, the busy ones will fight among
>themselves for CPU time.  I still see some priority boost, just not as
>much.

Some micro-level analysis would be useful here.

>
>I wonder how stable the virtual tsc is...?  Will have to study this.

My point is that exposing virtual freq states doesn't change whether
the virtual tsc is stable, since the interception logic for virtual
freq change requests only affects 'nice'. That's expected behavior.

Whether a virtual tsc is stable is then a separate issue, outside the
scope of this feature.

>
>IDA has the same problem... the T61 BIOS "compensates" for this fakery
>by reporting a frequency of $max_freq + 1 so if you're smart 
>then you'll
>somehow know that you might see a boost that you can't measure. :P

It can be measured; one necessary requirement pushed on any
hardware coordination logic is to provide some type of feedback
mechanism. For example, Intel processors provide the APERF/MPERF
pair, where MPERF increments in proportion to a fixed boot frequency
while APERF increments in proportion to actual performance.
Software should use the APERF/MPERF ratio to determine the actual
freq over the elapsed sampling period.

>
>I suppose the problem here is that p-states were designed on the
>assumption that you're directly manipulating hardware speeds, whereas
>what we really want in both this patch and IDA are qualitative values
>("medium speed", "highest speed", "ludicrous speed?")

It's still a bit different. 

For IDA, when ludicrous speed is requested, it may be granted.
Even when it's not, though, the actual freq will still be the highest
speed and never lower.

For this feature, however, how many CPU cycles are granted is not
decided by the 'nice' value alone; it also depends on the number of
active vcpus on a given pcpu at a given time. Whatever speed is
requested, whether medium, highest or ludicrous, the granted cycles
can always vary from some minimum (many vcpus contending) to 100%
(only the current vcpu is active).

>On the other hand, if you get the same performance  at both 
>high and low
>speeds, then it doesn't really matter which one you choose.  At least
>not until the load changes.  I suppose the next question is, how much
>software is dependent on knowing the exact CPU frequency, and are
>workload schedulers smart enough to realize that performance
>characteristics can change over time (throttling, TM1/TM2, etc)?
>Inasmuch as you actually ever know, since with hardware coordination of
>cpufreq the hardware can do whatever it wants.

Throttling and TM1/TM2 are thermal-related, kicking in when some
threshold is reached. Here let's focus on DBS (Demand Based
Switching), which is actively conducted by OSPM based on workload
estimation. A typical freq demotion flow looks like this:

        if ((PercentBusy * Pc / Pn) < threshold)
                switch from Pc to Pn;

Here PercentBusy represents CPU utilization over the elapsed
sampling period, Pc stands for the freq used in that period, and Pn
is the candidate lower freq to change to. If the freq change can
still keep CPU utilization under the predefined threshold, the
transition is viable.

The key point is PercentBusy and Pc, which may make the final
decision pointless if they are inaccurate. That's why the hardware
coordination logic is required to provide some feedback to obtain
an accurate Pc.

I agree that the guest should eventually be able to catch up if a
wrong decision leaves its workload restrained or over-granted. E.g.
when there's only one vcpu active on a pcpu and it requests a
medium speed, 100% of cycles are granted, making it think that
medium speed is enough for its current workload. Later, when other
vcpus become active on the same pcpu, its granted cycles shrink;
it may then realize that medium speed is not enough and request
the highest speed, which may add back some cycles through a
lower nice value.

But it's better to do some micro-level analysis to understand
whether this works as expected and, more importantly, how fast
this catch-up can be. Note that the guest checks for freq changes
at roughly a 20ms granularity, and we then need to make sure no
thrashing is caused that would mess up both guest and host.

Another concern that just came up is the measurement of
PercentBusy. Take Linux for example: it normally subtracts idle
time from elapsed time. If the guest doesn't account for steal
time, PercentBusy may not reflect reality at all. For example,
say a vcpu is continuously busy for 10ms, then happens to enter
the idle loop and is scheduled out for 20ms. The next time it is
scheduled in, its dbs timer will compute PercentBusy as 33.3%
even though the vcpu was actually fully busy. (When the vcpu is
scheduled out outside of the idle loop, that steal time does get
counted as busy time.) I'm still not clear how this may affect
this patch...

>
>I don't think it's easily deduced.  I also don't think 
>APERF/MPERF are emulated in
>kvm yet.  I suppose it wouldn't be difficult to add those two, though
>measuring that might be a bit messy.
>
>Maybe the cheap workaround for now is to report the CPU speeds in the
>table as n-1, n, n+1.

Yes, we may report at least at a qualitative level to see the effect.

>
>I'll run some benchmarks and see what happens over the next week.
>

Thanks for your work.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html