On Wednesday, August 03, 2016 11:53:23 PM Doug Smythies wrote: > On 2016.08.03 21:19 Doug Smythies wrote: > > Re-sending without the previously attached graph. > > Hi Rafael, > > Hope this feedback and test results help. > > On 2016.07.31 16:49 Rafael J. Wysocki wrote: > > > The PID-base P-state selection algorithm used by intel_pstate for > > Core processors is based on very weak foundations. > > Agree, very very much. > > ...[cut]... > > > Consequently, the only viable way to fix that is to replace the > > erroneous algorithm entirely with a better one. > > Agree, very very much. > > > To that end, notice that setting the P-state proportional to the > > actual CPU utilization (measured with the help of MPERF and TSC) > > generally leads to reasonable behavior, but it does not reflect > > the "performance boosting" nature of the current P-state > > selection algorithm. It may be made more similar to that > > algorithm, though, by adding iowait boosting to it. > > Which is good and does help a lot for the IO case, but issues > remain for the compute case. > > ...[cut]... > > > +static inline int32_t get_target_pstate_default(struct cpudata *cpu) > > +{ > > + struct sample *sample = &cpu->sample; > > + int32_t busy_frac; > > + int pstate; > > + > > + busy_frac = div_fp(sample->mperf, sample->tsc); > > + sample->busy_scaled = busy_frac * 100; > > + > > + if (busy_frac < cpu->iowait_boost) > > + busy_frac = cpu->iowait_boost; > > + > > + cpu->iowait_boost >>= 1; > > + > > + pstate = cpu->pstate.turbo_pstate; > > + return fp_toint((pstate + (pstate >> 2)) * busy_frac); > > +} > > + > > The response curve is not normalized on the lower end to the minimum > pstate for the processor, meaning the overall response will vary > between processors as a function of minimum pstate.
But that's OK IMO. Mapping busy_frac = 0 to the minimum P-state would over-provision workloads with small values of busy_frac. > The clamping at maximum pstate at about 80% load seems at little high > to me. Work I have done in various attempts to bring back the use of actual > load > has always ended up achieving maximum pstate before 80% load for best results. > Even the get_target_pstate_cpu_load people reach the max pstate faster, and > they > are more about energy than performance. > What was the criteria for the decision here? Are test results available for > review > and/or duplication by others? This follows the coefficient used by the schedutil governor, but then the metric is different, so quite possibly a different value may work better here. We'll test other values before applying this for sure. :-) > > Several tests were done with this patch set. > The patch set would not apply to kernel 4.7, but did apply fine to a 4.7+ > kernel > (I did as of 7a66ecf) from a few days ago. > > Test 1: Phoronix ffmpeg test (less time is better): > Reason: Because it suffers from rotating amongst CPUs in an odd way, > challenging for CPU frequency scaling drivers. > This test tends to be an indicator of potential troubles with some games. > Criteria: (Dirk Brandewie): Must match or better acpi_cpufreq - ondemand. > With patch set: 15.8 Seconds average and 24.51 package watts. > Without patch set: 11.61 Seconds average and 27.59 watts. > Conclusion: Significant reduction in performance with proposed patch set. > > Tests 2, 3, 4: Phoronix apache, kernel compile, and postmark tests. > Conclusion: All were similar with and without the patch set, with perhaps a > slight > improvement in power consumption for the postmark test with the patch set. > > Test 5: Random reads within a largish (50 gigabytes) file. > Reason: Because it was a test I used to use with other include or not include > IOWAIT work. > Conclusion: no difference with and without the patch set, likely due to > domination by > long seek times (the file is on a normal disk, not an SSD). > > Test 6: Sequential read of a largish (50 gigabytes) file. > Reason: Because it was a test I used to use with other include or not include > IOWAIT work. > With patch set: 288.38 Seconds; 177.544 MB/Sec; 6.83 Watts. > Without patch set: 292.38 Seconds; 174.99 MB/Sec; 7.08 Watts. > Conclusion: Better performance and better power with the patch set. > > Test 7: Compile the kernel 9 times. > Reason: Just because it was a very revealing test during the > "intel_pstate: Increase hold-off time before busyness is scaled" > discussion / thread(s). > Conclusion: no difference with and without the patch set. > > Test 8: pipe-test between cores. > Reason: Just because it was so useful during the > "cross core scheduling frequency drop bisected to 0c313cb20732" > discussion / thread(s). > With patch set: 73.166 Sec; 3.6576 usec/loop; 2278.53 Joules. > Without Patch set: 74.444 Sec; 3.7205 usec/loop; 2338.79 Joules. > Conclusion: Slightly better performance and better energy with the patch set. > > Test 9: Dougs_specpower simulator (20% load): > Time is fixed, less energy is better. > Reason: During the long > "[intel-pstate driver regression] processor frequency very high even if in > idle" > and subsequent https://bugzilla.kernel.org/show_bug.cgi?id=115771 > discussion / thread(s), some sort of test was needed to try to mimic what > Srinivas > was getting on his fancy SpecPower test platform. So far at least, this test > does that. > Only the 20% load case was created, because that was the biggest problem case > back then. > With patch set: 4 tests at an average of 7197 Joules per test, relatively > high CPU frequencies. > Without the patch set: 4 tests at an average of 5956 Joules per test, > relatively low CPU frequencies. > Conclusion: 21% energy regression with the patch set. > Note: Newer processors might do better than my older i7-2600K. > > Test 10: measure the frequency response curve, fixed work packet method, > 75 hertz work / sleep frequency (all CPU, no IOWAIT): > Reason: To compare to some older data and observe overall. > png graph attached - might get stripped from the distribution lists. > Conclusions: Tends to oscillate, suggesting some sort of damping is needed. > However, any filtering tends to increase the step function load rise time > (see test 11 below, I think there is some wiggle room here). > See also graph which has: with and without patch set; performance mode (for > reference); > Philippe Longepe's cpu_load method also with setpoint 40 (for reference); one > of my previous > attempts at a load related patch set from quite sometime ago (for reference). > > Test 11: Look at the step function load response. From no load to 100% on one > CPU (CPU load only, no IO). > While there is a graph, it is not attached: > Conclusion: The step function response is greatly improved (virtually one > sample time max). > It would probably be O.K. to slow it down a little with a filter so as to > reduce the > tendency to oscillate under periodic load conditions (to a point, at least. A > low enough frequency will > always oscillate) (see the graph for test10). All of the above is useful information, thanks for taking the time to do that work! > Questions: > Is there a migration plan? Not yet. We have quite a lot of testing to do first. > i.e. will there be an attempt to merge the current cpu_load method > and this method into one method? Quite possibly if the results are good enough. > Then possibly the PID controller could be eliminated. Right. Thanks, Rafael