On Mon, Nov 30, 2020 at 11:32 PM Mel Gorman <mgor...@techsingularity.net> wrote:
>
> On Mon, Nov 30, 2020 at 08:06:44PM +0100, Rafael J. Wysocki wrote:
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 526d65d8573a..5b8545022564 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -719,6 +719,24 @@
> > >  			policy to use. This governor must be registered in the
> > >  			kernel before the cpufreq driver probes.
> > >
> > > +	cpuidle.poll=
> > > +			[CPU_IDLE]
> > > +			Format: <int>
> > > +			Set the time in microseconds a CPU should poll in
> > > +			cpuidle for a new task before entering a sleep
> > > +			state. The default is determined by either the
> > > +			tick or the enabled c-state latencies. Tuning is
> > > +			not generally recommended but it may be needed
> > > +			for workloads that are both latency-sensitive
> > > +			and idling rapidly for short durations. Limiting
> > > +			c-states can be insufficient if the polling
> > > +			time is still too short, the application has no
> > > +			knowledge of /dev/cpu_dma_latency, there are
> > > +			multiple applications or the environment does
> > > +			not allow the installation of a userspace tool
> > > +			that controls cpu_dma_latency. This value may
> > > +			be ignored by the idle governor (e.g. haltpoll).
> >
> > OK, we can do this, but I'd use a shorter different description here
> > and a more detailed one in the admin-guide documentation.
> >
> > Also this is about certain drivers only which support the "polling
> > idle state" (the ACPI one and intel_idle only AFAICS). So I'm not
> > sure about the framework-level tunable here.
> >
> > Moreover, to be precise, that value is the maximum time to do the
> > polling (in one go) in the case when requesting any "physical" idle
> > states is likely to hurt energy-efficiency or latency. In particular,
> > it doesn't mean that idle CPUs will do the idle polling every time.
> >
>
> At first I was nodding along and thinking "sure". Then I started
> thinking about what the configuration space then looks like and how a
> user might reasonably interpret it. You were right during the review of
> the first version, it's a mess because it's driver specific and difficult
> to interpret even on a per-driver basis because there is no control of
> when a rescheduling event may occur.
Indeed.

> You suggested that poll=0 would be valid but that might be interpreted
> as being equivalent to idle=poll on x86 which is not the same thing.
> processor_idle and intel_idle would have understandable semantics if the
> parameter was maxpoll but it's not as understandable for haltpoll.

Well, my point was basically that if the plan was to add a boot parameter
to control the polling behavior, it would be prudent to also allow the
admin to specify that they didn't want any polling at all.

But frankly I was hoping to drive you away from that idea which seems to
have worked. :-)

> Finally, the parameter partially ties us into the current
> implementation. For example, the polling loop is based on clock time but
> we know looking up the clock is costly in itself so it's very granular
> based on the magic "check every 200 loops" logic, meaning we can go over
> the expected maximum polling interval. If we ever changed that into a
> calibration loop to estimate the number of loops then the polling interval
> changes slightly even for the same parameter as we no longer depend on the
> granularity of calling local_clock. If we ever decided to use adaptive
> polling similar to haltpoll then the behaviour changes again, resulting
> in bugs because the driver.poll parameter means something new.

Right.

> Using min_cstate was definitely a hazard because it showed up in both
> microbenchmarks and real workloads but you were right, let's only
> introduce a tunable when and if there is no other choice in the matter.
>
> So, informally the following patch is the next candidate. I'm happy to
> resend it as a separate mail if you prefer and think the patch is ok.

I actually can apply it right away, so no need to resend.

Many thanks for looking into this!

> --8<--
> cpuidle: Select polling interval based on a c-state with a longer target
> residency
>
> It was noted that a few workloads that idle rapidly regressed when commit
> 36fcb4292473 ("cpuidle: use first valid target residency as poll time")
> was merged. The workloads in question were heavy communicators that idle
> rapidly and were impacted by the c-state exit latency as the active CPUs
> were not polling at the time of wakeup. As they were not particularly
> realistic workloads, it was not considered to be a major problem.
>
> Unfortunately, a bug was reported for a real workload in a production
> environment that relied on large numbers of threads operating in a worker
> pool pattern. These threads would idle for periods of time longer than the
> C1 target residency and so incurred the c-state exit latency penalty. The
> application is very sensitive to wakeup latency and was indirectly relying
> on the behaviour prior to commit a37b969a61c1 ("cpuidle: poll_state: Add
> time limit to poll_idle()") to poll for long enough to avoid the exit
> latency cost.
>
> The target residency of C1 is typically very short. On some x86 machines,
> it can be as low as 2 microseconds. In poll_idle(), the clock is checked
> every POLL_IDLE_RELAX_COUNT iterations of cpu_relax() and even one
> iteration of that loop can be over 1 microsecond so the polling interval is
> very close to the granularity of what poll_idle() can detect. Furthermore,
> a basic ping-pong workload like perf bench pipe has a longer round-trip
> time than the 2 microseconds, meaning that the CPU will almost certainly
> not be polling when the ping-pong completes.
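(As a side note for anyone following the thread: the granularity issue
described above comes from the shape of the polling loop itself. Roughly,
simplified from drivers/cpuidle/poll_state.c and glossing over the setup
around it, so treat this as a sketch rather than the exact code:

	u64 time_start = local_clock();
	u64 limit = cpuidle_poll_time(drv, dev);
	unsigned int loop_count = 0;

	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_IDLE_RELAX_COUNT)
			continue;

		/*
		 * local_clock() is only consulted once every
		 * POLL_IDLE_RELAX_COUNT (200) spins, so the poll can
		 * overshoot "limit" by however long those spins take.
		 */
		loop_count = 0;
		if (local_clock() - time_start > limit) {
			dev->poll_time_limit = true;
			break;
		}
	}

Any limit close to the cost of those 200 cpu_relax() iterations is
effectively below the resolution of the loop.)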
> This patch selects a polling interval based on an enabled c-state that
> has a target residency longer than 10usec. If there is no such enabled
> c-state then polling will be up to TICK_NSEC/16, similar to what it was
> up until kernel 4.20. Polling for a full tick is unlikely (a rescheduling
> event is likely to arrive first) and a full tick is much longer than the
> existing target residencies for a deep c-state.
>
> As an example, consider a CPU with the following c-state information from
> an Intel CPU;
>
>                residency       exit_latency
>      C1        2               2
>      C1E       20              10
>      C3        100             33
>      C6        400             133
>
> The polling interval selected is 20usec. If booted with
> intel_idle.max_cstate=1 then the polling interval is 250usec as the deeper
> c-states were not available.
>
> On an AMD EPYC machine, the c-state information is more limited and
> looks like
>
>                residency       exit_latency
>      C1        2               1
>      C2        800             400
>
> The polling interval selected is 250usec. While C2 was considered, the
> polling interval was clamped by CPUIDLE_POLL_MAX.
>
> Note that it is not expected that polling will be a universal win. As
> well as potentially trading power for performance, the performance is not
> guaranteed if the extra polling prevented a turbo state being reached.
> Making it a tunable was considered but it's driver-specific, may be
> overridden by a governor and is not a guaranteed polling interval, making
> it difficult to describe without knowledge of the implementation.
>
> tbench4
>                          vanilla              polling
> Hmean     1        497.89 (   0.00%)      543.15 *   9.09%*
> Hmean     2        975.88 (   0.00%)     1059.73 *   8.59%*
> Hmean     4       1953.97 (   0.00%)     2081.37 *   6.52%*
> Hmean     8       3645.76 (   0.00%)     4052.95 *  11.17%*
> Hmean     16      6882.21 (   0.00%)     6995.93 *   1.65%*
> Hmean     32     10752.20 (   0.00%)    10731.53 *  -0.19%*
> Hmean     64     12875.08 (   0.00%)    12478.13 *  -3.08%*
> Hmean     128    21500.54 (   0.00%)    21098.60 *  -1.87%*
> Hmean     256    21253.70 (   0.00%)    21027.18 *  -1.07%*
> Hmean     320    20813.50 (   0.00%)    20580.64 *  -1.12%*
>
> Signed-off-by: Mel Gorman <mgor...@techsingularity.net>
> ---
>  drivers/cpuidle/cpuidle.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 83af15f77f66..ef2ea1b12cd8 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -368,6 +368,19 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
>  		cpuidle_curr_governor->reflect(dev, index);
>  }
>
> +/*
> + * Min polling interval of 10usec is a guess. It is assuming that
> + * for most users, the time for a single ping-pong workload like
> + * perf bench pipe would generally complete within 10usec but
> + * this is hardware dependent. Actual time can be estimated with
> + *
> + * perf bench sched pipe -l 10000
> + *
> + * Run multiple times to avoid cpufreq effects.
> + */
> +#define CPUIDLE_POLL_MIN 10000
> +#define CPUIDLE_POLL_MAX (TICK_NSEC / 16)
> +
>  /**
>   * cpuidle_poll_time - return amount of time to poll for,
>   * governors can override dev->poll_limit_ns if necessary
> @@ -382,15 +395,23 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv,
>  	int i;
>  	u64 limit_ns;
>
> +	BUILD_BUG_ON(CPUIDLE_POLL_MIN > CPUIDLE_POLL_MAX);
> +
>  	if (dev->poll_limit_ns)
>  		return dev->poll_limit_ns;
>
> -	limit_ns = TICK_NSEC;
> +	limit_ns = CPUIDLE_POLL_MAX;
>  	for (i = 1; i < drv->state_count; i++) {
> +		u64 state_limit;
> +
>  		if (dev->states_usage[i].disable)
>  			continue;
>
> -		limit_ns = drv->states[i].target_residency_ns;
> +		state_limit = drv->states[i].target_residency_ns;
> +		if (state_limit < CPUIDLE_POLL_MIN)
> +			continue;
> +
> +		limit_ns = min_t(u64, state_limit, CPUIDLE_POLL_MAX);
>  		break;
>  	}
>
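FWIW, the worked examples in the changelog are easy to reproduce with a
small userspace mock-up of the new selection walk. The residencies below
are the ones from the tables above, and TICK_NSEC is hard-coded on the
assumption of HZ=250 so that CPUIDLE_POLL_MAX comes out at 250usec; it
prints 20000 ns for the Intel table and 250000 ns for the EPYC one:

	#include <stdio.h>
	#include <stdint.h>

	#define TICK_NSEC		4000000ULL		/* assumes HZ=250 */
	#define CPUIDLE_POLL_MIN	10000ULL		/* 10usec, as in the patch */
	#define CPUIDLE_POLL_MAX	(TICK_NSEC / 16)	/* 250usec with HZ=250 */

	/* Residencies in nanoseconds; index 0 stands in for the polling state. */
	static const uint64_t intel[] = { 0, 2000, 20000, 100000, 400000 };
	static const uint64_t epyc[]  = { 0, 2000, 800000 };

	static uint64_t poll_time(const uint64_t *residency_ns, int count)
	{
		uint64_t limit_ns = CPUIDLE_POLL_MAX;
		int i;

		/*
		 * Same walk as the patched cpuidle_poll_time(): skip states
		 * shallower than CPUIDLE_POLL_MIN and clamp the first
		 * suitable one to CPUIDLE_POLL_MAX.
		 */
		for (i = 1; i < count; i++) {
			uint64_t state_limit = residency_ns[i];

			if (state_limit < CPUIDLE_POLL_MIN)
				continue;

			limit_ns = state_limit < CPUIDLE_POLL_MAX ?
					state_limit : CPUIDLE_POLL_MAX;
			break;
		}

		return limit_ns;
	}

	int main(void)
	{
		printf("intel: %llu ns\n", (unsigned long long)poll_time(intel, 5));
		printf("epyc:  %llu ns\n", (unsigned long long)poll_time(epyc, 3));
		return 0;
	}

With intel_idle.max_cstate=1 only the 2usec C1 entry remains, nothing
clears CPUIDLE_POLL_MIN, and the limit stays at the 250usec cap, which
matches the changelog as well.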