Hi all,

Any comments on the series?

We had users (on both client and server platforms) report spurious
NMI watchdog timeout issues.
There is a proposed workaround from tglx, which we are testing.
However, this patch series is believed to be the real fix for the issue,
so it would be better to get the series merged into mainline.


Here is the workaround, if you are interested:
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/

Thanks,
Kan

> 
> From: Kan Liang <[email protected]>
> 
> The dominant motivation is to make it possible to switch the cycles NMI
> watchdog to ref_cycles on x86 platforms. The new NMI watchdog can:
>  - Free up the widely used, precious cycles fixed counter. For example,
>    topdown code needs the cycles fixed counter in group counting.
>    Otherwise, it has to fall back to not using groups, which can cause
>    measurement inaccuracy due to multiplexing errors.
>  - Avoid the NMI watchdog expiring too fast. Cycles can tick faster
>    than the measured CPU frequency due to Turbo mode.
>    Although we can enlarge the period of cycles to work around the fast
>    expiry, that is still limited by the Turbo ratio and may fail
>    eventually.
> 
> CPU ref_cycles can only be counted by fixed counter 2. It uses a
> pseudo-encoding which the GP counters do not recognize.
> 
> BUS_CYCLES (0x013c) is another event which is not affected by core
> frequency changes. It has a constant ratio with CPU ref_cycles, so
> BUS_CYCLES can be used as an alternative event for ref_cycles on a GP
> counter.
> A hook is implemented in x86_schedule_events(). If fixed counter 2 is
> occupied and a GP counter is assigned, BUS_CYCLES is used to replace
> ref_cycles. A new flag, PERF_X86_EVENT_REF_CYCLES_REP, is introduced
> in hw_perf_event to indicate the replacement.
> To make the switch transparent to the perf tool, counting and sampling
> are also handled specially:
>  - For counting, the result is multiplied by the constant ratio after
>    being read.
>  - For sampling with a fixed period, the BUS_CYCLES period equals the
>    ref_cycles period divided by the constant ratio.
>  - For sampling with a fixed frequency, the adaptive frequency algorithm
>    figures it out from the ref_cycles event's last_period and event
>    count, but the reload value has to be the remaining BUS_CYCLES period.
> 
> User-visible RDPMC issue:
> It is impossible to transparently handle the case of reading ref_cycles
> via the RDPMC instruction in user space.
> ref_cycles_factor_for_rdpmc is therefore exposed to RDPMC users.
> The few users who read ref_cycles via RDPMC have to correct the result
> by multiplying it by ref_cycles_factor_for_rdpmc manually when a GP
> counter is used.
> This applies only to ref_cycles; it does not affect other events'
> results.
> 
> The constant ratio is model specific.
> For models after Nehalem but before Skylake, the ratio is defined in
> MSR_PLATFORM_INFO.
> For Skylake and later, it can be read from CPUID.15H.
> For Knights Landing, Goldmont and later, the ratio is always 1.
> 
> The older Silvermont/Airmont, Core2 and Atom machines are not covered
> by this patch; their behavior will not change.
> 
> Signed-off-by: Kan Liang <[email protected]>
> ---
> 
> Changes since V1:
>  - Retry the whole scheduling pass with the 0x0300 replacement event
>    if the 0x0300 event failed to schedule.
>  - Correct the method of calculating event->count, period_left, left
>    and offset in the different cases.
>  - Expose the factor to user space.
>  - Modify the changelog.
>  - There is no transparent way to handle ref_cycles replacement events
>    for RDPMC. RDPMC users have to multiply the result by the factor if
>    the ref_cycles replacement event is scheduled on a GP counter.
>    Other events are not affected.
>    There are few RDPMC users who use ref_cycles, so the impact should
>    be limited.
>    I will also send patches to user tools, such as pmu-tools and PAPI,
>    to minimize the impact.
> 
> 
>  arch/x86/events/core.c       |  92 ++++++++++++++++++++++++++++++++++--
>  arch/x86/events/intel/core.c | 110 ++++++++++++++++++++++++++++++++++++++++++-
>  arch/x86/events/perf_event.h |   5 ++
>  3 files changed, 203 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index e6f5e4b..18f8d37 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -70,7 +70,7 @@ u64 x86_perf_event_update(struct perf_event *event)
>       int shift = 64 - x86_pmu.cntval_bits;
>       u64 prev_raw_count, new_raw_count;
>       int idx = hwc->idx;
> -     u64 delta;
> +     u64 delta, adjust_delta;
> 
>       if (idx == INTEL_PMC_IDX_FIXED_BTS)
>               return 0;
> @@ -101,8 +101,47 @@ u64 x86_perf_event_update(struct perf_event *event)
>       delta = (new_raw_count << shift) - (prev_raw_count << shift);
>       delta >>= shift;
> 
> -     local64_add(delta, &event->count);
> -     local64_sub(delta, &hwc->period_left);
> +     /*
> +      * Correct the count number if applying ref_cycles replacement.
> +      * There is a constant ratio between ref_cycles (event A) and
> +      * ref_cycles replacement (event B). The delta result is from event B.
> +      * To get accurate event A count, the real delta result must be
> +      * multiplied with the constant ratio.
> +      *
> +      * It is handled differently for sampling and counting.
> +      * - Fixed period Sampling: The period is already adjusted in
> +      *   scheduling for event B. The period_left is the remaining period
> +      *   for event B. Don't need to modify period_left.
> +      *   It's enough to only adjust event->count here.
> +      *
> +      * - Fixed freq Sampling: The adaptive frequency algorithm needs
> +      *   the last_period and event counts for event A to estimate the next
> +      *   period for event A. So the period_left is the remaining period
> +      *   for event A. It needs to multiply the result with the ratio.
> +      *   However, the period_left is also used to reload the event counter
> +      *   for event B in x86_perf_event_set_period. It has to be adjusted to
> +      *   event B's remaining period there.
> +      *
> +      * - Counting: It's enough to only adjust event->count for perf stat.
> +      *
> +      * - RDPMC: User can also read the counter directly by rdpmc/mmap.
> +      *   Users have to handle the adjustment themselves.
> +      *   For kernel, it only needs to guarantee that the offset is correct.
> +      *   In x86's arch_perf_update_userpage, the offset will be corrected
> +      *   if event B is used.
> +      */
> +     adjust_delta = delta;
> +     if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> +             adjust_delta = delta * x86_pmu.ref_cycles_factor;
> +
> +             if (is_sampling_event(event) && event->attr.freq)
> +                     local64_sub(adjust_delta, &hwc->period_left);
> +             else
> +                     local64_sub(delta, &hwc->period_left);
> +     } else
> +             local64_sub(delta, &hwc->period_left);
> +
> +     local64_add(adjust_delta, &event->count);
> 
>       return new_raw_count;
>  }
> @@ -865,6 +904,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>       if (x86_pmu.start_scheduling)
>               x86_pmu.start_scheduling(cpuc);
> 
> +retry:
>       for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
>               cpuc->event_constraint[i] = NULL;
>               c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> @@ -952,6 +992,25 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>                        */
>                       if (x86_pmu.put_event_constraints)
>                               x86_pmu.put_event_constraints(cpuc, e);
> +
> +                     /*
> +                      * 0x0300 is pseudo-encoding for REF_CPU_CYCLES.
> +                      * It can only be scheduled on fixed counter 2.
> +                      * If the fixed counter 2 is occupied, the event is
> +                      * failed scheduling.
> +                      *
> +                      * If that's the case, an alternative event, which
> +                      * can be counted in GP counters, could be used to
> +                      * replace the pseudo-encoding REF_CPU_CYCLES event.
> +                      *
> +                      * Here we reset the event and retry the whole
> +                      * scheduling thing if there is unsched 0x0300 event.
> +                      */
> +                     if (unsched && x86_pmu.ref_cycles_rep &&
> +                         ((e->hw.config & X86_RAW_EVENT_MASK) == 0x0300)) {
> +                             x86_pmu.ref_cycles_rep(e);
> +                             goto retry;
> +                     }
>               }
>       }
> 
> @@ -1136,6 +1195,14 @@ int x86_perf_event_set_period(struct perf_event *event)
>               hwc->last_period = period;
>               ret = 1;
>       }
> +
> +     /*
> +      * Adjust left for ref_cycles replacement.
> +      * Please refer the comments in x86_perf_event_update for details.
> +      */
> +     if ((hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) &&
> +         event->attr.freq)
> +             left /= (u64)x86_pmu.ref_cycles_factor;
>       /*
>        * Quirk: certain CPUs dont like it if just 1 hw_event is left:
>        */
> @@ -1823,6 +1890,14 @@ static int __init init_hw_perf_events(void)
>                       x86_pmu_attr_group.attrs = tmp;
>       }
> 
> +     if (x86_pmu.factor_attrs) {
> +             struct attribute **tmp;
> +
> +             tmp = merge_attr(x86_pmu_attr_group.attrs, x86_pmu.factor_attrs);
> +             if (!WARN_ON(!tmp))
> +                     x86_pmu_attr_group.attrs = tmp;
> +     }
> +
>       pr_info("... version:                %d\n",     x86_pmu.version);
>       pr_info("... bit width:              %d\n",     x86_pmu.cntval_bits);
>       pr_info("... generic registers:      %d\n",     x86_pmu.num_counters);
> @@ -2300,6 +2375,17 @@ void arch_perf_update_userpage(struct perf_event *event,
>       }
> 
>       cyc2ns_read_end(data);
> +
> +     /*
> +      * offset should be corrected if ref cycles replacement is used.
> +      * Please refer the comments in x86_perf_event_update for details.
> +      */
> +     if (event->hw.flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> +             userpg->offset = __perf_event_count(event);
> +             userpg->offset /= (u64)x86_pmu.ref_cycles_factor;
> +             if (userpg->index)
> +                     userpg->offset -= local64_read(&event->hw.prev_count);
> +     }
>  }
> 
>  void
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index da9047e..4853795 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3063,6 +3063,55 @@ static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
>       return left;
>  }
> 
> +static __init unsigned int glm_get_ref_cycles_factor(void)
> +{
> +     return 1;
> +}
> +
> +static __init unsigned int nhm_get_ref_cycles_factor(void)
> +{
> +     u64 platform_info;
> +
> +     rdmsrl(MSR_PLATFORM_INFO, platform_info);
> +     return (platform_info >> 8) & 0xff;
> +}
> +
> +static __init unsigned int skl_get_ref_cycles_factor(void)
> +{
> +     unsigned int cpuid21_eax, cpuid21_ebx, cpuid21_ecx, unused;
> +
> +     cpuid(21, &cpuid21_eax, &cpuid21_ebx, &cpuid21_ecx, &unused);
> +     if (!cpuid21_eax || !cpuid21_ebx)
> +             return 0;
> +
> +     return cpuid21_ebx / cpuid21_eax;
> +}
> +
> +/*
> + * BUS_CYCLES (0x013c) is another event which is not affected by core
> + * frequency changes. It has a constant ratio with the CPU ref_cycles.
> + * BUS_CYCLES could be used as an alternative event for ref_cycles on
> + * GP counter.
> + */
> +void nhm_ref_cycles_rep(struct perf_event *event)
> +{
> +     struct hw_perf_event *hwc = &event->hw;
> +
> +     hwc->config = (hwc->config & ~X86_RAW_EVENT_MASK) |
> +                   intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES];
> +     hwc->flags |= PERF_X86_EVENT_REF_CYCLES_REP;
> +
> +     /* adjust the sample period for ref_cycles replacement event */
> +     if (is_sampling_event(event) && hwc->sample_period != 1) {
> +
> +             hwc->sample_period /= (u64)x86_pmu.ref_cycles_factor;
> +             if (hwc->sample_period < 2)
> +                     hwc->sample_period = 2;
> +             hwc->last_period = hwc->sample_period;
> +             local64_set(&hwc->period_left, hwc->sample_period);
> +     }
> +}
> +
>  PMU_FORMAT_ATTR(event,       "config:0-7"    );
>  PMU_FORMAT_ATTR(umask,       "config:8-15"   );
>  PMU_FORMAT_ATTR(edge,        "config:18"     );
> @@ -3656,6 +3705,21 @@ static struct attribute *intel_pmu_attrs[] = {
>       NULL,
>  };
> 
> +
> +static ssize_t ref_cycles_factor_for_rdpmc_show(struct device *cdev,
> +                                             struct device_attribute *attr,
> +                                             char *buf)
> +{
> +     return sprintf(buf, "%u\n", x86_pmu.ref_cycles_factor);
> +}
> +
> +DEVICE_ATTR_RO(ref_cycles_factor_for_rdpmc);
> +
> +static struct attribute *intel_ref_cycles_factor_attrs[] = {
> +     &dev_attr_ref_cycles_factor_for_rdpmc.attr,
> +     NULL,
> +};
> +
>  __init int intel_pmu_init(void)
>  {
>       union cpuid10_edx edx;
> @@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void)
> 
>               intel_pmu_pebs_data_source_nhm();
>               x86_add_quirk(intel_nehalem_quirk);
> -
> +             x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
>               pr_cont("Nehalem events, ");
>               break;
> 
> @@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void)
>               x86_pmu.lbr_pt_coexist = true;
>               x86_pmu.flags |= PMU_FL_HAS_RSP_1;
>               x86_pmu.cpu_events = glm_events_attrs;
> +             x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
>               pr_cont("Goldmont events, ");
>               break;
> 
> @@ -3864,6 +3937,11 @@ __init int intel_pmu_init(void)
> 
>       X86_CONFIG(.event=0xb1, .umask=0x3f, .inv=1, .cmask=1);
> 
>               intel_pmu_pebs_data_source_nhm();
> +             x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
>               pr_cont("Westmere events, ");
>               break;
> 
> @@ -3899,6 +3977,11 @@ __init int intel_pmu_init(void)
>               /* UOPS_DISPATCHED.THREAD,c=1,i=1 to count stall cycles*/
> 
>       intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
>       X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1);
> +             x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
> 
>               pr_cont("SandyBridge events, ");
>               break;
> @@ -3933,6 +4016,11 @@ __init int intel_pmu_init(void)
>               /* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
> 
>       intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
>       X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
> +             x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
> 
>               pr_cont("IvyBridge events, ");
>               break;
> @@ -3962,6 +4050,11 @@ __init int intel_pmu_init(void)
>               x86_pmu.get_event_constraints = hsw_get_event_constraints;
>               x86_pmu.cpu_events = hsw_events_attrs;
>               x86_pmu.lbr_double_abort = true;
> +             x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
>               pr_cont("Haswell events, ");
>               break;
> 
> @@ -3998,6 +4091,11 @@ __init int intel_pmu_init(void)
>               x86_pmu.get_event_constraints = hsw_get_event_constraints;
>               x86_pmu.cpu_events = hsw_events_attrs;
>               x86_pmu.limit_period = bdw_limit_period;
> +             x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
>               pr_cont("Broadwell events, ");
>               break;
> 
> @@ -4016,6 +4114,11 @@ __init int intel_pmu_init(void)
>               /* all extra regs are per-cpu when HT is on */
>               x86_pmu.flags |= PMU_FL_HAS_RSP_1;
>               x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
> +             x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
> 
>               pr_cont("Knights Landing/Mill events, ");
>               break;
> @@ -4051,6 +4154,11 @@ __init int intel_pmu_init(void)
>                                                 skl_format_attr);
>               WARN_ON(!x86_pmu.format_attrs);
>               x86_pmu.cpu_events = hsw_events_attrs;
> +             x86_pmu.ref_cycles_factor = skl_get_ref_cycles_factor();
> +             if (x86_pmu.ref_cycles_factor) {
> +                     x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +                     x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +             }
>               pr_cont("Skylake events, ");
>               break;
> 
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 53728ee..3470178 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -68,6 +68,7 @@ struct event_constraint {
>  #define PERF_X86_EVENT_EXCL_ACCT     0x0200 /* accounted EXCL event */
>  #define PERF_X86_EVENT_AUTO_RELOAD   0x0400 /* use PEBS auto-reload */
>  #define PERF_X86_EVENT_FREERUNNING   0x0800 /* use freerunning PEBS */
> +#define PERF_X86_EVENT_REF_CYCLES_REP        0x1000 /* use ref_cycles replacement */
> 
> 
>  struct amd_nb {
> @@ -550,6 +551,8 @@ struct x86_pmu {
>       int             perfctr_second_write;
>       bool            late_ack;
>       unsigned        (*limit_period)(struct perf_event *event, unsigned l);
> +     unsigned int    ref_cycles_factor;
> +     void            (*ref_cycles_rep)(struct perf_event *event);
> 
>       /*
>        * sysfs attrs
> @@ -565,6 +568,8 @@ struct x86_pmu {
>       unsigned long   attr_freeze_on_smi;
>       struct attribute **attrs;
> 
> +     unsigned int    attr_ref_cycles_factor;
> +     struct attribute **factor_attrs;
>       /*
>        * CPU Hotplug hooks
>        */
> --
> 2.7.4
