[PATCH] book3s_hv_rmhandlers:Pass the correct trap argument to kvmhv_commence_exit
In guest_exit_cont we call kvmhv_commence_exit which expects the trap number as the argument. However r3 doesn't contain the trap number at this point and as a result we would be calling the function with a spurious trap number. Fix this by copying r12 into r3 before calling kvmhv_commence_exit as r12 contains the trap number Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 4d70df2..f0d7c54 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1170,6 +1170,7 @@ mc_cont: bl kvmhv_accumulate_time #endif + mr r3, r12 /* Increment exit count, poke other threads to exit */ bl kvmhv_commence_exit nop -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc: Add an inline function to update HID0
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create a function name update_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/kvm_ppc.h | 11 +++ arch/powerpc/platforms/powernv/subcore.c | 4 ++-- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index c6ef05b..325f1d6 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -685,4 +685,15 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb) extern void xics_wake_cpu(int cpu); +static inline void update_hid0(unsigned long hid0) +{ + /* +* The HID0 update should at the very least be preceded by a +* a SYNC instruction followed by an ISYNC instruction +*/ + mb(); + mtspr(SPRN_HID0, hid0); + isync(); +} + #endif /* __POWERPC_KVM_PPC_H__ */ diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c index f60f80a..37e77bf 100644 --- a/arch/powerpc/platforms/powernv/subcore.c +++ b/arch/powerpc/platforms/powernv/subcore.c @@ -190,7 +190,7 @@ static void unsplit_core(void) hid0 = mfspr(SPRN_HID0); hid0 &= ~HID0_POWER8_DYNLPARDIS; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); while (mfspr(SPRN_HID0) & mask) @@ -227,7 +227,7 @@ static void split_core(int new_mode) /* Write new mode */ hid0 = mfspr(SPRN_HID0); hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); /* Wait for it to happen */ -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: powerpc: Add an inline function to update HID0
Hi Michael, On Tue, Aug 04, 2015 at 08:08:58PM +1000, Michael Ellerman wrote: > On Tue, 2015-04-08 at 08:30:58 UTC, "Gautham R. Shenoy" wrote: > > Section 3.7 of Version 1.2 of the Power8 Processor User's Manual > > prescribes that updates to HID0 be preceded by a SYNC instruction and > > followed by an ISYNC instruction (Page 91). > > > > Create a function name update_hid0() which follows this recipe and > > invoke it from the static split core path. > > > > Signed-off-by: Gautham R. Shenoy > > --- > > arch/powerpc/include/asm/kvm_ppc.h | 11 +++ > > Why is it in there? It's not KVM related per se. Ok. Will fix this. > > Where should it go? I think reg.h would be best, ideally near the definition > for HID0, though that's probably not possible because of ASSEMBLY > requirements. > So at the bottom of reg.h ? > > > diff --git a/arch/powerpc/include/asm/kvm_ppc.h > > b/arch/powerpc/include/asm/kvm_ppc.h > > index c6ef05b..325f1d6 100644 > > --- a/arch/powerpc/include/asm/kvm_ppc.h > > +++ b/arch/powerpc/include/asm/kvm_ppc.h > > @@ -685,4 +685,15 @@ static inline ulong kvmppc_get_ea_indexed(struct > > kvm_vcpu *vcpu, int ra, int rb) > > > > extern void xics_wake_cpu(int cpu); > > > > +static inline void update_hid0(unsigned long hid0) > > +{ > > + /* > > +* The HID0 update should at the very least be preceded by a > > +* a SYNC instruction followed by an ISYNC instruction > > +*/ > > + mb(); > > + mtspr(SPRN_HID0, hid0); > > + isync(); > > That's going to turn into three separate inline asm blocks, which is maybe a > bit unfortunate. Have you checked the generated code is what we want, ie. just > sync, mtspr, isync ? > Yes, the objdump of subcore.o shows exactly these three instructions: 7c 00 04 ac sync 7c 70 fb a6 mtspr 1008,r3 4c 00 01 2c isync > cheers > -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powerpc: Add an inline function to update HID0
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create an inline function name update_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy --- [v1--> v2: Moved defn of update_hid0 to reg.h from kvm_ppc.h] arch/powerpc/include/asm/reg.h | 13 + arch/powerpc/platforms/powernv/subcore.c | 4 ++-- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index af56b5c..beb98ac 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -12,6 +12,8 @@ #include #include +#include +#include /* Pickup Book E specific registers. */ #if defined(CONFIG_BOOKE) || defined(CONFIG_40x) @@ -1281,6 +1283,17 @@ struct pt_regs; extern void ppc_save_regs(struct pt_regs *regs); +static inline void update_hid0(unsigned long hid0) +{ + /* +* The HID0 update should at the very least be preceded by a +* a SYNC instruction followed by an ISYNC instruction +*/ + mb(); + mtspr(SPRN_HID0, hid0); + isync(); +} + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_REG_H */ diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c index f60f80a..37e77bf 100644 --- a/arch/powerpc/platforms/powernv/subcore.c +++ b/arch/powerpc/platforms/powernv/subcore.c @@ -190,7 +190,7 @@ static void unsplit_core(void) hid0 = mfspr(SPRN_HID0); hid0 &= ~HID0_POWER8_DYNLPARDIS; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); while (mfspr(SPRN_HID0) & mask) @@ -227,7 +227,7 @@ static void split_core(int new_mode) /* Write new mode */ hid0 = mfspr(SPRN_HID0); hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); /* Wait for it to happen */ -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: powerpc: Add an inline function to update HID0
Hi Segher, Thanks for the suggestions. I will rename the function to update_power8_hid0() and use asm volatile. On Tue, Aug 04, 2015 at 09:30:57PM -0500, Segher Boessenkool wrote: > On Tue, Aug 04, 2015 at 08:08:58PM +1000, Michael Ellerman wrote: > > > +static inline void update_hid0(unsigned long hid0) > > > +{ > > > + /* > > > + * The HID0 update should at the very least be preceded by a > > > + * a SYNC instruction followed by an ISYNC instruction > > > + */ > > > + mb(); > > > + mtspr(SPRN_HID0, hid0); > > > + isync(); > > > > That's going to turn into three separate inline asm blocks, which is maybe a > > bit unfortunate. Have you checked the generated code is what we want, ie. > > just > > sync, mtspr, isync ? > > The "mb()" is not such a great name anyway: you don't want a memory > barrier, you want an actual sync instruction ("sync 0", "hwsync", > whatever the currently preferred spelling is). > > The function name should also say this is for POWER8 (the required > sequences are different for some other processors; and some others > might not even _have_ a HID0, or not at 1008). power8_write_hid0 > or such? > > For writing it as one asm, why not just > > asm volatile("sync ; mtspr %0,%1 ; isync" : : "i"(SPRN_HID0), "r"(hid0)); > > instead of the stringify stuff? > > > Segher > -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3] powerpc: Add an inline function to update POWER8 HID0
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create an inline function name update_power8_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy --- [v1 --> v2: Moved defn of update_hid0 to reg.h from kvm_ppc.h] [v2 --> v3: Renamed to update_power8_hid0 and used asm volatile] arch/powerpc/include/asm/reg.h | 9 + arch/powerpc/platforms/powernv/subcore.c | 4 ++-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index af56b5c..1245d99 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -1281,6 +1281,15 @@ struct pt_regs; extern void ppc_save_regs(struct pt_regs *regs); +static inline void update_power8_hid0(unsigned long hid0) +{ + /* +* The HID0 update on Power8 should at the very least be +* preceded by a a SYNC instruction followed by an ISYNC +* instruction +*/ + asm volatile("sync; mtspr %0,%1; isync":: "i"(SPRN_HID0), "r"(hid0)); +} #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_REG_H */ diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c index f60f80a..503a73f 100644 --- a/arch/powerpc/platforms/powernv/subcore.c +++ b/arch/powerpc/platforms/powernv/subcore.c @@ -190,7 +190,7 @@ static void unsplit_core(void) hid0 = mfspr(SPRN_HID0); hid0 &= ~HID0_POWER8_DYNLPARDIS; - mtspr(SPRN_HID0, hid0); + update_power8_hid0(hid0); update_hid_in_slw(hid0); while (mfspr(SPRN_HID0) & mask) @@ -227,7 +227,7 @@ static void split_core(int new_mode) /* Write new mode */ hid0 = mfspr(SPRN_HID0); hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value; - mtspr(SPRN_HID0, hid0); + update_power8_hid0(hid0); update_hid_in_slw(hid0); /* Wait for it to happen */ -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] book3s_hv_rmhandlers:Pass the correct trap argument to kvmhv_commence_exit
Hi Sam, On Fri, Aug 14, 2015 at 03:07:28PM +1000, Sam Bobroff wrote: > On Thu, May 21, 2015 at 01:57:04PM +0530, Gautham R. Shenoy wrote: > > In guest_exit_cont we call kvmhv_commence_exit which expects the trap > > number as the argument. However r3 doesn't contain the trap number at > > this point and as a result we would be calling the function with a > > spurious trap number. > > > > Fix this by copying r12 into r3 before calling kvmhv_commence_exit as > > r12 contains the trap number > > > > Signed-off-by: Gautham R. Shenoy > > Hi Gautham, > > I agree with your logic: r3 is quite clearly corrupted in that path. So: > > Reviewed-by: Sam Bobroff > > Just one comment: Do you have a case of this causing some visible problem due > to the corrupted trap number? (I'll test the patch if you do.) > Actually no! I found this bug while reviewing the code for some other issue. > Cheers, > Sam. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/cpufreq: Add pr_warn() on OPAL firmware failures
On Sun, Aug 03, 2014 at 02:54:05PM +0530, Vaidyanathan Srinivasan wrote: > @@ -131,7 +136,12 @@ static unsigned int pstate_id_to_freq(int pstate_id) > int i; > > i = powernv_pstate_info.max - pstate_id; > - BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0); > + if (i >= powernv_pstate_info.nr_pstates || i < 0) { > + pr_warn("PState id %d outside of PState table, " > + "reporting nominal id %d instead\n", > + pstate_id, powernv_pstate_info.nominal); > + i = powernv_pstate_info.max - powernv_pstate_info.nominal; As of now the default loglevel corresponds to KERN_WARNING so this warning should get printed anyway. However, don't you think it would be better if we make it a pr_err( ) since it's a platform error that's causing the pstate_id to go out of bounds ? Otherwise it looks ok. Acked-by: Gautham R. Shenoy > + } > > return powernv_freqs[i].frequency; > } -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/4] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
From: "Gautham R. Shenoy" Move the code that computes the cpumask corresponding to the thread-siblings of a cpu to a new method named powernv_cpu_to_core_mask() so that it can be used by other places in the code. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index c59eb26..f0dae6f 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -166,6 +166,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = { /* Helper routines */ +/** + * Sets the bits corresponding to the thread-siblings of cpu in its core + * in 'cpus'. + */ +static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus) +{ + int base, i; + + base = cpu_first_thread_sibling(cpu); + + for (i = 0; i < threads_per_core; i++) { + cpumask_set_cpu(base + i, cpus); + } + + return; +} + /* Access helpers to power mgt SPR */ static inline unsigned long get_pmspr(unsigned long sprn) @@ -231,13 +248,10 @@ static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index) static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) { - int base, i; + int i; #ifdef CONFIG_SMP - base = cpu_first_thread_sibling(policy->cpu); - - for (i = 0; i < threads_per_core; i++) - cpumask_set_cpu(base + i, policy->cpus); + powernv_cpu_to_core_mask(policy->cpu, policy->cpus); #endif policy->cpuinfo.transition_latency = 25000; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/4] powernv:cpufreq: Create pstate_id_to_freq() helper
From: "Gautham R. Shenoy" Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 28 1 file changed, 28 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 345501e..d0a8dee 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -37,6 +37,15 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); #define POWERNV_MAX_PSTATES256 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; + +static struct powernv_pstate_info powernv_pstate_info; + /* * Initialize the freq table based on data obtained @@ -106,9 +115,28 @@ static int init_powernv_pstates(void) powernv_freqs[i].driver_data = 0; powernv_freqs[i].frequency = CPUFREQ_TABLE_END; + powernv_pstate_info.pstate_min_id = pstate_min; + powernv_pstate_info.pstate_max_id = pstate_max; + powernv_pstate_info.pstate_nominal_id = pstate_nominal; + powernv_pstate_info.nr_pstates = nr_pstates; + return 0; } +/** + * Returns the cpu frequency corresponding to the pstate_id. + */ +static unsigned int pstate_id_to_freq(int pstate_id) +{ + int i; + + i = powernv_pstate_info.pstate_max_id - pstate_id; + + BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0); + WARN_ON(powernv_freqs[i].driver_data != pstate_id); + return powernv_freqs[i].frequency; +} + static struct freq_attr *powernv_cpu_freq_attr[] = { &cpufreq_freq_attr_scaling_available_freqs, NULL, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/4] powernv:cpufreq: Export nominal frequency via sysfs.
From: "Gautham R. Shenoy" Create a driver attribute named cpuinfo_nominal_frequency which creates a sysfs read-only file named cpuinfo_nominal_frequency. Export the frequency corresponding to the nominal_pstate through this interface. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index d0a8dee..c59eb26 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -137,8 +137,30 @@ static unsigned int pstate_id_to_freq(int pstate_id) return powernv_freqs[i].frequency; } +/** + * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by + * the firmware + */ +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, + char *buf) +{ + int nominal_freq; + nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); + return sprintf(buf, "%u\n", nominal_freq); +} + + +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { + .attr = { .name = "cpuinfo_nominal_freq", + .mode = 0444, + }, + .show = show_cpuinfo_nominal_freq, +}; + + static struct freq_attr *powernv_cpu_freq_attr[] = { &cpufreq_freq_attr_scaling_available_freqs, + &cpufreq_freq_attr_cpuinfo_nominal_freq, NULL, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/4] powernv:cpufreq: Export nominal and current frequency via sysfs
From: "Gautham R. Shenoy" Hi, On IBM POWERNV platforms[1], presently we do not report the current operating frequency of the processor through the sysfs interface "cpuinfo_cur_freq" since the cpu frequency driver[1] has not implemented the ->get(unsigned int cpu) method. Fix this by implementing the ->get(unsigned int cpu) method which will report the frequency corresponding to the pstate_id on PMSR on "cpu". Also, export the nominal frequency of the processor through the sysfs interface "cpuinfo_nominal_freq" by defining it as a driver attribute. The patch series depends on the cpu-frequency driver patch-series[1]. [1] http://comments.gmane.org/gmane.linux.ports.ppc.embedded/67905 Thanks and Regards gautham. Summary: == Gautham R. Shenoy (4): powernv:cpufreq: Create pstate_id_to_freq() helper powernv:cpufreq: Export nominal frequency via sysfs. powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper. powernv:cpufreq: Implement the driver->get() method drivers/cpufreq/powernv-cpufreq.c | 115 -- 1 file changed, 110 insertions(+), 5 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 4/4] powernv:cpufreq: Implement the driver->get() method
From: "Gautham R. Shenoy" The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a "->get(unsigned int cpu)" method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which corresponds to the required ->get() method. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 41 +++ 1 file changed, 41 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index f0dae6f..5f43e4f 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -218,6 +218,46 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/** + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_get_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + long pstate_id; + int *cur_freq, freq; + + cur_freq = (int *)ret_freq; + pmspr_val = get_pmspr(SPRN_PMSR); + pstate_id = pmspr_val; + pstate_id = pstate_id >> 56; + WARN_ON(pstate_id > 0); + freq = pstate_id_to_freq(pstate_id); + pr_debug("cpu %d pmsr %lx pstate_id %ld frequency %d \n", + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; +} + +/** + * Returns the cpu frequency as reported by the firmware for 'cpu'. + * This value is reported through the sysfs file cpuinfo_cur_freq. + */ +unsigned int powernv_cpufreq_get(unsigned int cpu) +{ + int ret_freq; + cpumask_var_t sibling_mask; + + if (!zalloc_cpumask_var(&sibling_mask, GFP_KERNEL)) + return -ENOMEM; + + powernv_cpu_to_core_mask(cpu, sibling_mask); + smp_call_function_any(sibling_mask, powernv_get_cpu_freq, &ret_freq, 1); + + free_cpumask_var(sibling_mask); + return ret_freq; +} + static void set_pstate(void *pstate) { unsigned long val; @@ -310,6 +350,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, static struct cpufreq_driver powernv_cpufreq_driver = { .verify = powernv_cpufreq_verify, .target = powernv_cpufreq_target, + .get= powernv_cpufreq_get, .init = powernv_cpufreq_cpu_init, .exit = powernv_cpufreq_cpu_exit, .name = "powernv-cpufreq", -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 14/52] powerpc, sysfs: Fix CPU hotplug callback registration
Hello Ben, On Fri, Mar 07, 2014 at 01:57:31PM +1100, Benjamin Herrenschmidt wrote: > On Fri, 2014-02-14 at 13:22 +0530, Srivatsa S. Bhat wrote: > > Subsystems that want to register CPU hotplug callbacks, as well as perform > > initialization for the CPUs that are already online, often do it as shown > > below: > > > > get_online_cpus(); > > > > for_each_online_cpu(cpu) > > init_cpu(cpu); > > > > register_cpu_notifier(&foobar_cpu_notifier); > > > > put_online_cpus(); > > This patch breaks a good half of my test configs with: > > /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c: In function > 'topology_init': > /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:979:2: error: > implicit declaration of function 'cpu_notifier_register_begin' > [-Werror=implicit-function-declaration] > /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:1004:2: error: > implicit declaration of function '__register_cpu_notifier' > [-Werror=implicit-function-declaration] > /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:1006:2: error: > implicit declaration of function 'cpu_notifier_register_done' > [-Werror=implicit-function-declaration] > cc1: all warnings being treated as errors > make[2]: *** [arch/powerpc/kernel/sysfs.o] Error 1 > make[2]: *** Waiting for unfinished jobs This patch depends on "[PATCH v2 02/52] CPU hotplug: Provide lockless versions of callback registration functions" of the series (Can be found here: https://lkml.org/lkml/2014/2/14/59). This particular patch defines 'cpu_notifier_register_begin', '__register_cpu_notifier', and 'cpu_notifier_register_done' in the cpu-hotplug core. Are you seeing the build breakage with this patch on ? -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 0/6] powernv:cpufreq: Dynamic cpu-frequency scaling
From: "Gautham R. Shenoy" Hi, This is the v2 of the consolidated patchset consisting patches for enabling cpufreq on IBM POWERNV platforms along with some enhancements. The v1 of these patches have been previously submitted on linuxppc-dev [1][2]. - This patchset contains code for the platform driver to support CPU frequency scaling on IBM POWERNV platforms. - In addition to the standard control and status files exposed by the cpufreq core, the patchset exposes the nominal frequency through the file named "cpuinfo_nominal_freq". The patchset is based against commit c3bebc71c4bcdafa24b506adf0c1de3c1f77e2e0 of the mainline tree. [1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-February/115244.html [2]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-March/115703.html Gautham R. Shenoy (3): powernv:cpufreq: Create pstate_id_to_freq() helper powernv:cpufreq: Export nominal frequency via sysfs. powernv:cpufreq: Implement the driver->get() method Srivatsa S. Bhat (2): powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper. powernv,cpufreq:Add per-core locking to serialize frequency transitions Vaidyanathan Srinivasan (1): powernv: cpufreq driver for powernv platform arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 397 + 6 files changed, 417 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 5/6] powernv:cpufreq: Export nominal frequency via sysfs.
From: "Gautham R. Shenoy" Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 0ecd163..183bbc4 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -142,8 +142,30 @@ static unsigned int pstate_id_to_freq(int pstate_id) return powernv_freqs[i].frequency; } +/** + * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by + * the firmware + */ +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, + char *buf) +{ + int nominal_freq; + nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); + return sprintf(buf, "%u\n", nominal_freq); +} + + +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { + .attr = { .name = "cpuinfo_nominal_freq", + .mode = 0444, + }, + .show = show_cpuinfo_nominal_freq, +}; + + static struct freq_attr *powernv_cpu_freq_attr[] = { &cpufreq_freq_attr_scaling_available_freqs, + &cpufreq_freq_attr_cpuinfo_nominal_freq, NULL, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 1/6] powernv: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Srivatsa S. Bhat Signed-off-by: Anton Blanchard Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 277 + 6 files changed, 297 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 90c06ec..84f92ca 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..1fe12b1 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,7 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ default y config PPC_POWERNV_RTAS diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 4b029c0..4ba1632 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS choice prompt "Default CPUFreq governor" default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ default CPU_FREQ_DEFAULT_GOV_PERFORMANCE help This option sets which CPUFreq governor shall be loaded at diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..93f8689 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,16 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate "CPU frequency scaling for IBM POWERNV platform" + depends on PPC_POWERNV + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..ab1551f --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,277 @@ +/* + * POWERNV cpufreq driver for the IBM POWER processors + * + * (C) Copyright IBM 2014 + * + * Author: Vaidyanathan Srinivasan + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR
[PATCH v2 3/6] powernv, cpufreq:Add per-core locking to serialize frequency transitions
From: "Srivatsa S. Bhat" On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Using a global mutex lock would needlessly serialize _all_ frequency transitions in the system (across all cores). So introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 4cad727..4c2e8ca 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -24,8 +24,15 @@ #include #include -/* FIXME: Make this per-core */ -static DEFINE_MUTEX(freq_switch_mutex); +/* Per-Core locking for frequency transitions */ +static DEFINE_PER_CPU(struct mutex, freq_switch_lock); + +#define lock_core_freq(cpu)\ + mutex_lock(&per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); +#define unlock_core_freq(cpu) \ + mutex_unlock(&per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); #define POWERNV_MAX_PSTATES256 @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, freqs.new = powernv_freqs[new_index].frequency; freqs.cpu = policy->cpu; - mutex_lock(&freq_switch_mutex); + lock_core_freq(policy->cpu); cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE); pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d", @@ -245,7 +252,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, rc = powernv_set_freq(policy->cpus, new_index); cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE); - mutex_unlock(&freq_switch_mutex); + unlock_core_freq(policy->cpu); return rc; } @@ -262,7 +269,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = { static int __init powernv_cpufreq_init(void) { - int rc = 0; + int cpu, rc = 0; /* Discover pstates from device tree and init */ @@ -272,6 +279,10 @@ static int __init powernv_cpufreq_init(void) pr_info("powernv-cpufreq disabled\n"); return rc; } + /* Init per-core mutex */ + for_each_possible_cpu(cpu) { + mutex_init(&per_cpu(freq_switch_lock, cpu)); + } rc = cpufreq_register_driver(&powernv_cpufreq_driver); return rc; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 2/6] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
From: "Srivatsa S. Bhat" Create a helper method that computes the cpumask corresponding to the thread-siblings of a cpu. Use this for initializing the policy->cpus mask for a given cpu. (Original code written by Srivatsa S. Bhat. Gautham moved this to a helper function!) Signed-off-by: Srivatsa S. Bhat Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index ab1551f..4cad727 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -115,6 +115,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = { /* Helper routines */ +/** + * Sets the bits corresponding to the thread-siblings of cpu in its core + * in 'cpus'. + */ +static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus) +{ + int base, i; + + base = cpu_first_thread_sibling(cpu); + + for (i = 0; i < threads_per_core; i++) { + cpumask_set_cpu(base + i, cpus); + } + + return; +} + /* Access helpers to power mgt SPR */ static inline unsigned long get_pmspr(unsigned long sprn) @@ -180,13 +197,8 @@ static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index) static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) { - int base, i; - #ifdef CONFIG_SMP - base = cpu_first_thread_sibling(policy->cpu); - - for (i = 0; i < threads_per_core; i++) - cpumask_set_cpu(base + i, policy->cpus); + powernv_cpu_to_core_mask(policy->cpu, policy->cpus); #endif policy->cpuinfo.transition_latency = 25000; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 6/6] powernv:cpufreq: Implement the driver->get() method
From: "Gautham R. Shenoy" The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a "->get(unsigned int cpu)" method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which reads the local pstate from the PMSR and returns the corresponding frequency. Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 48 +++ 1 file changed, 48 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 183bbc4..6f3b6e1 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -223,6 +223,53 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/* + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; + pmspr_val = get_pmspr(SPRN_PMSR); + + /* The local pstate id corresponds bits 48..55 in the PMSR. + * Note: Watch out for the sign! */ + local_pstate_id = (pmspr_val >> 48) & 0xFF; + pstate_id = local_pstate_id; + + freq = pstate_id_to_freq(pstate_id); + pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n", + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; +} + +/* + * Returns the cpu frequency as reported by the firmware for 'cpu'. + * This value is reported through the sysfs file cpuinfo_cur_freq. + */ +unsigned int powernv_cpufreq_get(unsigned int cpu) +{ + int ret_freq; + cpumask_var_t sibling_mask; + + if (unlikely(!zalloc_cpumask_var(&sibling_mask, GFP_KERNEL))) { + smp_call_function_single(cpu, powernv_read_cpu_freq, + &ret_freq, 1); + return ret_freq; + } + + powernv_cpu_to_core_mask(cpu, sibling_mask); + smp_call_function_any(sibling_mask, powernv_read_cpu_freq, + &ret_freq, 1); + + free_cpumask_var(sibling_mask); + return ret_freq; +} + static void set_pstate(void *pstate) { unsigned long val; @@ -309,6 +356,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, static struct cpufreq_driver powernv_cpufreq_driver = { .verify = powernv_cpufreq_verify, .target = powernv_cpufreq_target, + .get= powernv_cpufreq_get, .init = powernv_cpufreq_cpu_init, .exit = powernv_cpufreq_cpu_exit, .name = "powernv-cpufreq", -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 4/6] powernv:cpufreq: Create pstate_id_to_freq() helper
From: "Gautham R. Shenoy" Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 4c2e8ca..0ecd163 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; +static struct powernv_pstate_info powernv_pstate_info; + /* * Initialize the freq table based on data obtained * from the firmware passed via device-tree @@ -112,9 +120,28 @@ static int init_powernv_pstates(void) for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) pr_debug("%d: %d\n", i, powernv_freqs[i].frequency); + powernv_pstate_info.pstate_min_id = pstate_min; + powernv_pstate_info.pstate_max_id = pstate_max; + powernv_pstate_info.pstate_nominal_id = pstate_nominal; + powernv_pstate_info.nr_pstates = nr_pstates; + return 0; } +/** + * Returns the cpu frequency corresponding to the pstate_id. + */ +static unsigned int pstate_id_to_freq(int pstate_id) +{ + int i; + + i = powernv_pstate_info.pstate_max_id - pstate_id; + + BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0); + WARN_ON(powernv_pstate_ids[i] != pstate_id); + return powernv_freqs[i].frequency; +} + static struct freq_attr *powernv_cpu_freq_attr[] = { &cpufreq_freq_attr_scaling_available_freqs, NULL, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 1/2] powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo
From: "Gautham R. Shenoy" Currently, the code in setup-common.c for powerpc assumes that all clock rates are same in a smp system. This value is cached in the variable named ppc_proc_freq and is the value that is reported in /proc/cpuinfo. However on the PowerNV platform, the clock rate is same only across the threads of the same core. Hence the value that is reported in /proc/cpuinfo is incorrect on PowerNV platforms. We need a better way to query and report the correct value of the processor clock in /proc/cpuinfo. The patch achieves this by creating a machdep_call named get_proc_freq() which is expected to returns the frequency in Hz. The code in show_cpuinfo() can invoke this method to display the correct clock rate on platforms that have implemented this method. On the other powerpc platforms it can use the value cached in ppc_proc_freq. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/machdep.h | 2 ++ arch/powerpc/kernel/setup-common.c | 16 2 files changed, 14 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index ad3025d..eb1c6d4 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -113,6 +113,8 @@ struct machdep_calls { /* Optional, may be NULL. */ void(*show_cpuinfo)(struct seq_file *m); void(*show_percpuinfo)(struct seq_file *m, int i); + /* Returns the current operating frequency of "cpu" in Hz */ + unsigned long (*get_proc_freq)(unsigned int cpu); void(*init_IRQ)(void); diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index bc76cc6..bdd3045 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -209,6 +209,7 @@ static int show_cpuinfo(struct seq_file *m, void *v) { unsigned long cpu_id = (unsigned long)v - 1; unsigned int pvr; + unsigned long proc_freq; unsigned short maj; unsigned short min; @@ -260,12 +261,19 @@ static int show_cpuinfo(struct seq_file *m, void *v) #endif /* CONFIG_TAU */ /* -* Assume here that all clock rates are the same in a -* smp system. -- Cort +* Platforms that have variable clock rates, should implement +* the method ppc_md.get_proc_freq() that reports the clock +* rate of a given cpu. The rest can use ppc_proc_freq to +* report the clock rate that is same across all cpus. */ - if (ppc_proc_freq) + if (ppc_md.get_proc_freq) + proc_freq = ppc_md.get_proc_freq(cpu_id); + else + proc_freq = ppc_proc_freq; + + if (proc_freq) seq_printf(m, "clock\t\t: %lu.%06luMHz\n", - ppc_proc_freq / 100, ppc_proc_freq % 100); + proc_freq / 100, proc_freq % 100); if (ppc_md.show_percpuinfo != NULL) ppc_md.show_percpuinfo(m, cpu_id); -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 2/2] powerpc: powernv: Implement ppc_md.get_proc_freq()
From: "Gautham R. Shenoy" Implement a method named pnv_get_proc_freq(unsigned int cpu) which returns the current clock rate on the 'cpu' in Hz to be reported in /proc/cpuinfo. This method uses the value reported by cpufreq when such a value is sane. Otherwise it falls back to old way of reporting the clockrate, i.e. ppc_proc_freq. Set the ppc_md.get_proc_freq() hook to pnv_get_proc_freq() on the PowerNV platform. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/platforms/powernv/setup.c | 21 + 1 file changed, 21 insertions(+) diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c index 110f4fb..423e36d 100644 --- a/arch/powerpc/platforms/powernv/setup.c +++ b/arch/powerpc/platforms/powernv/setup.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -235,6 +236,25 @@ void powernv_idle(void) } } +/* + * Returns the cpu frequency for 'cpu' in Hz. This is used by + * /proc/cpuinfo + */ +unsigned long pnv_get_proc_freq(unsigned int cpu) +{ + unsigned long ret_freq; + + ret_freq = cpufreq_quick_get(cpu) * 1000ul; + + /* +* If the backend cpufreq driver does not exist, + * then fallback to old way of reporting the clockrate. +*/ + if (!ret_freq) + ret_freq = ppc_proc_freq; + return ret_freq; +} + define_machine(powernv) { .name = "PowerNV", .probe = pnv_probe, @@ -242,6 +262,7 @@ define_machine(powernv) { .setup_arch = pnv_setup_arch, .init_IRQ = pnv_init_IRQ, .show_cpuinfo = pnv_show_cpuinfo, + .get_proc_freq = pnv_get_proc_freq, .progress = pnv_progress, .machine_shutdown = pnv_shutdown, .power_save = powernv_idle, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 0/2] powernv: Show the correct clock value in /proc/cpuinfo
From: "Gautham R. Shenoy" Hi, Currently, the code in setup-common.c for powerpc assumes that all clock rates are same in a smp system. This value is cached in the variable named ppc_proc_freq and is the value that is reported in /proc/cpuinfo. However on the PowerNV platform, the clock rate is same only across the threads of the same core. Hence the value that is reported in /proc/cpuinfo is incorrect on PowerNV platforms. This patch-series fixes this problem by having /proc/cpuinfo report the value returned by cpufreq_quick_get(cpu) whenever the cpufreq backend driver is available and fallback to the old way of reporting the clock rate in its absence. These patches depend on the patches to enable dynamic cpufrequency scaling on PowerNV that can be found here: http://linuxppc.10917.n7.nabble.com/PATCH-v2-0-6-powernv-cpufreq-Dynamic-cpu-frequency-scaling-td80641.html Gautham R. Shenoy (2): powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo powerpc: powernv: Implement ppc_md.get_proc_freq() arch/powerpc/include/asm/machdep.h | 2 ++ arch/powerpc/kernel/setup-common.c | 16 arch/powerpc/platforms/powernv/setup.c | 21 + 3 files changed, 35 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 2/6] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
On Wed, Mar 19, 2014 at 12:05:20PM +0530, Srivatsa S. Bhat wrote: > On 03/19/2014 05:07 AM, Benjamin Herrenschmidt wrote: > > On Mon, 2014-03-10 at 16:40 +0530, Gautham R. Shenoy wrote: > >> From: "Srivatsa S. Bhat" > >> > >> Create a helper method that computes the cpumask corresponding to the > >> thread-siblings of a cpu. Use this for initializing the policy->cpus > >> mask for a given cpu. > >> > >> (Original code written by Srivatsa S. Bhat. Gautham moved this to a > >> helper function!) > > > > How does that differ from the existing entry in the sibling map which > > you can obtain with the helper cpu_sibling_mask() ? (You probably want > > to *copy* the mask of course but I don't see the need of re-creating > > one from scratch). > > > > The intent here was to have a sibling mask that is invariant of CPU > hotplug. cpu_sibling_mask is updated on every hotplug operation, and this > would break cpufreq, since policy->cpus has to be hotplug invariant. > > This should have been noted in the changelog of patch 1 as well as this > patch. (The earlier (internal) versions of this patchset had the > description in the changelogs, but somehow it got dropped accidentally). > I'll work with Gautham and ensure that we have this important info > included in the changelog. Thanks for pointing it out! I reused that part of the code because I was unaware of cpu_sibling_mask()! For implementing powernv_cpufreq_get(), cpu_sibling_mask() suffices since we are using this mask as a parameter to smp_call_function_any(), and hence are concerned only about the online siblings. I shall fix the changelog to reflect Srivatsa's reason for having a mask that does not vary with cpu-hotplug. > > Regards, > Srivatsa S. Bhat > > > > Also, this should have been CCed to the cpufreq mailing list and maybe > > the relevant maintainer too. Ok. Will do it in the subsequent version. Thanks for the review! > > > > Cheers, > > Ben. > > > >> Signed-off-by: Srivatsa S. Bhat > >> Signed-off-by: Gautham R. Shenoy > >> --- > >> drivers/cpufreq/powernv-cpufreq.c | 24 ++-- > >> 1 file changed, 18 insertions(+), 6 deletions(-) > >> > >> diff --git a/drivers/cpufreq/powernv-cpufreq.c > >> b/drivers/cpufreq/powernv-cpufreq.c > >> index ab1551f..4cad727 100644 > >> --- a/drivers/cpufreq/powernv-cpufreq.c > >> +++ b/drivers/cpufreq/powernv-cpufreq.c > >> @@ -115,6 +115,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = { > >> > >> /* Helper routines */ > >> > >> +/** > >> + * Sets the bits corresponding to the thread-siblings of cpu in its core > >> + * in 'cpus'. > >> + */ > >> +static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus) > >> +{ > >> + int base, i; > >> + > >> + base = cpu_first_thread_sibling(cpu); > >> + > >> + for (i = 0; i < threads_per_core; i++) { > >> + cpumask_set_cpu(base + i, cpus); > >> + } > >> + > >> + return; > >> +} > >> + > >> /* Access helpers to power mgt SPR */ > >> > >> static inline unsigned long get_pmspr(unsigned long sprn) > >> @@ -180,13 +197,8 @@ static int powernv_set_freq(cpumask_var_t cpus, > >> unsigned int new_index) > >> > >> static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) > >> { > >> - int base, i; > >> - > >> #ifdef CONFIG_SMP > >> - base = cpu_first_thread_sibling(policy->cpu); > >> - > >> - for (i = 0; i < threads_per_core; i++) > >> - cpumask_set_cpu(base + i, policy->cpus); > >> + powernv_cpu_to_core_mask(policy->cpu, policy->cpus); > >> #endif > >>policy->cpuinfo.transition_latency = 25000; > >> > > > > > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 0/5] powernv: Enable Dynamic Frequency
From: "Gautham R. Shenoy" Hi, This is the v3 of the consolidated patchset consisting patches for enabling cpufreq on IBM POWERNV platforms along with some enhancements. I have incorporated the review for v2 (which can be found at [1]). - This patchset contains code for the platform driver to support CPU frequency scaling on IBM POWERNV platforms. - In addition to the standard control and status files exposed by the cpufreq core, the patchset exposes the nominal frequency through the file named "cpuinfo_nominal_freq". The changes from from v2: - Updated the changelog for "powernv: cpufreq driver for powernv platform" to record the fact that policy->cpus must be populated with a sibling mask that includes even the offlined siblings. - Dropped the patch named "powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper" since the driver->get() method can use cpu_sibling_mask(). - Updated "powernv:cpufreq: Implement the driver->get() method" to use cpu_sibling_mask() in powernv_cpufreq_get() The patchset is based against commit 4907cdca7210c5895311bddcf05a4c85b67d8566 of the mainline tree. [1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-March/115861.html Gautham R. Shenoy (3): powernv:cpufreq: Create pstate_id_to_freq() helper powernv:cpufreq: Export nominal frequency via sysfs. powernv:cpufreq: Implement the driver->get() method Srivatsa S. Bhat (1): powernv,cpufreq:Add per-core locking to serialize frequency transitions Vaidyanathan Srinivasan (1): powernv: cpufreq driver for powernv platform arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 375 + 6 files changed, 395 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 1/5] powernv: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The policy->cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy->related_cpus mask which is should not vary on cpu-hotplug. [Change log updated by e...@linux.vnet.ibm.com] Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Srivatsa S. Bhat Signed-off-by: Anton Blanchard Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 277 + 6 files changed, 297 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 90c06ec..84f92ca 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..1fe12b1 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,7 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ default y config PPC_POWERNV_RTAS diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 4b029c0..4ba1632 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS choice prompt "Default CPUFreq governor" default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ default CPU_FREQ_DEFAULT_GOV_PERFORMANCE help This option sets which CPUFreq governor shall be loaded at diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..93f8689 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,16 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate "CPU frequency scaling for IBM POWERNV platform" + depends on PPC_POWERNV + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..ab1551f --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,277 @@ +/* + * POWERNV cpufreq driver for the IBM POWER processors + * + * (C) Copyright IBM 2014 + * + * Author: Vaidyanathan Srinivasan + * + * This program is free software; you can redistribute it and/or modify + * it under the terms
[PATCH v3 2/5] powernv, cpufreq:Add per-core locking to serialize frequency transitions
From: "Srivatsa S. Bhat" On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Using a global mutex lock would needlessly serialize _all_ frequency transitions in the system (across all cores). So introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index ab1551f..66dae0d 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -24,8 +24,15 @@ #include #include -/* FIXME: Make this per-core */ -static DEFINE_MUTEX(freq_switch_mutex); +/* Per-Core locking for frequency transitions */ +static DEFINE_PER_CPU(struct mutex, freq_switch_lock); + +#define lock_core_freq(cpu)\ + mutex_lock(&per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); +#define unlock_core_freq(cpu) \ + mutex_unlock(&per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); #define POWERNV_MAX_PSTATES256 @@ -221,7 +228,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, freqs.new = powernv_freqs[new_index].frequency; freqs.cpu = policy->cpu; - mutex_lock(&freq_switch_mutex); + lock_core_freq(policy->cpu); cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE); pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d", @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, rc = powernv_set_freq(policy->cpus, new_index); cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE); - mutex_unlock(&freq_switch_mutex); + unlock_core_freq(policy->cpu); return rc; } @@ -250,7 +257,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = { static int __init powernv_cpufreq_init(void) { - int rc = 0; + int cpu, rc = 0; /* Discover pstates from device tree and init */ @@ -260,6 +267,10 @@ static int __init powernv_cpufreq_init(void) pr_info("powernv-cpufreq disabled\n"); return rc; } + /* Init per-core mutex */ + for_each_possible_cpu(cpu) { + mutex_init(&per_cpu(freq_switch_lock, cpu)); + } rc = cpufreq_register_driver(&powernv_cpufreq_driver); return rc; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 3/5] powernv:cpufreq: Create pstate_id_to_freq() helper
From: "Gautham R. Shenoy" Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 66dae0d..e7b0292 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; +static struct powernv_pstate_info powernv_pstate_info; + /* * Initialize the freq table based on data obtained * from the firmware passed via device-tree @@ -112,9 +120,28 @@ static int init_powernv_pstates(void) for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) pr_debug("%d: %d\n", i, powernv_freqs[i].frequency); + powernv_pstate_info.pstate_min_id = pstate_min; + powernv_pstate_info.pstate_max_id = pstate_max; + powernv_pstate_info.pstate_nominal_id = pstate_nominal; + powernv_pstate_info.nr_pstates = nr_pstates; + return 0; } +/** + * Returns the cpu frequency corresponding to the pstate_id. + */ +static unsigned int pstate_id_to_freq(int pstate_id) +{ + int i; + + i = powernv_pstate_info.pstate_max_id - pstate_id; + + BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0); + WARN_ON(powernv_pstate_ids[i] != pstate_id); + return powernv_freqs[i].frequency; +} + static struct freq_attr *powernv_cpu_freq_attr[] = { &cpufreq_freq_attr_scaling_available_freqs, NULL, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 4/5] powernv:cpufreq: Export nominal frequency via sysfs.
From: "Gautham R. Shenoy" Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index e7b0292..46bee8a 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -142,8 +142,30 @@ static unsigned int pstate_id_to_freq(int pstate_id) return powernv_freqs[i].frequency; } +/** + * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by + * the firmware + */ +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, + char *buf) +{ + int nominal_freq; + nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); + return sprintf(buf, "%u\n", nominal_freq); +} + + +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { + .attr = { .name = "cpuinfo_nominal_freq", + .mode = 0444, + }, + .show = show_cpuinfo_nominal_freq, +}; + + static struct freq_attr *powernv_cpu_freq_attr[] = { &cpufreq_freq_attr_scaling_available_freqs, + &cpufreq_freq_attr_cpuinfo_nominal_freq, NULL, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
From: "Gautham R. Shenoy" The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a "->get(unsigned int cpu)" method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which reads the local pstate from the PMSR and returns the corresponding frequency. Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 38 ++ 1 file changed, 38 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 46bee8a..ef6ed8c 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/* + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; + pmspr_val = get_pmspr(SPRN_PMSR); + + /* The local pstate id corresponds bits 48..55 in the PMSR. + * Note: Watch out for the sign! */ + local_pstate_id = (pmspr_val >> 48) & 0xFF; + pstate_id = local_pstate_id; + + freq = pstate_id_to_freq(pstate_id); + pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n", + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; +} + +/* + * Returns the cpu frequency as reported by the firmware for 'cpu'. + * This value is reported through the sysfs file cpuinfo_cur_freq. + */ +unsigned int powernv_cpufreq_get(unsigned int cpu) +{ + int ret_freq; + + smp_call_function_any(cpu_sibling_mask(cpu), powernv_read_cpu_freq, + &ret_freq, 1); + return ret_freq; +} + static void set_pstate(void *pstate) { unsigned long val; @@ -297,6 +334,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, static struct cpufreq_driver powernv_cpufreq_driver = { .verify = powernv_cpufreq_verify, .target = powernv_cpufreq_target, + .get= powernv_cpufreq_get, .init = powernv_cpufreq_cpu_init, .exit = powernv_cpufreq_cpu_exit, .name = "powernv-cpufreq", -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 2/5] powernv, cpufreq:Add per-core locking to serialize frequency transitions
On Thu, Mar 20, 2014 at 05:40:57PM +0530, Gautham R. Shenoy wrote: > From: "Srivatsa S. Bhat" > > On POWER systems, the CPU frequency is controlled at a core-level and > hence we need to serialize so that only one of the threads in the core > switches the core's frequency at a time. > > Using a global mutex lock would needlessly serialize _all_ frequency > transitions in the system (across all cores). So introduce per-core > locking to enable finer-grained synchronization and thereby enhance > the speed and responsiveness of the cpufreq driver to varying workload > demands. > > The design of per-core locking is very simple and straight-forward: we > first define a Per-CPU lock and use the ones that belongs to the first > thread sibling of the core. > > cpu_first_thread_sibling() macro is used to find the *common* lock for > all thread siblings belonging to a core. > Forgot to add the following line. Reviewed-by: Preeti U Murthy > Signed-off-by: Srivatsa S. Bhat > Signed-off-by: Vaidyanathan Srinivasan > Signed-off-by: Gautham R. Shenoy > --- > drivers/cpufreq/powernv-cpufreq.c | 21 - > 1 file changed, 16 insertions(+), 5 deletions(-) > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > b/drivers/cpufreq/powernv-cpufreq.c > index ab1551f..66dae0d 100644 > --- a/drivers/cpufreq/powernv-cpufreq.c > +++ b/drivers/cpufreq/powernv-cpufreq.c > @@ -24,8 +24,15 @@ > #include > #include > > -/* FIXME: Make this per-core */ > -static DEFINE_MUTEX(freq_switch_mutex); > +/* Per-Core locking for frequency transitions */ > +static DEFINE_PER_CPU(struct mutex, freq_switch_lock); > + > +#define lock_core_freq(cpu) \ > + mutex_lock(&per_cpu(freq_switch_lock,\ > + cpu_first_thread_sibling(cpu))); > +#define unlock_core_freq(cpu)\ > + mutex_unlock(&per_cpu(freq_switch_lock,\ > + cpu_first_thread_sibling(cpu))); > > #define POWERNV_MAX_PSTATES 256 > > @@ -221,7 +228,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy > *policy, > freqs.new = powernv_freqs[new_index].frequency; > freqs.cpu = policy->cpu; > > - mutex_lock(&freq_switch_mutex); > + lock_core_freq(policy->cpu); > cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE); > > pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d", > @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy > *policy, > rc = powernv_set_freq(policy->cpus, new_index); > > cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE); > - mutex_unlock(&freq_switch_mutex); > + unlock_core_freq(policy->cpu); > > return rc; > } > @@ -250,7 +257,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = { > > static int __init powernv_cpufreq_init(void) > { > - int rc = 0; > + int cpu, rc = 0; > > /* Discover pstates from device tree and init */ > > @@ -260,6 +267,10 @@ static int __init powernv_cpufreq_init(void) > pr_info("powernv-cpufreq disabled\n"); > return rc; > } > + /* Init per-core mutex */ > + for_each_possible_cpu(cpu) { > + mutex_init(&per_cpu(freq_switch_lock, cpu)); > + } > > rc = cpufreq_register_driver(&powernv_cpufreq_driver); > return rc; > -- > 1.8.3.1 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 3/5] powernv:cpufreq: Create pstate_id_to_freq() helper
On Thu, Mar 20, 2014 at 05:40:58PM +0530, Gautham R. Shenoy wrote: > From: "Gautham R. Shenoy" > > Create a helper routine that can return the cpu-frequency for the > corresponding pstate_id. > > Also, cache the values of the pstate_max, pstate_min and > pstate_nominal and nr_pstates in a static structure so that they can > be reused in the future to perform any validations. > Forgot to add the following line: Reviewed-by: Preeti U Murthy > Signed-off-by: Gautham R. Shenoy > --- > drivers/cpufreq/powernv-cpufreq.c | 27 +++ > 1 file changed, 27 insertions(+) > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > b/drivers/cpufreq/powernv-cpufreq.c > index 66dae0d..e7b0292 100644 > --- a/drivers/cpufreq/powernv-cpufreq.c > +++ b/drivers/cpufreq/powernv-cpufreq.c > @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); > static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; > static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; > > +struct powernv_pstate_info { > + int pstate_min_id; > + int pstate_max_id; > + int pstate_nominal_id; > + int nr_pstates; > +}; > +static struct powernv_pstate_info powernv_pstate_info; > + > /* > * Initialize the freq table based on data obtained > * from the firmware passed via device-tree > @@ -112,9 +120,28 @@ static int init_powernv_pstates(void) > for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) > pr_debug("%d: %d\n", i, powernv_freqs[i].frequency); > > + powernv_pstate_info.pstate_min_id = pstate_min; > + powernv_pstate_info.pstate_max_id = pstate_max; > + powernv_pstate_info.pstate_nominal_id = pstate_nominal; > + powernv_pstate_info.nr_pstates = nr_pstates; > + > return 0; > } > > +/** > + * Returns the cpu frequency corresponding to the pstate_id. > + */ > +static unsigned int pstate_id_to_freq(int pstate_id) > +{ > + int i; > + > + i = powernv_pstate_info.pstate_max_id - pstate_id; > + > + BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0); > + WARN_ON(powernv_pstate_ids[i] != pstate_id); > + return powernv_freqs[i].frequency; > +} > + > static struct freq_attr *powernv_cpu_freq_attr[] = { > &cpufreq_freq_attr_scaling_available_freqs, > NULL, > -- > 1.8.3.1 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
On Thu, Mar 20, 2014 at 05:41:00PM +0530, Gautham R. Shenoy wrote: > From: "Gautham R. Shenoy" > > The current frequency of a cpu is reported through the sysfs file > cpuinfo_cur_freq. This requires the driver to implement a > "->get(unsigned int cpu)" method which will return the current > operating frequency. > > Implement a function named powernv_cpufreq_get() which reads the local > pstate from the PMSR and returns the corresponding frequency. > > Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). Forgot to add Reviewed-by: Preeti U Murthy > > Signed-off-by: Gautham R. Shenoy > --- > drivers/cpufreq/powernv-cpufreq.c | 38 ++ > 1 file changed, 38 insertions(+) > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > b/drivers/cpufreq/powernv-cpufreq.c > index 46bee8a..ef6ed8c 100644 > --- a/drivers/cpufreq/powernv-cpufreq.c > +++ b/drivers/cpufreq/powernv-cpufreq.c > @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, > unsigned long val) > BUG(); > } > > +/* > + * Computes the current frequency on this cpu > + * and stores the result in *ret_freq. > + */ > +static void powernv_read_cpu_freq(void *ret_freq) > +{ > + unsigned long pmspr_val; > + s8 local_pstate_id; > + int *cur_freq, freq, pstate_id; > + > + cur_freq = (int *)ret_freq; > + pmspr_val = get_pmspr(SPRN_PMSR); > + > + /* The local pstate id corresponds bits 48..55 in the PMSR. > + * Note: Watch out for the sign! */ > + local_pstate_id = (pmspr_val >> 48) & 0xFF; > + pstate_id = local_pstate_id; > + > + freq = pstate_id_to_freq(pstate_id); > + pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n", > + smp_processor_id(), pmspr_val, pstate_id, freq); > + *cur_freq = freq; > +} > + > +/* > + * Returns the cpu frequency as reported by the firmware for 'cpu'. > + * This value is reported through the sysfs file cpuinfo_cur_freq. > + */ > +unsigned int powernv_cpufreq_get(unsigned int cpu) > +{ > + int ret_freq; > + > + smp_call_function_any(cpu_sibling_mask(cpu), powernv_read_cpu_freq, > + &ret_freq, 1); > + return ret_freq; > +} > + > static void set_pstate(void *pstate) > { > unsigned long val; > @@ -297,6 +334,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy > *policy, > static struct cpufreq_driver powernv_cpufreq_driver = { > .verify = powernv_cpufreq_verify, > .target = powernv_cpufreq_target, > + .get= powernv_cpufreq_get, > .init = powernv_cpufreq_cpu_init, > .exit = powernv_cpufreq_cpu_exit, > .name = "powernv-cpufreq", > -- > 1.8.3.1 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 0/5] powernv: Enable Dynamic Frequency
Hi Viresh, On Fri, Mar 21, 2014 at 01:16:03PM +0530, Viresh Kumar wrote: > On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy > wrote: > > This is the v3 of the consolidated patchset consisting > > patches for enabling cpufreq on IBM POWERNV platforms > > along with some enhancements. > > This is the first time I saw them. Looks like you never Cc'd linux-pm list. > Also, would be better to keep Maintainers in --to field so that they can > review them as soon as possible. Otherwise they might stay on lists for > a long time.. Yes, I hadn't Cc'ed linux-pm list last time around. Sorry about that. Will keep this in mind the next time around. Also, your suggestion regarding keeping the Maintainers in --to field is well taken. -- Thanks and Regards gautham. > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 4/5] powernv:cpufreq: Export nominal frequency via sysfs.
On Fri, Mar 21, 2014 at 02:17:28PM +0530, Viresh Kumar wrote: > > +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, > > + char *buf) > > +{ > > + int nominal_freq; > > + nominal_freq = > > pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); > > + return sprintf(buf, "%u\n", nominal_freq); > > return sprintf(buf, "%u\n", > pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id)); Sure. Will fix this. > > ?? > > > +} > > + > > + > > remove extra blank line. > > > +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { > > + .attr = { .name = "cpuinfo_nominal_freq", > > + .mode = 0444, > > + }, > > Align {} Probably the use of ATTR_RO(cpuinfo_nominal_freq) and renaming show_cpuinfo_nominal_freq to cpuinfo_nominal_freq_show() would be even better. What do you think ? > > > + .show = show_cpuinfo_nominal_freq, > > +}; > > + > > + > > static struct freq_attr *powernv_cpu_freq_attr[] = { > > &cpufreq_freq_attr_scaling_available_freqs, > > + &cpufreq_freq_attr_cpuinfo_nominal_freq, > > NULL, > > }; This needs to be rewritten to include the entries of cpufreq_generic_attr. > Thanks for the review. -- Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform
Hi Viresh, On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote: > On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy > wrote: > > From: Vaidyanathan Srinivasan > > Hi Vaidy, > > > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig > > index 4b029c0..4ba1632 100644 > > --- a/drivers/cpufreq/Kconfig > > +++ b/drivers/cpufreq/Kconfig > > @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS > > choice > > prompt "Default CPUFreq governor" > > default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || > > ARM_SA1110_CPUFREQ > > + default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ > > Probably we should remove SA1100's entry as well from here. This is > not the right way of doing it. Imagine 100 platforms having entries here. > If you want it, then select it from your platforms Kconfig. Sure. Will move these bits and the other governor related bits to the Powernv Kconfig. > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > > b/drivers/cpufreq/powernv-cpufreq.c > > new file mode 100644 > > index 000..ab1551f > > --- /dev/null > > + > > +#define pr_fmt(fmt)"powernv-cpufreq: " fmt > > + > > +#include > > +#include > > +#include > > +#include > > That's it? Sure? > > Even if things compile for you, you must explicitly include all the > files on which > you depend. Ok. > > > + > > + WARN_ON(len_ids != len_freqs); > > + nr_pstates = min(len_ids, len_freqs) / sizeof(u32); > > + WARN_ON(!nr_pstates); > > Why do you want to continue here? Good point. We might be better off exiting at this point. > > > + pr_debug("NR PStates %d\n", nr_pstates); > > + for (i = 0; i < nr_pstates; i++) { > > + u32 id = be32_to_cpu(pstate_ids[i]); > > + u32 freq = be32_to_cpu(pstate_freqs[i]); > > + > > + pr_debug("PState id %d freq %d MHz\n", id, freq); > > + powernv_freqs[i].driver_data = i; > > I don't think you are using this field at all and this is the field you can > use for driver_data and so you can get rid of powernv_pstate_ids[ ]. Using driver_data to record powernv_pstate_ids won't work since powernv_pstate_ids can be negative. So a pstate_id -3 can be confused with CPUFREQ_BOOST_FREQ thereby not displaying the frequency corresponding to pstate id -3. So for now I think we will be retaining powernv_pstate_ids. > > > + powernv_freqs[i].frequency = freq * 1000; /* kHz */ > > + powernv_pstate_ids[i] = id; > > + } > > + /* End of list marker entry */ > > + powernv_freqs[i].driver_data = 0; > > Not required. Ok. > > > + powernv_freqs[i].frequency = CPUFREQ_TABLE_END; > > + > > + /* Print frequency table */ > > + for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) > > + pr_debug("%d: %d\n", i, powernv_freqs[i].frequency); > > You have already printed this table earlier.. Fair enough. > > > + > > + return 0; > > +} > > + > > +static struct freq_attr *powernv_cpu_freq_attr[] = { > > + &cpufreq_freq_attr_scaling_available_freqs, > > + NULL, > > +}; > > Can use this instead: cpufreq_generic_attr? In this patch yes. But later patch introduces an additional attribute for displaying the nominal frequency. Will handle that part in a clean way in the next version. > > > +/* Helper routines */ > > + > > +/* Access helpers to power mgt SPR */ > > + > > +static inline unsigned long get_pmspr(unsigned long sprn) > > Looks big enough not be inlined? It is called from one function. It has been defined separately for readability. > > > +{ > > + switch (sprn) { > > + case SPRN_PMCR: > > + return mfspr(SPRN_PMCR); > > + > > + case SPRN_PMICR: > > + return mfspr(SPRN_PMICR); > > + > > + case SPRN_PMSR: > > + return mfspr(SPRN_PMSR); > > + } > > + BUG(); > > +} > > + > > +static inline void set_pmspr(unsigned long sprn, unsigned long val) > > +{ > > Same here.. Same reason as above. > > > + switch (sprn) { > > + case SPRN_PMCR: > > + mtspr(SPRN_PMCR, val); > > + return; > > + > > + case SPRN_PMICR: > > + mtspr(SPRN_PMICR, val); > > + return;
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
On Fri, Mar 21, 2014 at 03:01:29PM +0530, Viresh Kumar wrote: > On Thu, Mar 20, 2014 at 5:41 PM, Gautham R. Shenoy > wrote: > > From: "Gautham R. Shenoy" > > > > The current frequency of a cpu is reported through the sysfs file > > cpuinfo_cur_freq. This requires the driver to implement a > > "->get(unsigned int cpu)" method which will return the current > > operating frequency. > > > > Implement a function named powernv_cpufreq_get() which reads the local > > pstate from the PMSR and returns the corresponding frequency. > > > > Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). > > > > Signed-off-by: Gautham R. Shenoy > > --- > > drivers/cpufreq/powernv-cpufreq.c | 38 > > ++ > > 1 file changed, 38 insertions(+) > > Please merge these fixups with the first patch which is creating the driver. > I understand that a different guy has created this patch and so wanted > to have a patch on his name but its really difficult to review this way. Heh! Well, that wasn't the reason why this was sent out as a separate patch, but never mind. Though I don't understand why it would be difficult to review the patch though. > Better add your signed-off in the first patch instead. Sending such changes > once the driver is mainlined looks fine. Sure, this makes sense. > > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > > b/drivers/cpufreq/powernv-cpufreq.c > > index 46bee8a..ef6ed8c 100644 > > --- a/drivers/cpufreq/powernv-cpufreq.c > > +++ b/drivers/cpufreq/powernv-cpufreq.c > > @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, > > unsigned long val) > > BUG(); > > } > > > > +/* > > + * Computes the current frequency on this cpu > > + * and stores the result in *ret_freq. > > + */ > > +static void powernv_read_cpu_freq(void *ret_freq) > > +{ > > + unsigned long pmspr_val; > > + s8 local_pstate_id; > > + int *cur_freq, freq, pstate_id; > > + > > + cur_freq = (int *)ret_freq; > > You don't need cur_freq variable at all.. I don't like it either. But the compiler complains without this hack :-( > > > + pmspr_val = get_pmspr(SPRN_PMSR); > > + > > + /* The local pstate id corresponds bits 48..55 in the PMSR. > > + * Note: Watch out for the sign! */ > > + local_pstate_id = (pmspr_val >> 48) & 0xFF; > > + pstate_id = local_pstate_id; > > similarly local_pstate_id well, I am interested in the bits 48..55 of pmspr_val. That's the pstate_id which can be negative. So I'll like to keep local_pstate_id. > > > + > > + freq = pstate_id_to_freq(pstate_id); > > + pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n", > > + smp_processor_id(), pmspr_val, pstate_id, freq); > > + *cur_freq = freq; > > Move above print here after a blank line. Also remove freq variable as > well and use *cur_freq directly.. And then you can rename it to freq as well. We can get rid of freq and use *cur_freq in its place. Thanks for the detailed review. -- Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
On Fri, Mar 21, 2014 at 12:01:07PM +, David Laight wrote: > From: Viresh Kumar > > > On 21 March 2014 16:34, Gautham R Shenoy wrote: > > > Heh! Well, that wasn't the reason why this was sent out as a separate > > > patch, but never mind. Though I don't understand why it would be > > > difficult to review the patch though. > > > > Because the initial driver wasn't complete earlier. There were 2-3 patches > > after the first one which are changing what the first patch has added. > > Nothing else :) > > > > >> > +static void powernv_read_cpu_freq(void *ret_freq) > > >> > +{ > > >> > + unsigned long pmspr_val; > > >> > + s8 local_pstate_id; > > >> > + int *cur_freq, freq, pstate_id; > > >> > + > > >> > + cur_freq = (int *)ret_freq; > > >> > > >> You don't need cur_freq variable at all.. > > > > > > I don't like it either. But the compiler complains without this hack > > > :-( > > > > Why would the compiler warn for doing this?: > > > > *(int *)ret_freq = freq; > > Because it is very likely to be wrong. > In general casts of pointers to integer types are dangerous. > In this case why not make the function return the value? Because this function is called via an smp_call_function(). And we need a way of returning the value to the caller. > > David > > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
On Fri, Mar 21, 2014 at 05:15:34PM +0530, Viresh Kumar wrote: > On 21 March 2014 16:34, Gautham R Shenoy wrote: > > Heh! Well, that wasn't the reason why this was sent out as a separate > > patch, but never mind. Though I don't understand why it would be > > difficult to review the patch though. > > Because the initial driver wasn't complete earlier. There were 2-3 patches > after the first one which are changing what the first patch has added. > Nothing else :) > > >> > +static void powernv_read_cpu_freq(void *ret_freq) > >> > +{ > >> > + unsigned long pmspr_val; > >> > + s8 local_pstate_id; > >> > + int *cur_freq, freq, pstate_id; > >> > + > >> > + cur_freq = (int *)ret_freq; > >> > >> You don't need cur_freq variable at all.. > > > > I don't like it either. But the compiler complains without this hack > > :-( > > Why would the compiler warn for doing this?: > > *(int *)ret_freq = freq; The compiler doesn't complain if I do this. > > >> > + pmspr_val = get_pmspr(SPRN_PMSR); > >> > + > >> > + /* The local pstate id corresponds bits 48..55 in the PMSR. > >> > + * Note: Watch out for the sign! */ > >> > + local_pstate_id = (pmspr_val >> 48) & 0xFF; > >> > + pstate_id = local_pstate_id; > >> > >> similarly local_pstate_id > > > > well, I am interested in the bits 48..55 of pmspr_val. That's the > > pstate_id which can be negative. So I'll like to keep > > local_pstate_id. > > Can you please explain at bit level how getting the value in a s8 > first and then into a s32 will help you here? Instead of getting it > directly into s32. Consider the case when pmspr = 0x00fe; We are interested in extracting the value 'fe'. And ensure that when we store this value into an int, we get the sign extension right. So the following doesn't work: pstate_id = (pmspr_val >> 48) & 0x; Since we get pstate_id = 0x00fe, which is not the same as 0xfffe. Of course, we could do the following, but I wasn't sure if two levels of type casting helped on the readability front. pstate_id = (int) ((s8)((pmspr_val >> 48) & 0xFF)); -- Thanks and Regards gautham. > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform
Hi Viresh, On Fri, Mar 21, 2014 at 04:24:27PM +0530, Viresh Kumar wrote: > On 21 March 2014 16:13, Gautham R Shenoy wrote: > > On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote: > > >> > + pr_debug("PState id %d freq %d MHz\n", id, freq); > >> > + powernv_freqs[i].driver_data = i; > >> > >> I don't think you are using this field at all and this is the field you can > >> use for driver_data and so you can get rid of powernv_pstate_ids[ ]. > > > > Using driver_data to record powernv_pstate_ids won't work since > > powernv_pstate_ids can be negative. So a pstate_id -3 can be confused > > with CPUFREQ_BOOST_FREQ thereby not displaying the frequency > > corresponding to pstate id -3. So for now I think we will be retaining > > powernv_pstate_ids. > > As I said earlier, this field is only used by cpufreq drivers and cpufreq core > isn't supposed to touch it. CPUFREQ_BOOST_FREQ and other macros > are there for .frequency field and not this one. > Ok, I had based my code on linus's git tree. I checked the 'pm-cpufreq' branch of Rafael's 'linux-pm' tree and freq_table.c contains the following code snippet in show_available_frequencies: if (show_boost ^ (table[i].driver_data == CPUFREQ_BOOST_FREQ)) continue; I suspect we're talking about different code bases. So could you please tell me which one should I be looking at ? > > >> > +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) > >> > +{ > >> > + int base, i; > >> > + > >> > +#ifdef CONFIG_SMP > >> > >> What will break if you don't have this ifdef here? Without that as well > >> below code should work? > > Missed this one? > > >> > + base = cpu_first_thread_sibling(policy->cpu); > >> > + > >> > + for (i = 0; i < threads_per_core; i++) > >> > + cpumask_set_cpu(base + i, policy->cpus); > >> > +#endif > >> > + policy->cpuinfo.transition_latency = 25000; > >> > + > >> > + policy->cur = powernv_freqs[0].frequency; > >> > + cpufreq_frequency_table_get_attr(powernv_freqs, policy->cpu); > >> > >> This doesn't exist anymore. > > > > Didn't get this comment! > > cpufreq_frequency_table_get_attr() routine doesn't exist anymore. > > >> > +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy) > >> > +{ > >> > + cpufreq_frequency_table_put_attr(policy->cpu); > >> > + return 0; > >> > +} > >> > >> You don't need this.. > > > > Why not ? > > Because cpufreq_frequency_table_get_attr() and > cpufreq_frequency_table_put_attr() don't exist anymore :) Well, it does in the codebases that I was looking at. May be I've been looking at the wrong place. > > >> > +module_init(powernv_cpufreq_init); > >> > +module_exit(powernv_cpufreq_exit); > >> > >> Place these right after the routines without any blank lines in > > between. > > > > Is this the new convention ? > > Don't know I have been following this since long time, probably from the > time I started with Mainline stuff.. I have seen many maintainers advising > this > as it make code more readable, specially if these routines are quite big.. > > Probably it isn't mentioned in coding guidelines but people follow it most of > the times. Ok. I was not aware of this. Good to know :-) > -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
On Fri, Mar 21, 2014 at 06:42:50PM +0530, Viresh Kumar wrote: > On 21 March 2014 18:34, Gautham R Shenoy wrote: > > Consider the case when pmspr = 0x00fe; > > > > We are interested in extracting the value 'fe'. And ensure that when > > we store this value into an int, we get the sign extension right. > > > > So the following doesn't work: > > > >pstate_id = (pmspr_val >> 48) & 0x; > > What about pstate_id = (pmspr_val >> 48) & 0xFF; ?? Nope. Suppose the pmspr_val is contained in the register %rax, and pstate_id corresponds to the address -20(%rbp) then: pstate_id = (pmspr_val >> 48) & 0xFF; would corrspond to shrq$48, %rax// Left shift by 48 bits andl$255, %eax // Mask the lower 32 bits of %rax with 0x00FF movl%eax, -20(%rbp) // Store the lower 32 bits of %rax into pstate_id On the other hand, pstate_id = (int)((s8)((pmspr_val >> 48) & 0xFF)); would correspond to: shrq$48, %rax // Left shift by 48 bits. movsbl %al, %eax // Move the lower 8 bits of %rax to %eax with sign-extension. movl%eax, -20(%rbp) // store the result in pstate_id; So with this, we are getting the sign extension due to the use of movsbl. And if local_pstate_id corresponds to the address -1(%rbp) then: local_pstate_id = (pmspr_val >> 48) & 0xFF; pstate_id = local_pstate_id; would correspond to: shrq$48, %rax// Left shift by 48 bits movb%al, -1(%rbp)//copy the lower 8 bits to local_pstate_id movsbl -1(%rbp), %eax //move the contents of local_pstate_id to %eax with sign extension. movl%eax, -20(%rbp) // Store the result in pstate_id Hope this helps :-) -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform
On Fri, Mar 21, 2014 at 04:24:27PM +0530, Viresh Kumar wrote: > >> > +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) > >> > +{ > >> > + int base, i; > >> > + > >> > + base = cpu_first_thread_sibling(policy->cpu); > >> > + > >> > + for (i = 0; i < threads_per_core; i++) > >> > + cpumask_set_cpu(base + i, policy->cpus); > >> > + policy->cpuinfo.transition_latency = 25000; > >> > + > >> > + policy->cur = powernv_freqs[0].frequency; > >> > + cpufreq_frequency_table_get_attr(powernv_freqs, policy->cpu); > >> > >> This doesn't exist anymore. Ok, I guess the right thing to do at this point is call cpufreq_table_validate_and_show(policy, powernv_freqs); Will fix the code to take care of this. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
Hi Ben, On Sat, Mar 22, 2014 at 09:56:30AM +1100, Benjamin Herrenschmidt wrote: > On Fri, 2014-03-21 at 16:34 +0530, Gautham R Shenoy wrote: > > > > > > > > > +/* > > > > + * Computes the current frequency on this cpu > > > > + * and stores the result in *ret_freq. > > > > + */ > > > > +static void powernv_read_cpu_freq(void *ret_freq) > > > > +{ > > > > + unsigned long pmspr_val; > > > > + s8 local_pstate_id; > > > > + int *cur_freq, freq, pstate_id; > > > > + > > > > + cur_freq = (int *)ret_freq; > > > > > > You don't need cur_freq variable at all.. > > > > I don't like it either. But the compiler complains without this hack > > :-( > > Casting integers into void * is a recipe for disaster... what is that > supposed to be about ? Like I mentioned elsewhere on this thread, we're calling powernv_read_cpu_freq via an smp_call_function(). We use this to obtain the frequency on the cpu where powernv_read_cpu_freq executes and return it to the caller of smp_call_function. > We lose all type checking and get exposed > to endian issues etc... the day somebody uses a different type on both > sides. > Yes, I understand the problem now. I'll think of a safer way to pass the return value. > Also is "freq" a frequency ? In this case an int isn't big enough. freq is the frequency stored in the cpufreq_table. The value is in kHz. So, int should be big enough. > Cheers, > Ben. > > -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4] powernv: Dynamic Frequency Scaling Enablement
From: "Gautham R. Shenoy" Hi, This is the v4 of the patchset to enable Dynamic Frequency Scaling on IBM PowerNV Platforms. I have incorporated the review comments from the previous version (can be found at [1]). In this version, * Multiple patches from the previous version have been into a single patch, since the higher numbered patches implemented some helper functions and the driver methods which should have been a part of the first patch to begin with. * Use the generic helpers provided by the cpufreq core available in the latest linux-next tree. * Fix the code to avoid casting integer pointers to void. The patch is based on top of the commit:06ed26d1de59ce7cbbe68378b7e470be169750e5 of the linux-next tree. [1]: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg76675.html Vaidyanathan Srinivasan (1): powernv, cpufreq: cpufreq driver for powernv platform arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/include/asm/reg.h| 4 + arch/powerpc/platforms/powernv/Kconfig| 6 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 372 ++ 7 files changed, 393 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The code was originally written by sva...@linux.vnet.ibm.com. Over time it was modified to accomodate bug-fixes as well as updates to the the cpu-freq core. Relevant portions of the change logs corresponding to those modifications are noted below: * The policy->cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy->related_cpus mask which is should not vary on cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com] * On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. [Authored by srivatsa.b...@linux.vnet.ibm.com] * Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. [Authored by e...@linux.vnet.ibm.com] * Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. [Authored by e...@linux.vnet.ibm.com] * Implement a powernv_cpufreq_get(unsigned int cpu) method which will return the current operating frequency. Export this via the sysfs interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com] [Change log updated by e...@linux.vnet.ibm.com] Reviewed-by: Preeti U Murthy Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Srivatsa S. Bhat Signed-off-by: Anton Blanchard Signed-off-by: Gautham R. Shenoy --- arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/include/asm/reg.h| 4 + arch/powerpc/platforms/powernv/Kconfig| 6 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 372 ++ 7 files changed, 393 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig index e9a8b4e..a285d44 100644 --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -353,3 +353,4 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m CONFIG_VIRTUALIZATION=y CONFIG_KVM_BOOK3S_64=m CONFIG_KVM_BOOK3S_64_HV=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig index 62771e0..47e6161 100644 --- a/arch/powerpc/configs/pseries_le_defconfig +++ b/arch/powerpc/configs/pseries_le_defconfig @@ -350,3 +350,4 @@ CONFIG_CRYPTO_LZO=m # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_DEV_NX=y CONFIG_CRYPTO_DEV_NX_ENCRYPT=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 90c06ec..84f92ca 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Regist
Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
On Thu, Mar 27, 2014 at 12:09:53PM +0530, Viresh Kumar wrote: > Cc'ing Rafael. > Thanks. It was a lapse on my part by not Cc'ing Rafael in the first place. > On 26 March 2014 22:25, Gautham R. Shenoy wrote: > > From: Vaidyanathan Srinivasan > > > > Backend driver to dynamically set voltage and frequency on > > IBM POWER non-virtualized platforms. Power management SPRs > > are used to set the required PState. > > > > This driver works in conjunction with cpufreq governors > > like 'ondemand' to provide a demand based frequency and > > voltage setting on IBM POWER non-virtualized platforms. > > > > PState table is obtained from OPAL v3 firmware through device > > tree. > > > > powernv_cpufreq back-end driver would parse the relevant device-tree > > nodes and initialise the cpufreq subsystem on powernv platform. > > > > The code was originally written by sva...@linux.vnet.ibm.com. Over > > time it was modified to accomodate bug-fixes as well as updates to the > > the cpu-freq core. Relevant portions of the change logs corresponding > > to those modifications are noted below: > > > > * The policy->cpus needs to be populated in a hotplug-invariant > >manner instead of using cpu_sibling_mask() which varies with > >cpu-hotplug. This is because the cpufreq core code copies this > >content into policy->related_cpus mask which is should not vary on > > s/is / Good catch! Will fix this [...] > > > * On POWER systems, the CPU frequency is controlled at a core-level > >and hence we need to serialize so that only one of the threads in > >the core switches the core's frequency at a time. Introduce > >per-core locking to enable finer-grained synchronization and > >thereby enhance the speed and responsiveness of the cpufreq driver > >to varying workload demands. > > > > The design of per-core locking is very simple and > >straight-forward: we first define a Per-CPU lock and use the ones > >that belongs to the first thread sibling of the core. > > > > cpu_first_thread_sibling() macro is used to find the *common* > >lock for all thread siblings belonging to a core. [Authored by > >srivatsa.b...@linux.vnet.ibm.com] > > We don't need that after serialization patch of core is accepted. And it > should be accepted soon, in one form or other. As of now, I prefer this patch be based on code that is in the -next tree. I'll get rid of the per-core locking once the serialization patch of the core is accepted. [...] > > --- a/arch/powerpc/configs/pseries_defconfig > > +++ b/arch/powerpc/configs/pseries_defconfig > > @@ -353,3 +353,4 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m > > CONFIG_VIRTUALIZATION=y > > CONFIG_KVM_BOOK3S_64=m > > CONFIG_KVM_BOOK3S_64_HV=y > > +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y > > diff --git a/arch/powerpc/configs/pseries_le_defconfig > > b/arch/powerpc/configs/pseries_le_defconfig > > index 62771e0..47e6161 100644 > > --- a/arch/powerpc/configs/pseries_le_defconfig > > +++ b/arch/powerpc/configs/pseries_le_defconfig > > @@ -350,3 +350,4 @@ CONFIG_CRYPTO_LZO=m > > # CONFIG_CRYPTO_ANSI_CPRNG is not set > > CONFIG_CRYPTO_DEV_NX=y > > CONFIG_CRYPTO_DEV_NX_ENCRYPT=m > > +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y > > don't know how Rafael want this, but probably this part could have gone > through ppc tree.. > That would mean multiple patches :-) > > + > > +#define pr_fmt(fmt)"powernv-cpufreq: " fmt > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > I thought I gave a comment on missing headers here? Ok, so these seem to be the missing ones. #include #include #include #include #include #include Have I missed anything else ? > > +static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; > > +static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; > > I though we don't need this anymore? Either Rafael will take my patch as > is for the BOOST fixup or we will end up creating .isboost field in > struct cpufreq_frequency_table and so you could have used .driver_data here. > I mentioned in my reply to your BOOST fixup patch that we would like pstate_ids to be of the type "signed int", while the .driver_data is an "unsigned int". If we end up using .driver_data, we would have to be careful and not use it directly but reassign it to a signed int (or typecast it) before using it. Not just that, the pr_debugs in the core which are printed during fr
Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
On Thu, Mar 27, 2014 at 03:29:36PM +0530, Viresh Kumar wrote: > On 27 March 2014 15:00, Gautham R Shenoy wrote: > > As of now, I prefer this patch be based on code that is in the -next > > tree. I'll get rid of the per-core locking once the serialization > > patch of the core is accepted. > > Okay.. Then divide this patch into two parts, second one doing all > the serialization stuff and I will review only the first one from V5 :) > > Maybe mention that in Patch2 as well, that once serialization patches > are accepted, its not required. Sure. This would be a good idea. > > >> don't know how Rafael want this, but probably this part could have gone > >> through ppc tree.. > > > > That would mean multiple patches :-) > > I wasn't opposed to multiple patches at all, but multiple patches for > basic functionality of same driver file. Otherwise separating things > into patches is the best receipe. Fair enough! > > >> > +#define pr_fmt(fmt)"powernv-cpufreq: " fmt > >> > + > >> > +#include > >> > +#include > >> > +#include > >> > +#include > >> > +#include > >> > >> I thought I gave a comment on missing headers here? > > > > Ok, so these seem to be the missing ones. > > > > #include > > #include > > #include > > This may shift to Patch 2 .. Ok. > > > #include > > #include > > > > #include > > > > Have I missed anything else ? > > Not sure :) .. I will see if I can find anymore.. > Cool. > >> > +static struct cpufreq_frequency_table > >> > powernv_freqs[POWERNV_MAX_PSTATES+1]; > >> > +static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; > >> > >> I though we don't need this anymore? Either Rafael will take my patch as > >> is for the BOOST fixup or we will end up creating .isboost field in > >> struct cpufreq_frequency_table and so you could have used .driver_data > >> here. > >> > > > > I mentioned in my reply to your BOOST fixup patch that we would like > > pstate_ids to be of the type "signed int", while the .driver_data is > > an "unsigned int". If we end up using .driver_data, we would have to > > be careful and not use it directly but reassign it to a signed int (or > > typecast it) before using it. > > Probably many people would want it to be unsigned it, so that they > can put some strange hex values there.. > > Which part of your code conflicts with this right now? I can see only > this line in powernv_set_freq() I am not denying the advantage of storing pstate_ids in .driver_data since we will then have all the information in one place. (That said, in the future if we want to export additional information, say pertaining to the voltage, we will have to keep a separate array anyway :-) ) However, as of now, I am wary about reusing a variable provided by the core, whose type is fixed (which is not the type I am interested in) and whose interpretation is ambiguous (since it is also being used as a BOOST indicator). When we have a .is_boost field, or a flags field in the cpufreq_table, and once the .driver_data field is completely opaque, I would be comfortable making use of .driver_data to store the pstate_ids. So I'll send out those patches once the fixes are present in the core. For now, let this code stay as it is. I think we should be ok with the additional overhead of 1kb as of now. > > freq_data.pstate_id = powernv_pstate_ids[new_index]; > > And I don't think we will have side effects here. > > > Not just that, the pr_debugs in the core which are printed during > > frequency transitions will then end up printing large positive values > > (since it will interpret negative pstate_ids as unsigned int) in the > > place of pstate_ids, which would not be particularly useful. > > I have checked it again, those prints shouldn't have used that field. > Will fix them. > > And so you can use that field in your code and I will get core fixed > for you.. Like I said earlier, once the core is fixed, I'll be comfortable using the .driver_data field for the purpose I need. Not until then :-) > > >> Why do you need to get these from DT? And not find that yourself here > >> instead? > > > > Well, I think we can obtain a more accurate value from the DT which > > would have already computed it. But I'll let vaidy answer this. > > Can we simply refer to frequency values here for finding min and max > pstates? If yes, then probably this is the better place as there can be > h
[PATCH v5 0/3] powernv,cpufreq: Dynamic Frequency Scaling support
From: "Gautham R. Shenoy" Hi, This is v5 of the patchset to enable dynamic frequency scaling on IBM PowerNV platforms. This patchset does address all the review comments obtained for v4 (which can be found at [1]). Changes from v4: * Created a separate patch to select the CPUFreq related Config options for PowerNV * Dropped the per-core locking hunks in drivers/cpufreq/powernv-cpufreq.c since the CPUFreq core takes care of the for us after the following commit which is present in linux-next: commit 12478cf0c55e5969f740bb38a24b1a0104ae18d8 Author: Srivatsa S. Bhat Date: Mon Mar 24 13:35:44 2014 +0530 cpufreq: Make sure frequency transitions are serialized * [PATCH v5 3/3] gets rid of the powernv_pstate_ids[] array that was being used to record the pstate ids. After the following patch it is safe to use cpufreq_frequency_table.driver_data since it is opaque to the cpufreq core: From: Viresh Kumar Date: 2014-03-28 13:53:47 url: http://marc.info/?l=linux-pm&m=139601416804702&w=2 cpufreq: create another field .flags in cpufreq_frequency_table The patchset is based on the commit 201544be8c37dffbf069bb5fc9edb5674f8c1754 of the linux-next tree. While all the patches in the patchset apply cleanly on linux-next, [PATCH v5 3/3] requires the Viresh's patch that was mentioned above. Otherwise, the frequency corresponding to pstate id -3 will be omited while reporting the "scaling_available_frequencies" in sysfs. [1]: http://marc.info/?l=linux-pm&m=139585297620612&w=2 Gautham R. Shenoy (2): powernv, cpufreq: Select CPUFreq related Kconfig options for powernv powernv,cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids Vaidyanathan Srinivasan (1): powernv, cpufreq: cpufreq driver for powernv platform arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/include/asm/reg.h| 4 + arch/powerpc/platforms/powernv/Kconfig| 6 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 341 ++ 7 files changed, 362 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v5 2/3] powernv, cpufreq: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The code was originally written by sva...@linux.vnet.ibm.com. Over time it was modified to accomodate bug-fixes as well as updates to the the cpu-freq core. Relevant portions of the change logs corresponding to those modifications are noted below: * The policy->cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy->related_cpus mask which should not vary on cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com] * Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. [Authored by e...@linux.vnet.ibm.com] * Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. [Authored by e...@linux.vnet.ibm.com] * Implement a powernv_cpufreq_get(unsigned int cpu) method which will return the current operating frequency. Export this via the sysfs interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com] [Change log updated by e...@linux.vnet.ibm.com] Reviewed-by: Preeti U Murthy Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Srivatsa S. Bhat Signed-off-by: Anton Blanchard Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/reg.h| 4 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 342 ++ 4 files changed, 355 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 1a36b8e..2189f8f 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..72564b7 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,11 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate "CPU frequency scaling for IBM POWERNV platform" + depends on PPC_POWERNV + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..e1e5197 --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,342 @@ +/* + * POWERNV cpufreq driver for the IBM POWER processors + * + * (C) Copyright IBM 2014 + * + * Autho
[PATCH v5 1/3] powernv, cpufreq: Select CPUFreq related Kconfig options for powernv
From: "Gautham R. Shenoy" Enable CPUFreq for PowerNV. Select "performance", "powersave", "userspace" and "ondemand" governors. Choose "ondemand" to be the default governor. Signed-off-by: Gautham R. Shenoy Signed-off-by: Srivatsa S. Bhat --- arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/platforms/powernv/Kconfig| 6 ++ 3 files changed, 8 insertions(+) diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig index 9ea8342b..a905063 100644 --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -306,3 +306,4 @@ CONFIG_KVM_BOOK3S_64=m CONFIG_KVM_BOOK3S_64_HV=y CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig index 3c84f9d..58e3dbf 100644 --- a/arch/powerpc/configs/pseries_le_defconfig +++ b/arch/powerpc/configs/pseries_le_defconfig @@ -301,3 +301,4 @@ CONFIG_CRYPTO_LZO=m # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_DEV_NX=y CONFIG_CRYPTO_DEV_NX_ENCRYPT=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..c252ee9 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,12 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE default y config PPC_POWERNV_RTAS -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v5 3/3] powernv, cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids
From: "Gautham R. Shenoy" The .driver_data field in the cpufreq_frequency_table was supposed to be private to the drivers. However at some later point, it was being used to indicate if the particular frequency in the table is the BOOST_FREQUENCY. After patches [1] and [2], the .driver_data is once again private to the driver. Thus we can safely use cpufreq_frequency_table.driver_data to store pstate_ids instead of having to maintain a separate array powernv_pstate_ids[] for this purpose. [1]: Subject: cpufreq: don't print value of .driver_data from core From : Viresh Kumar url: http://marc.info/?l=linux-pm&m=139601421504709&w=2 [2]: Subject: cpufreq: create another field .flags in cpufreq_frequency_table From : Viresh Kumar url: http://marc.info/?l=linux-pm&m=139601416804702&w=2 Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index e1e5197..9edccc6 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -33,7 +33,6 @@ #define POWERNV_MAX_PSTATES256 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; -static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; /* * Note: The set of pstates consists of contiguous integers, the @@ -112,7 +111,7 @@ static int init_powernv_pstates(void) pr_debug("PState id %d freq %d MHz\n", id, freq); powernv_freqs[i].frequency = freq * 1000; /* kHz */ - powernv_pstate_ids[i] = id; + powernv_freqs[i].driver_data = id; } /* End of list marker entry */ powernv_freqs[i].frequency = CPUFREQ_TABLE_END; @@ -283,7 +282,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy, { struct powernv_smp_call_data freq_data; - freq_data.pstate_id = powernv_pstate_ids[new_index]; + freq_data.pstate_id = powernv_freqs[new_index].driver_data; /* * Use smp_call_function to send IPI and execute the -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH REPOST v5 2/3] powernv, cpufreq: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The code was originally written by sva...@linux.vnet.ibm.com. Over time it was modified to accomodate bug-fixes as well as updates to the the cpu-freq core. Relevant portions of the change logs corresponding to those modifications are noted below: * The policy->cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy->related_cpus mask which should not vary on cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com] * Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. [Authored by e...@linux.vnet.ibm.com] * Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. [Authored by e...@linux.vnet.ibm.com] * Implement a powernv_cpufreq_get(unsigned int cpu) method which will return the current operating frequency. Export this via the sysfs interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com] [Change log updated by e...@linux.vnet.ibm.com] Reviewed-by: Preeti U Murthy Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Srivatsa S. Bhat Signed-off-by: Anton Blanchard Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/reg.h| 4 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 342 ++ 4 files changed, 355 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 1a36b8e..2189f8f 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..72564b7 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,11 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate "CPU frequency scaling for IBM POWERNV platform" + depends on PPC_POWERNV + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..e1e5197 --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,342 @@ +/* + * POWERNV cpufreq driver for the IBM POWER processors + * + * (C) Copyright IBM 2014 + * + * Autho
[PATCH REPOST v5 3/3] powernv, cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids
From: "Gautham R. Shenoy" The .driver_data field in the cpufreq_frequency_table was supposed to be private to the drivers. However at some later point, it was being used to indicate if the particular frequency in the table is the BOOST_FREQUENCY. After patches [1] and [2], the .driver_data is once again private to the driver. Thus we can safely use cpufreq_frequency_table.driver_data to store pstate_ids instead of having to maintain a separate array powernv_pstate_ids[] for this purpose. [1]: Subject: cpufreq: don't print value of .driver_data from core From : Viresh Kumar url: http://marc.info/?l=linux-pm&m=139601421504709&w=2 [2]: Subject: cpufreq: create another field .flags in cpufreq_frequency_table From : Viresh Kumar url: http://marc.info/?l=linux-pm&m=139601416804702&w=2 Signed-off-by: Gautham R. Shenoy --- drivers/cpufreq/powernv-cpufreq.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index e1e5197..9edccc6 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -33,7 +33,6 @@ #define POWERNV_MAX_PSTATES256 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; -static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; /* * Note: The set of pstates consists of contiguous integers, the @@ -112,7 +111,7 @@ static int init_powernv_pstates(void) pr_debug("PState id %d freq %d MHz\n", id, freq); powernv_freqs[i].frequency = freq * 1000; /* kHz */ - powernv_pstate_ids[i] = id; + powernv_freqs[i].driver_data = id; } /* End of list marker entry */ powernv_freqs[i].frequency = CPUFREQ_TABLE_END; @@ -283,7 +282,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy, { struct powernv_smp_call_data freq_data; - freq_data.pstate_id = powernv_pstate_ids[new_index]; + freq_data.pstate_id = powernv_freqs[new_index].driver_data; /* * Use smp_call_function to send IPI and execute the -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH REPOST v5 1/3] powernv, cpufreq: Select CPUFreq related Kconfig options for powernv
From: "Gautham R. Shenoy" Enable CPUFreq for PowerNV. Select "performance", "powersave", "userspace" and "ondemand" governors. Choose "ondemand" to be the default governor. Signed-off-by: Gautham R. Shenoy Signed-off-by: Srivatsa S. Bhat --- arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/platforms/powernv/Kconfig| 6 ++ 3 files changed, 8 insertions(+) diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig index 9ea8342b..a905063 100644 --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -306,3 +306,4 @@ CONFIG_KVM_BOOK3S_64=m CONFIG_KVM_BOOK3S_64_HV=y CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig index 3c84f9d..58e3dbf 100644 --- a/arch/powerpc/configs/pseries_le_defconfig +++ b/arch/powerpc/configs/pseries_le_defconfig @@ -301,3 +301,4 @@ CONFIG_CRYPTO_LZO=m # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_DEV_NX=y CONFIG_CRYPTO_DEV_NX_ENCRYPT=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..c252ee9 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,12 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE default y config PPC_POWERNV_RTAS -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] book3s_hv: Handle H_DOORBELL on the guest exit path
Currently a CPU running a guest can receive a H_DOORBELL in the following two cases: 1) When the CPU is napping due to CEDE or there not being a guest vcpu. 2) The CPU is running the guest vcpu. Case 1), the doorbell message is not cleared since we were waking up from nap. Hence when the EE bit gets set on transition from guest to host, the H_DOORBELL interrupt is delivered to the host and the corresponding handler is invoked. However in Case 2), the message gets cleared by the action of taking the H_DOORBELL interrupt. Since the CPU was running a guest, instead of invoking the doorbell handler, the code invokes the second-level interrupt handler to switch the context from the guest to the host. At this point the setting of the EE bit doesn't result in the CPU getting the doorbell interrupt since it has already been delivered once. So, the handler for this doorbell is never invoked! This causes softlockups if the missed DOORBELL was as IPI sent from a sibling subcore CPU. This patch fixes it by explitly invoking the doorbell handler on the exit path if the exit reason is H_DOORBELL similar to the way an EXTERNAL interrupt is handled. Since this will also handle Case 1), we can unconditionally clear the doorbell message in kvmppc_check_wake_reason. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index b98889e..106c7f9 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -150,6 +150,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) cmpwi cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL beq 11f + cmpwi r12, BOOK3S_INTERRUPT_H_DOORBELL + beq 15f /* Invoke the H_DOORBELL handler */ cmpwi cr2, r12, BOOK3S_INTERRUPT_HMI beq cr2, 14f/* HMI check */ @@ -174,6 +176,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) mtspr SPRN_HSRR1, r7 b hmi_exception_after_realmode +15:mtspr SPRN_HSRR0, r8 + mtspr SPRN_HSRR1, r7 + ba0xe80 + kvmppc_primary_no_guest: /* We handle this much like a ceded vcpu */ /* put the HDEC into the DEC, since HDEC interrupts don't wake us */ @@ -2436,14 +2442,19 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) /* hypervisor doorbell */ 3: li r12, BOOK3S_INTERRUPT_H_DOORBELL + + /* +* Clear the doorbell as we will invoke the handler +* explicitly in the guest exit path. +*/ + lis r6, (PPC_DBELL_SERVER << (63-36))@h + PPC_MSGCLR(6) /* see if it's a host IPI */ li r3, 1 lbz r0, HSTATE_HOST_IPI(r13) cmpwi r0, 0 bnelr - /* if not, clear it and return -1 */ - lis r6, (PPC_DBELL_SERVER << (63-36))@h - PPC_MSGCLR(6) + /* if not, return -1 */ li r3, -1 blr -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RESEND v4 3/4] cpufreq: powernv: Add a trace print for the throttle event
Hi Shilpa, Just saw this resend! On Tue, Jan 12, 2016 at 04:24:26AM -0600, Shilpasri G Bhat wrote: > Record the throttle event with a trace print replacing the printk, > except for events like throttling below nominal and occ reset > event which print a warning message. > > Signed-off-by: Shilpasri G Bhat > --- [..snip..] > > -static void powernv_cpufreq_throttle_check(void *data) > +static void powernv_cpufreq_check_pmax(void) ^^^ This function only contains code moved from powernv_cpufreq_throttle_check with pr_crit/pr_warns replaced by trace_powernv_throttle. Furthermore, it is not called from any other place. Given that the original function was ~60 lines do we really need to split it into two separate functions ? If yes, could it be an inline function ? > { > unsigned int cpu = smp_processor_id(); > unsigned int chip_id = pir_to_chip_id(hard_smp_processor_id()); > - unsigned long pmsr; > int pmsr_pmax, i; > > - pmsr = get_pmspr(SPRN_PMSR); > + pmsr_pmax = (s8)PMSR_MAX(get_pmspr(SPRN_PMSR)); > > for (i = 0; i < nr_chips; i++) > if (chips[i].id == chip_id) > break; > > - /* Check for Pmax Capping */ > - pmsr_pmax = (s8)PMSR_MAX(pmsr); > if (pmsr_pmax != powernv_pstate_info.max) { > if (chips[i].throttled) > - goto next; > + return; > + > chips[i].throttled = true; > if (pmsr_pmax < powernv_pstate_info.nominal) > - pr_crit("CPU %d on Chip %u has Pmax reduced below > nominal frequency (%d < %d)\n", > - cpu, chips[i].id, pmsr_pmax, > - powernv_pstate_info.nominal); > - else > - pr_info("CPU %d on Chip %u has Pmax reduced below turbo > frequency (%d < %d)\n", > - cpu, chips[i].id, pmsr_pmax, > - powernv_pstate_info.max); > + pr_warn_once("CPU %d on Chip %u has Pmax reduced below > nominal frequency (%d < %d)\n", > + cpu, chips[i].id, pmsr_pmax, > + powernv_pstate_info.nominal); > + > + trace_powernv_throttle(chips[i].id, > +throttle_reason[chips[i].throt_reason], > +pmsr_pmax); > } else if (chips[i].throttled) { > chips[i].throttled = false; > - pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu, > - chips[i].id, pmsr_pmax); > + trace_powernv_throttle(chips[i].id, > +throttle_reason[chips[i].throt_reason], > +pmsr_pmax); > } > +} > + > +static void powernv_cpufreq_throttle_check(void *data) > +{ > + unsigned long pmsr; > + > + pmsr = get_pmspr(SPRN_PMSR); > + > + /* Check for Pmax Capping */ > + powernv_cpufreq_check_pmax(); If you want to retain this function, you could pass pmsr as an argument instead of computing it afresh in powernv_cpufreq_check_pmax() > /* Check if Psafe_mode_active is set in PMSR. */ > -next: > if (pmsr & PMSR_PSAFE_ENABLE) { > throttled = true; > pr_info("Pstate set to safe frequency\n"); -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RESEND v4 4/4] cpufreq: powernv: Add sysfs attributes to show throttle stats
Hi Shilpa, On Tue, Jan 12, 2016 at 04:24:27AM -0600, Shilpasri G Bhat wrote: > +static inline int get_chip_index(struct kobject *kobj) Probably have "get_chip_index(int id)". See the reason below. > +{ > + int i, id; > + > + i = kstrtoint(kobj->name + 4, 0, &id); > + if (i) > + return i; > + > + for (i = 0; i < nr_chips; i++) > + if (chips[i].id == id) > + return i; This pattern to obtain a chip index from the chip id is repeated in multiple place inside this file. Might be worthwhile to move this to a helper function, i.e get_chip_index(id)! > + return -EINVAL; > +} > + > +static ssize_t throttle_freq_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + int i, count = 0, id; > + We obtain the id from kobj here and then obtain the index from id via the function below. > + id = get_chip_index(kobj); > + if (id < 0) > + return id; > + > + for (i = 0; i < powernv_pstate_info.nr_pstates; i++) > + count += sprintf(&buf[count], "%d %d\n", > +powernv_freqs[i].frequency, > +chips[id].pstate_stat[i]); > + > + return count; > +} > + > +static struct kobj_attribute attr_throttle_frequencies = > +__ATTR(throttle_frequencies, 0444, throttle_freq_show, NULL); > + -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v5 5/5] cpufreq: powernv: Add sysfs attributes to show throttle stats
On Thu, Jan 21, 2016 at 03:08:59PM +0530, Shilpasri G Bhat wrote: > Signed-off-by: Shilpasri G Bhat Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 2/5] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path
Hello Balbir, On Sat, Jan 23, 2016 at 07:59:20PM +1100, Balbir Singh wrote: > On Fri, 22 Jan 2016 12:49:02 +0530 > Shilpasri G Bhat wrote: > > > cpu_to_chip_id() does a DT walk through to find out the chip id by > > taking a contended device tree lock. This adds an unnecessary overhead > > in a hot path. So instead of calling cpu_to_chip_id() everytime cache > > the chip ids for all cores in the array 'core_to_chip_map' and use it > > in the hotpath. > > > > Reported-by: Anton Blanchard > > Signed-off-by: Shilpasri G Bhat > > Reviewed-by: Gautham R. Shenoy > > snip > > Does the core_to_chip_map need to be updated/refreshed on/after/ a > cpu (core) hotplug? I presume id's don't change No, the id's don't change on cpu/core hotplug. The core_to_chip_map is initialized in init_chip_info() where we use for_each_possible_cpu(). Thanks for the review! > > Balbir > -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 1/6] cpufreq: powernv: Free 'chips' on module exit
On Thu, Jan 28, 2016 at 12:55:36PM +0530, Shilpasri G Bhat wrote: > This will free the dynamically allocated memory of'chips' on > module exit. > > Signed-off-by: Shilpasri G Bhat Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 3/6] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path
On Thu, Jan 28, 2016 at 12:55:38PM +0530, Shilpasri G Bhat wrote: > cpu_to_chip_id() does a DT walk through to find out the chip id by > taking a contended device tree lock. This adds an unnecessary overhead > in a hot path. So instead of calling cpu_to_chip_id() everytime cache > the chip ids for all cores in the array 'core_to_chip_map' and use it > in the hotpath. > > Reported-by: Anton Blanchard > Signed-off-by: Shilpasri G Bhat Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 5/6] cpufreq: powernv: Replace pr_info with trace print for throttle event
On Thu, Jan 28, 2016 at 12:55:40PM +0530, Shilpasri G Bhat wrote: > Currently we use printk message to notify the throttle event. But this > can flood the console if the cpu is throttled frequently. So replace the > printk with the tracepoint to notify the throttle event. And also events > like throttle below nominal frequency and OCC_RESET are reduced to > pr_warn/pr_warn_once as pointed by MFG to not mark them as critical > messages. This patch adds 'throttle_reason' to struct chip to store the > throttle reason. > > Signed-off-by: Shilpasri G Bhat Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 6/6] cpufreq: powernv: Add sysfs attributes to show throttle stats
Hi Shilpa, A minor nit. On Thu, Jan 28, 2016 at 12:55:41PM +0530, Shilpasri G Bhat wrote: [..snip..] > + > +What: > /sys/devices/system/cpu/cpufreq/chip*/throttle_reasons/ > +Date:Jan 2016 > +Contact: Linux kernel mailing list > + Linux for PowerPC mailing list > +Description: CPU Frequency throttle reason stat for the chip > + > + This directory contains throttle reason files. Each file gives > + the total number of times the max frequency is throttled, except > + for 'unthrottle_count', which gives the total number of times > + the max frequency is unthrottled after being throttled. Below > + are the reason attributes. > + > + cpu_over_temperature: Throttled due to cpu over temperature > + > + occ_reset: Throttled due to reset of OCC > + > + over_current: Throttled due to over current Overcurrent is a single word. No need of the extra space. You could fix that and add my Reviewed-by. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v8 6/6] cpufreq: powernv: Add sysfs attributes to show throttle stats
Hi Viresh, > > What I can suggest is: > - Move this directory inside cpuX/cpufreq/ directory, in a similar way > as to how we create 'stats' directory today. > - You can then get policy->cpu, to get chip->id out of it. > - The only disadvantage here is that the same chip directory will be > replicated in multiple policies, but that makes it more readable. Thinking about it, having a sysfs group attached to a policy kobject looks ok if replication of the same chip information across multiple policies is not objectionable. Regarding the table-format, it breaks the sysfs's one-value-per-file rule. So I would still prefer each throttle reason being a separate file which gives the number of times the chip frequency was throttled due to that reason. We can live without the per-frequency throttle stats listed in the throttle_status. So, would the following be sysfs group structure be acceptable? $ls -1 /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/ unthrottle powercap overtemp supply_fault overcurrent occ_reset turbo_stat sub_turbo_stat -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] powerpc/pseries: Only wait for dying CPU after call to rtas_stop_self()
Hello Thiago, On Fri, Feb 22, 2019 at 07:57:52PM -0300, Thiago Jung Bauermann wrote: > When testing DLPAR CPU add/remove on a system under stress, > pseries_cpu_die() doesn't wait long enough for a CPU to die: > > [ 446.983944] cpu 148 (hwid 148) Ready to die... > [ 446.984062] cpu 149 (hwid 149) Ready to die... > [ 446.993518] cpu 150 (hwid 150) Ready to die... > [ 446.993543] Querying DEAD? cpu 150 (150) shows 2 > [ 446.994098] cpu 151 (hwid 151) Ready to die... > [ 447.133726] cpu 136 (hwid 136) Ready to die... > [ 447.403532] cpu 137 (hwid 137) Ready to die... > [ 447.403772] cpu 138 (hwid 138) Ready to die... > [ 447.403839] cpu 139 (hwid 139) Ready to die... > [ 447.403887] cpu 140 (hwid 140) Ready to die... > [ 447.403937] cpu 141 (hwid 141) Ready to die... > [ 447.403979] cpu 142 (hwid 142) Ready to die... > [ 447.404038] cpu 143 (hwid 143) Ready to die... > [ 447.513546] cpu 128 (hwid 128) Ready to die... > [ 447.693533] cpu 129 (hwid 129) Ready to die... > [ 447.693999] cpu 130 (hwid 130) Ready to die... > [ 447.703530] cpu 131 (hwid 131) Ready to die... > [ 447.704087] Querying DEAD? cpu 132 (132) shows 2 > [ 447.704102] cpu 132 (hwid 132) Ready to die... > [ 447.713534] cpu 133 (hwid 133) Ready to die... > [ 447.714064] Querying DEAD? cpu 134 (134) shows 2 > > This is a race between one CPU stopping and another one calling > pseries_cpu_die() to wait for it to stop. That function does a short busy > loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify > that it is stopped, but I think there's a lot for the stopping CPU to do > which may take longer than this loop allows. > > As can be seen in the dmesg right before or after the "Querying DEAD?" > messages, if pseries_cpu_die() waited a little longer it would have seen > the CPU in the stopped state. > > I see two cases that can be causing this race: > > 1. It's possible that CPU 134 was inactive at the time it was unplugged. In >that case, dlpar_offline_cpu() calls H_PROD on that CPU and immediately >calls pseries_cpu_die(). Meanwhile, the prodded CPU activates and start >the process of stopping itself. It's possible that the busy loop is not >long enough to allow for the CPU to wake up and complete the stopping >process. The problem is a bit more severe since, after printing "Querying DEAD?" for CPU X, this CPU can prod another offline CPU Y on the same core which, on waking up, will call rtas_stop_self. Thus we can have two concurrent calls to rtas-stop-self, which is prohibited by the PAPR. > > 2. If CPU 134 was online at the time it was unplugged, it would have gone >through the new CPU hotplug state machine in kernel/cpu.c that was >introduced in v4.6 to get itself stopped. It's possible that the busy >loop in pseries_cpu_die() was long enough for the older hotplug code but >not for the new hotplug state machine. I haven't been able to observe the "Querying DEAD?" messages for the online CPU which was being offlined and dlpar'ed out. > > I don't know if this race condition has any ill effects, but we can make > the race a lot more even if we only start querying if the CPU is stopped > when the stopping CPU is close to call rtas_stop_self(). > > Since pseries_mach_cpu_die() sets the CPU current state to offline almost > immediately before calling rtas_stop_self(), we use that as a signal that > it is either already stopped or very close to that point, and we can start > the busy loop. > > As suggested by Michael Ellerman, this patch also changes the busy loop to > wait for a fixed amount of wall time. > > Signed-off-by: Thiago Jung Bauermann > --- > arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > I tried to estimate good amounts for the timeout and loop delays, but > I'm not sure how reasonable my numbers are. The busy loops will wait for > 100 µs between each try, and spin_event_timeout() will timeout after > 100 ms. I'll be happy to change these values if you have better > suggestions. Based on the measurements that I did on a POWER9 system, in successful cases of smp_query_cpu_stopped(cpu) returning affirmative, the maximum time spent inside the loop was was 10ms. > Gautham was able to test this patch and it solved the race condition. > > v1 was a cruder patch which just increased the number of loops: > https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html > > v1 also mentioned a kernel crash but Gautham narrowed it down to a bug > in RTAS, which is in the process of being fixed. > > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c > b/arch/powerpc/platforms/pseries/hotplug-cpu.c > index 97feb6e79f1a..424146cc752e 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c > @@ -214,13 +214,21 @@ static void pseries_cpu_die(unsigned int cpu) > msleep(1); >
Re: [PATCH] powernv: powercap: Add hard minimum powercap
Hi Shilpa, On Thu, Feb 28, 2019 at 11:25:25AM +0530, Shilpasri G Bhat wrote: > Hi, > > On 02/28/2019 10:14 AM, Daniel Axtens wrote: > > Shilpasri G Bhat writes: > > > >> In POWER9, OCC(On-Chip-Controller) provides for hard and soft system > >> powercapping range. The hard powercap range is guaranteed while soft > >> powercap may or may not be asserted due to various power-thermal > >> reasons based on system configuration and workloads. This patch adds > >> a sysfs file to export the hard minimum powercap limit to allow the > >> user to set the appropriate powercap value that can be managed by the > >> system. > > > > Maybe it's common terminology and I'm just not aware of it, but what do > > you mean by "asserted"? It doesn't appear elsewhere in the documentation > > you're patching, and it's not a use of assert that I'm familiar with... > > > > Regards, > > Daniel > > > > I meant to say powercap will not be assured in the soft powercap range, i.e, > system may or may not be throttled of CPU frequency to remain within the > powercap. > > I can reword the document and commit message. I agree with Daniel. How about replacing "asserted" with "enforced by the OCC"? > > Thanks and Regards, > Shilpa -- Thanks and Regards gautham. > > >> > >> Signed-off-by: Shilpasri G Bhat > >> --- > >> .../ABI/testing/sysfs-firmware-opal-powercap | 10 > >> arch/powerpc/platforms/powernv/opal-powercap.c | 66 > >> +- > >> 2 files changed, 37 insertions(+), 39 deletions(-) > >> > >> diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-powercap > >> b/Documentation/ABI/testing/sysfs-firmware-opal-powercap > >> index c9b66ec..65db4c1 100644 > >> --- a/Documentation/ABI/testing/sysfs-firmware-opal-powercap > >> +++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap > >> @@ -29,3 +29,13 @@ Description:System powercap directory and > >> attributes applicable for > >> creates a request for setting a new-powercap. The > >> powercap requested must be between powercap-min > >> and powercap-max. > >> + > >> +What: > >> /sys/firmware/opal/powercap/system-powercap/powercap-hard-min > >> +Date: Feb 2019 > >> +Contact: Linux for PowerPC mailing list > >> +Description: Hard minimum powercap > >> + > >> + This file provides the hard minimum powercap limit in > >> + Watts. The powercap value above hard minimum is always > >> + guaranteed to be asserted and the powercap value below > >> + the hard minimum limit may or may not be guaranteed. > >> diff --git a/arch/powerpc/platforms/powernv/opal-powercap.c > >> b/arch/powerpc/platforms/powernv/opal-powercap.c > >> index d90ee4f..38408e7 100644 > >> --- a/arch/powerpc/platforms/powernv/opal-powercap.c > >> +++ b/arch/powerpc/platforms/powernv/opal-powercap.c > >> @@ -139,10 +139,24 @@ static void powercap_add_attr(int handle, const char > >> *name, > >>attr->handle = handle; > >>sysfs_attr_init(&attr->attr.attr); > >>attr->attr.attr.name = name; > >> - attr->attr.attr.mode = 0444; > >> + > >> + if (!strncmp(name, "powercap-current", strlen(name))) { > >> + attr->attr.attr.mode = 0664; > >> + attr->attr.store = powercap_store; > >> + } else { > >> + attr->attr.attr.mode = 0444; > >> + } > >> + > >>attr->attr.show = powercap_show; > >> } > >> > >> +static const char * const powercap_strs[] = { > >> + "powercap-max", > >> + "powercap-min", > >> + "powercap-current", > >> + "powercap-hard-min", > >> +}; > >> + > >> void __init opal_powercap_init(void) > >> { > >>struct device_node *powercap, *node; > >> @@ -167,60 +181,34 @@ void __init opal_powercap_init(void) > >> > >>i = 0; > >>for_each_child_of_node(powercap, node) { > >> - u32 cur, min, max; > >> - int j = 0; > >> - bool has_cur = false, has_min = false, has_max = false; > >> + u32 id; > >> + int j, count = 0; > >> > >> - if (!of_property_read_u32(node, "powercap-min", &min)) { > >> - j++; > >> - has_min = true; > >> - } > >> - > >> - if (!of_property_read_u32(node, "powercap-max", &max)) { > >> - j++; > >> - has_max = true; > >> - } > >> + for (j = 0; j < ARRAY_SIZE(powercap_strs); j++) > >> + if (!of_property_read_u32(node, powercap_strs[j], &id)) > >> + count++; > >> > >> - if (!of_property_read_u32(node, "powercap-current", &cur)) { > >> - j++; > >> - has_cur = true; > >> - } > >> - > >> - pcaps[i].pattrs = kcalloc(j, sizeof(struct powercap_attr), > >> + pcaps[i].pattrs = kcalloc(count, sizeof(struct powercap_attr), > >> GFP_KERNEL); > >>if (!pcaps[i].pattrs) > >>
Re: [PATCH v2] powerpc/powernv/idle: Restore IAMR after idle
Hello Russell, On Fri, Feb 08, 2019 at 10:11:03PM +1100, Russell Currey wrote: > Without restoring the IAMR after idle, execution prevention on POWER9 > with Radix MMU is overwritten and the kernel can freely execute userspace > without > faulting. > > This is necessary when returning from any stop state that modifies user > state, as well as hypervisor state. > > To test how this fails without this patch, load the lkdtm driver and > do the following: > >echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT > > which won't fault, then boot the kernel with powersave=off, where it > will fault. Applying this patch will fix this. > > Fixes: 3b10d0095a1e ("powerpc/mm/radix: Prevent kernel execution of user > space") > Cc: > Signed-off-by: Russell Currey > --- > Since v1: > - no longer use paca to save IAMR, instead use _DAR (thanks mpe) Looks good to me. Once we move to Nick Piggin's C-based save/restore code, we will be saving all these SPR values on the stack anyway. Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham.
[PATCH] pseries/energy: Use OF accessor functions to read ibm, drc-indexes
From: "Gautham R. Shenoy" In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent, we currently use of_read_property() to obtain the pointer to the array corresponding to the property "ibm,drc-indexes". The elements of this array are of type __be32, but are accessed without any conversion to the OS-endianness, which is buggy on a Little Endian OS. Fix this by using of_property_read_u32_index() accessor function to safely read the elements of the array. Fixes: commit e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU indexes") Cc: #v4.16+ Reported-by: Pavithra R. Prakash Signed-off-by: Gautham R. Shenoy --- arch/powerpc/platforms/pseries/pseries_energy.c | 27 - 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c index 6ed2212..1c4d1ba 100644 --- a/arch/powerpc/platforms/pseries/pseries_energy.c +++ b/arch/powerpc/platforms/pseries/pseries_energy.c @@ -77,18 +77,27 @@ static u32 cpu_to_drc_index(int cpu) ret = drc.drc_index_start + (thread_index * drc.sequential_inc); } else { - const __be32 *indexes; - - indexes = of_get_property(dn, "ibm,drc-indexes", NULL); - if (indexes == NULL) - goto err_of_node_put; + u32 nr_drc_indexes, thread_drc_index; /* -* The first element indexes[0] is the number of drc_indexes -* returned in the list. Hence thread_index+1 will get the -* drc_index corresponding to core number thread_index. +* The first element of ibm,drc-indexes array is the +* number of drc_indexes returned in the list. Hence +* thread_index+1 will get the drc_index corresponding +* to core number thread_index. */ - ret = indexes[thread_index + 1]; + rc = of_property_read_u32_index(dn, "ibm,drc-indexes", + 0, &nr_drc_indexes); + if (rc) + goto err_of_node_put; + + WARN_ON(thread_index > nr_drc_indexes); + rc = of_property_read_u32_index(dn, "ibm,drc-indexes", + thread_index + 1, + &thread_drc_index); + if (rc) + goto err_of_node_put; + + ret = thread_drc_index; } rc = 0; -- 1.9.4
Re: [PATCH v3] powerpc/pseries: Only wait for dying CPU after call to rtas_stop_self()
Hello Thiago, On Mon, Mar 11, 2019 at 04:35:17PM -0300, Thiago Jung Bauermann wrote: > When testing DLPAR CPU add/remove on a system under stress, > pseries_cpu_die() doesn't wait long enough for a CPU to die: > > [ 446.983944] cpu 148 (hwid 148) Ready to die... > [ 446.984062] cpu 149 (hwid 149) Ready to die... > [ 446.993518] cpu 150 (hwid 150) Ready to die... > [ 446.993543] Querying DEAD? cpu 150 (150) shows 2 > [ 446.994098] cpu 151 (hwid 151) Ready to die... > [ 447.133726] cpu 136 (hwid 136) Ready to die... > [ 447.403532] cpu 137 (hwid 137) Ready to die... > [ 447.403772] cpu 138 (hwid 138) Ready to die... > [ 447.403839] cpu 139 (hwid 139) Ready to die... > [ 447.403887] cpu 140 (hwid 140) Ready to die... > [ 447.403937] cpu 141 (hwid 141) Ready to die... > [ 447.403979] cpu 142 (hwid 142) Ready to die... > [ 447.404038] cpu 143 (hwid 143) Ready to die... > [ 447.513546] cpu 128 (hwid 128) Ready to die... > [ 447.693533] cpu 129 (hwid 129) Ready to die... > [ 447.693999] cpu 130 (hwid 130) Ready to die... > [ 447.703530] cpu 131 (hwid 131) Ready to die... > [ 447.704087] Querying DEAD? cpu 132 (132) shows 2 > [ 447.704102] cpu 132 (hwid 132) Ready to die... > [ 447.713534] cpu 133 (hwid 133) Ready to die... > [ 447.714064] Querying DEAD? cpu 134 (134) shows 2 > > This is a race between one CPU stopping and another one calling > pseries_cpu_die() to wait for it to stop. That function does a short busy > loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify > that it is stopped, but I think there's a lot for the stopping CPU to do > which may take longer than this loop allows. > > As can be seen in the dmesg right before or after the "Querying DEAD?" > messages, if pseries_cpu_die() waited a little longer it would have seen > the CPU in the stopped state. > > What I think is going on is that CPU 134 was inactive at the time it was > unplugged. In that case, dlpar_offline_cpu() calls H_PROD on that CPU and > immediately calls pseries_cpu_die(). Meanwhile, the prodded CPU activates > and start the process of stopping itself. The busy loop is not long enough > to allow for the CPU to wake up and complete the stopping process. > > This can be a problem because if the busy loop finishes too early, then the > kernel may offline another CPU before the previous one finished dying, > which would lead to two concurrent calls to rtas-stop-self, which is > prohibited by the PAPR. > > We can make the race a lot more even if we only start querying if the CPU > is stopped when the stopping CPU is close to call rtas_stop_self(). Since > pseries_mach_cpu_die() sets the CPU current state to offline almost > immediately before calling rtas_stop_self(), we use that as a signal that > it is either already stopped or very close to that point, and we can start > the busy loop. > > As suggested by Michael Ellerman, this patch also changes the busy loop to > wait for a fixed amount of wall time. Based on the measurements that > Gautham did on a POWER9 system, in successful cases of > smp_query_cpu_stopped(cpu) returning affirmative, the maximum time spent > inside the loop was was 10 ms. This patch loops for 20 ms just be sure. > > Signed-off-by: Thiago Jung Bauermann Thanks for this version. I have tested the patch and we no longer see the "Querying DEAD? cpu X (Y) shows 2" message. Tested-and-Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham.
Re: [PATCH 1/2] cpuidle : auto-promotion for cpuidle states
Hello Daniel, On Thu, Apr 4, 2019 at 3:52 PM Daniel Lezcano wrote: > > > Hi Abhishek, > > thanks for taking the time to test the different scenario and give us > the numbers. > > On 01/04/2019 07:11, Abhishek wrote: > > [.. snip..] > > In case of POWER, this is problematic, when the predicted state in the > aforementioned scenario is a lite stop state, as such lite states will > inhibit SMT folding, thereby depriving the other threads in the core > from > using the core resources. > > I can understand an idle state can prevent other threads to use the core > resources. But why a deeper idle state does not prevent this also? On POWER9, we have the following classes of platform idle states (called stop states) lite: These do not lose any context including the user context. In this state GPRs are also preserved (stop0_lite) shallow : These lose user context,but not the hypervisor context. So GPRs are lost but not SPRs. (stop0, stop1, stop2) deep: These lose hypervisor context. (stop4, stop5) In the case of lite stop state, only instruction dispatch on the CPU thread is paused. The thread does not give up its registers set in this state for the use of its busy sibling threads in the core. Hence, SMT folding does not happen in this state. With respect to shallow and deep states, not only is the instruction dispatch paused, it also gives up its registers set for the other siblings to use These stop states are beneficial for SMT folding. Hence, if a CPU thread remains in a lite state for too long, its siblings in the core would not be able to utilize the full resources of the core for that duration, thereby inhibiting single thread performance. This is not the case with non-lite states. -- Thanks and Regards gautham.
Re: [PATCH] powerpc/watchdog: Use hrtimers for per-CPU heartbeat
Hello Nicholas, On Tue, Apr 2, 2019 at 4:57 PM Nicholas Piggin wrote: > > Using a jiffies timer creates a dependency on the tick_do_timer_cpu > incrementing jiffies. If that CPU has locked up and jiffies is not > incrementing, the watchdog heartbeat timer for all CPUs stops and > creates false positives and confusing warnings on local CPUs, and > also causes the SMP detector to stop, so the root cause is never > detected. > > Fix this by using hrtimer based timers for the watchdog heartbeat, > like the generic kernel hardlockup detector. > > Reported-by: Ravikumar Bangoria > Signed-off-by: Nicholas Piggin [..snip..] > @@ -325,19 +325,21 @@ EXPORT_SYMBOL(arch_touch_nmi_watchdog); > > static void start_watchdog_timer_on(unsigned int cpu) > { > - struct timer_list *t = per_cpu_ptr(&wd_timer, cpu); > + struct hrtimer *hrtimer = this_cpu_ptr(&wd_hrtimer); This function can be called during the initialization via watchdog_nmi_start --> for_each_online_cpu(cpu) start_wd_on_cpu(cpu) --> start_watchdog_timer_on(cpu) Thus, it is not guarateed that we are always calling start_watchdog_timer_on() from the CPU where we want to start the watchdog timer. Thus, should we be calling this function from start_wd_on_cpu() via an smp_call_function_single() ? > > per_cpu(wd_timer_tb, cpu) = get_tb(); > > - timer_setup(t, wd_timer_fn, TIMER_PINNED); > - wd_timer_reset(cpu, t); > + hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); > + hrtimer->function = watchdog_timer_fn; > + hrtimer_start(hrtimer, ms_to_ktime(wd_timer_period_ms), > + HRTIMER_MODE_REL_PINNED); > } > > static void stop_watchdog_timer_on(unsigned int cpu) > { > - struct timer_list *t = per_cpu_ptr(&wd_timer, cpu); > + struct hrtimer *hrtimer = this_cpu_ptr(&wd_hrtimer); > > - del_timer_sync(t); > + hrtimer_cancel(hrtimer); > } > > static int start_wd_on_cpu(unsigned int cpu) > -- > 2.20.1 > -- Thanks and Regards gautham.
Re: [PATCH v2] powerpc/watchdog: Use hrtimers for per-CPU heartbeat
Hello Nicholas, On Tue, Apr 09, 2019 at 02:40:05PM +1000, Nicholas Piggin wrote: > Using a jiffies timer creates a dependency on the tick_do_timer_cpu > incrementing jiffies. If that CPU has locked up and jiffies is not > incrementing, the watchdog heartbeat timer for all CPUs stops and > creates false positives and confusing warnings on local CPUs, and > also causes the SMP detector to stop, so the root cause is never > detected. > > Fix this by using hrtimer based timers for the watchdog heartbeat, > like the generic kernel hardlockup detector. > > Cc: Gautham R. Shenoy > Reported-by: Ravikumar Bangoria > Signed-off-by: Nicholas Piggin Looks good to me. Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham.
[PATCH v3 4/6] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
From: "Gautham R. Shenoy" On Pseries LPARs, to calculate utilization, we need to know the [S]PURR ticks when the CPUs were busy or idle. The total PURR and SPURR ticks are already exposed via the per-cpu sysfs files "purr" and "spurr". This patch adds support for exposing the idle PURR and SPURR ticks via new per-cpu sysfs files named "idle_purr" and "idle_spurr". Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kernel/sysfs.c | 54 ++--- 1 file changed, 51 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index 479c706..c9ddb83 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include "cacheinfo.h" @@ -760,6 +761,42 @@ static void create_svm_file(void) } #endif /* CONFIG_PPC_SVM */ +static void read_idle_purr(void *val) +{ + u64 *ret = val; + + *ret = read_this_idle_purr(); +} + +static ssize_t idle_purr_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, dev); + u64 val; + + smp_call_function_single(cpu->dev.id, read_idle_purr, &val, 1); + return sprintf(buf, "%llx\n", val); +} +static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL); + +static void read_idle_spurr(void *val) +{ + u64 *ret = val; + + *ret = read_this_idle_spurr(); +} + +static ssize_t idle_spurr_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, dev); + u64 val; + + smp_call_function_single(cpu->dev.id, read_idle_spurr, &val, 1); + return sprintf(buf, "%llx\n", val); +} +static DEVICE_ATTR(idle_spurr, 0400, idle_spurr_show, NULL); + static int register_cpu_online(unsigned int cpu) { struct cpu *c = &per_cpu(cpu_devices, cpu); @@ -823,10 +860,15 @@ static int register_cpu_online(unsigned int cpu) if (!firmware_has_feature(FW_FEATURE_LPAR)) add_write_permission_dev_attr(&dev_attr_purr); device_create_file(s, &dev_attr_purr); + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_create_file(s, &dev_attr_idle_purr); } - if (cpu_has_feature(CPU_FTR_SPURR)) + if (cpu_has_feature(CPU_FTR_SPURR)) { device_create_file(s, &dev_attr_spurr); + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_create_file(s, &dev_attr_idle_spurr); + } if (cpu_has_feature(CPU_FTR_DSCR)) device_create_file(s, &dev_attr_dscr); @@ -910,11 +952,17 @@ static int unregister_cpu_online(unsigned int cpu) device_remove_file(s, &dev_attr_mmcra); #endif /* CONFIG_PMU_SYSFS */ - if (cpu_has_feature(CPU_FTR_PURR)) + if (cpu_has_feature(CPU_FTR_PURR)) { device_remove_file(s, &dev_attr_purr); + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_remove_file(s, &dev_attr_idle_purr); + } - if (cpu_has_feature(CPU_FTR_SPURR)) + if (cpu_has_feature(CPU_FTR_SPURR)) { device_remove_file(s, &dev_attr_spurr); + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_remove_file(s, &dev_attr_idle_spurr); + } if (cpu_has_feature(CPU_FTR_DSCR)) device_remove_file(s, &dev_attr_dscr); -- 1.9.4
[PATCH v3 2/6] powerpc/idle: Add accessor function to always read latest idle PURR
From: "Gautham R. Shenoy" Currently when CPU goes idle, we take a snapshot of PURR via pseries_idle_prolog() which is used at the CPU idle exit to compute the idle PURR cycles via the function pseries_idle_epilog(). Thus, the value of idle PURR cycle thus read before pseries_idle_prolog() and after pseries_idle_epilog() is always correct. However, if we were to read the idle PURR cycles from an interrupt context between pseries_idle_prolog() and pseries_idle_epilog() (this will be done in a future patch), then, the value of the idle PURR thus read will not include the cycles spent in the most recent idle period. This patch addresses the issue by providing accessor function to read the idle PURR such such that it includes the cycles spent in the most recent idle period, if we read it between pseries_idle_prolog() and pseries_idle_epilog(). In order to achieve it, the patch saves the snapshot of PURR in pseries_idle_prolog() in a per-cpu variable, instead of on the stack, so that it can be accessed from an interrupt context. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/idle.h| 46 +++--- arch/powerpc/platforms/pseries/setup.c | 7 +++--- drivers/cpuidle/cpuidle-pseries.c | 15 +-- 3 files changed, 46 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h index e838ea5..7552823 100644 --- a/arch/powerpc/include/asm/idle.h +++ b/arch/powerpc/include/asm/idle.h @@ -3,10 +3,27 @@ #define _ASM_POWERPC_IDLE_H #include -static inline void pseries_idle_prolog(unsigned long *in_purr) +DECLARE_PER_CPU(u64, idle_entry_purr_snap); + +static inline void snapshot_purr_idle_entry(void) +{ + *this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR); +} + +static inline void update_idle_purr_accounting(void) +{ + u64 wait_cycles; + u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap); + + wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); + wait_cycles += mfspr(SPRN_PURR) - in_purr; + get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); +} + +static inline void pseries_idle_prolog(void) { ppc64_runlatch_off(); - *in_purr = mfspr(SPRN_PURR); + snapshot_purr_idle_entry(); /* * Indicate to the HV that we are idle. Now would be * a good time to find other work to dispatch. @@ -14,15 +31,26 @@ static inline void pseries_idle_prolog(unsigned long *in_purr) get_lppaca()->idle = 1; } -static inline void pseries_idle_epilog(unsigned long in_purr) +static inline void pseries_idle_epilog(void) { - u64 wait_cycles; - - wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); - wait_cycles += mfspr(SPRN_PURR) - in_purr; - get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); + update_idle_purr_accounting(); get_lppaca()->idle = 0; - ppc64_runlatch_on(); } + +static inline u64 read_this_idle_purr(void) +{ + /* +* If we are reading from an idle context, update the +* idle-purr cycles corresponding to the last idle period. +* Since the idle context is not yet over, take a fresh +* snapshot of the idle-purr. +*/ + if (unlikely(get_lppaca()->idle == 1)) { + update_idle_purr_accounting(); + snapshot_purr_idle_entry(); + } + + return be64_to_cpu(get_lppaca()->wait_state_cycles); +} #endif diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 2f53e6b..4905c96 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -318,10 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void) } machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache); +DEFINE_PER_CPU(u64, idle_entry_purr_snap); static void pseries_lpar_idle(void) { - unsigned long in_purr; - /* * Default handler to go into low thread priority and possibly * low power mode by ceding processor to hypervisor @@ -331,7 +330,7 @@ static void pseries_lpar_idle(void) return; /* Indicate to hypervisor that we are idle. */ - pseries_idle_prolog(&in_purr); + pseries_idle_prolog(); /* * Yield the processor to the hypervisor. We return if @@ -342,7 +341,7 @@ static void pseries_lpar_idle(void) */ cede_processor(); - pseries_idle_epilog(in_purr); + pseries_idle_epilog(); } /* diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 46d5e05..6513ef2 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -36,12 +36,11 @@ static int snooze_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { - unsig
[PATCH v3 3/6] powerpc/pseries: Account for SPURR ticks on idle CPUs
From: "Gautham R. Shenoy" On Pseries LPARs, to calculate utilization, we need to know the [S]PURR ticks when the CPUs were busy or idle. Via pseries_idle_prolog(), pseries_idle_epilog(), we track the idle PURR ticks in the VPA variable "wait_state_cycles". This patch extends the support to account for the idle SPURR ticks. It also provides an accessor function to accurately reads idle SPURR ticks. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/idle.h| 33 + arch/powerpc/platforms/pseries/setup.c | 2 ++ 2 files changed, 35 insertions(+) diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h index 7552823..a375589 100644 --- a/arch/powerpc/include/asm/idle.h +++ b/arch/powerpc/include/asm/idle.h @@ -3,13 +3,20 @@ #define _ASM_POWERPC_IDLE_H #include +DECLARE_PER_CPU(u64, idle_spurr_cycles); DECLARE_PER_CPU(u64, idle_entry_purr_snap); +DECLARE_PER_CPU(u64, idle_entry_spurr_snap); static inline void snapshot_purr_idle_entry(void) { *this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR); } +static inline void snapshot_spurr_idle_entry(void) +{ + *this_cpu_ptr(&idle_entry_spurr_snap) = mfspr(SPRN_SPURR); +} + static inline void update_idle_purr_accounting(void) { u64 wait_cycles; @@ -20,10 +27,19 @@ static inline void update_idle_purr_accounting(void) get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); } +static inline void update_idle_spurr_accounting(void) +{ + u64 *idle_spurr_cycles_ptr = this_cpu_ptr(&idle_spurr_cycles); + u64 in_spurr = *this_cpu_ptr(&idle_entry_spurr_snap); + + *idle_spurr_cycles_ptr += mfspr(SPRN_SPURR) - in_spurr; +} + static inline void pseries_idle_prolog(void) { ppc64_runlatch_off(); snapshot_purr_idle_entry(); + snapshot_spurr_idle_entry(); /* * Indicate to the HV that we are idle. Now would be * a good time to find other work to dispatch. @@ -34,6 +50,7 @@ static inline void pseries_idle_prolog(void) static inline void pseries_idle_epilog(void) { update_idle_purr_accounting(); + update_idle_spurr_accounting(); get_lppaca()->idle = 0; ppc64_runlatch_on(); } @@ -53,4 +70,20 @@ static inline u64 read_this_idle_purr(void) return be64_to_cpu(get_lppaca()->wait_state_cycles); } + +static inline u64 read_this_idle_spurr(void) +{ + /* +* If we are reading from an idle context, update the +* idle-spurr cycles corresponding to the last idle period. +* Since the idle context is not yet over, take a fresh +* snapshot of the idle-spurr. +*/ + if (get_lppaca()->idle == 1) { + update_idle_spurr_accounting(); + snapshot_spurr_idle_entry(); + } + + return *this_cpu_ptr(&idle_spurr_cycles); +} #endif diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 4905c96..1b55e80 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -318,7 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void) } machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache); +DEFINE_PER_CPU(u64, idle_spurr_cycles); DEFINE_PER_CPU(u64, idle_entry_purr_snap); +DEFINE_PER_CPU(u64, idle_entry_spurr_snap); static void pseries_lpar_idle(void) { /* -- 1.9.4
[PATCH v3 5/6] Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr
From: "Gautham R. Shenoy" Add documentation for the following sysfs interfaces: /sys/devices/system/cpu/cpuX/purr /sys/devices/system/cpu/cpuX/spurr /sys/devices/system/cpu/cpuX/idle_purr /sys/devices/system/cpu/cpuX/idle_spurr Signed-off-by: Gautham R. Shenoy --- Documentation/ABI/testing/sysfs-devices-system-cpu | 39 ++ 1 file changed, 39 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 2e0e3b4..bc07677 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -580,3 +580,42 @@ Description: Secure Virtual Machine If 1, it means the system is using the Protected Execution Facility in POWER9 and newer processors. i.e., it is a Secure Virtual Machine. + +What: /sys/devices/system/cpu/cpuX/purr +Date: Apr 2005 +Contact: Linux for PowerPC mailing list +Description: PURR ticks for this CPU since the system boot. + + The Processor Utilization Resources Register (PURR) is + a 64-bit counter which provides an estimate of the + resources used by the CPU thread. The contents of this + register increases monotonically. This sysfs interface + exposes the number of PURR ticks for cpuX. + +What: /sys/devices/system/cpu/cpuX/spurr +Date: Dec 2006 +Contact: Linux for PowerPC mailing list +Description: SPURR ticks for this CPU since the system boot. + + The Scaled Processor Utilization Resources Register + (SPURR) is a 64-bit counter that provides a frequency + invariant estimate of the resources used by the CPU + thread. The contents of this register increases + monotonically. This sysfs interface exposes the number + of SPURR ticks for cpuX. + +What: /sys/devices/system/cpu/cpuX/idle_purr +Date: Mar 2020 +Contact: Linux for PowerPC mailing list +Description: PURR ticks for cpuX when it was idle. + + This sysfs interface exposes the number of PURR ticks + for cpuX when it was idle. + +What: /sys/devices/system/cpu/cpuX/idle_spurr +Date: Mar 2020 +Contact: Linux for PowerPC mailing list +Description: SPURR ticks for cpuX when it was idle. + + This sysfs interface exposes the number of SPURR ticks + for cpuX when it was idle. -- 1.9.4
[PATCH v3 1/6] powerpc: Move idle_loop_prolog()/epilog() functions to header file
From: "Gautham R. Shenoy" Currently prior to entering an idle state on a Linux Guest, the pseries cpuidle driver implement an idle_loop_prolog() and idle_loop_epilog() functions which ensure that idle_purr is correctly computed, and the hypervisor is informed that the CPU cycles have been donated. These prolog and epilog functions are also required in the default idle call, i.e pseries_lpar_idle(). Hence move these accessor functions to a common header file and call them from pseries_lpar_idle(). Since the existing header files such as asm/processor.h have enough clutter, create a new header file asm/idle.h. Finally rename idle_loop_prolog() and idle_loop_epilog() to pseries_idle_prolog() and pseries_idle_epilog() as they are only relavent for on pseries guests. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/idle.h| 28 ++ arch/powerpc/platforms/pseries/setup.c | 7 +-- drivers/cpuidle/cpuidle-pseries.c | 36 +++--- 3 files changed, 40 insertions(+), 31 deletions(-) create mode 100644 arch/powerpc/include/asm/idle.h diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h new file mode 100644 index 000..e838ea5 --- /dev/null +++ b/arch/powerpc/include/asm/idle.h @@ -0,0 +1,28 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _ASM_POWERPC_IDLE_H +#define _ASM_POWERPC_IDLE_H +#include + +static inline void pseries_idle_prolog(unsigned long *in_purr) +{ + ppc64_runlatch_off(); + *in_purr = mfspr(SPRN_PURR); + /* +* Indicate to the HV that we are idle. Now would be +* a good time to find other work to dispatch. +*/ + get_lppaca()->idle = 1; +} + +static inline void pseries_idle_epilog(unsigned long in_purr) +{ + u64 wait_cycles; + + wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); + wait_cycles += mfspr(SPRN_PURR) - in_purr; + get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); + get_lppaca()->idle = 0; + + ppc64_runlatch_on(); +} +#endif diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 0c8421d..2f53e6b 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -68,6 +68,7 @@ #include #include #include +#include #include #include @@ -319,6 +320,8 @@ static int alloc_dispatch_log_kmem_cache(void) static void pseries_lpar_idle(void) { + unsigned long in_purr; + /* * Default handler to go into low thread priority and possibly * low power mode by ceding processor to hypervisor @@ -328,7 +331,7 @@ static void pseries_lpar_idle(void) return; /* Indicate to hypervisor that we are idle. */ - get_lppaca()->idle = 1; + pseries_idle_prolog(&in_purr); /* * Yield the processor to the hypervisor. We return if @@ -339,7 +342,7 @@ static void pseries_lpar_idle(void) */ cede_processor(); - get_lppaca()->idle = 0; + pseries_idle_epilog(in_purr); } /* diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 74c2479..46d5e05 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -19,6 +19,7 @@ #include #include #include +#include #include struct cpuidle_driver pseries_idle_driver = { @@ -31,29 +32,6 @@ struct cpuidle_driver pseries_idle_driver = { static u64 snooze_timeout __read_mostly; static bool snooze_timeout_en __read_mostly; -static inline void idle_loop_prolog(unsigned long *in_purr) -{ - ppc64_runlatch_off(); - *in_purr = mfspr(SPRN_PURR); - /* -* Indicate to the HV that we are idle. Now would be -* a good time to find other work to dispatch. -*/ - get_lppaca()->idle = 1; -} - -static inline void idle_loop_epilog(unsigned long in_purr) -{ - u64 wait_cycles; - - wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); - wait_cycles += mfspr(SPRN_PURR) - in_purr; - get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); - get_lppaca()->idle = 0; - - ppc64_runlatch_on(); -} - static int snooze_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) @@ -63,7 +41,7 @@ static int snooze_loop(struct cpuidle_device *dev, set_thread_flag(TIF_POLLING_NRFLAG); - idle_loop_prolog(&in_purr); + pseries_idle_prolog(&in_purr); local_irq_enable(); snooze_exit_time = get_tb() + snooze_timeout; @@ -87,7 +65,7 @@ static int snooze_loop(struct cpuidle_device *dev, local_irq_disable(); - idle_loop_epilog(in_purr); + pseries_idle_epilog(in_purr); return index; } @@ -115,7 +93,7 @@ static int dedicated_cede_loop(struct c
[PATCH v3 0/6] Track and expose idle PURR and SPURR ticks
From: "Gautham R. Shenoy" Hi, This is the third version of the patches to track and expose idle PURR and SPURR ticks. These patches are required by tools such as lparstat to compute system utilization for capacity planning purposes. The previous versions can be found here: v2: https://lkml.org/lkml/2020/2/21/21 v1: https://lore.kernel.org/patchwork/cover/1159341/ They key changes from v2 are: - The prolog and epilog functions have been named pseries_idle_prolog() and pseries_idle_epilog() respectively to indicate their pseries specific nature. - Fixed the Documentation for /sys/devices/system/cpu/cpuX/idle_spurr as pointed out by Nathan Lynch. - Introduces a patch (Patch 6/6) to send an IPI in order to read and cache the values of purr, spurr, idle_purr and idle_spurr of the target CPU when any one of them is read via sysfs. These cached values will be presented if any of these sysfs are read within the next 10ms. If these sysfs files are read after 10ms from the earlier IPI, a fresh IPI is issued to read and cache the values again. This minimizes the number of IPIs required to be sent when these values are read back-to-back via the sysfs interface. Test-results: While reading the four sysfs files back-to-back for a given CPU every second for 100 seconds. Without patch 6/6 (Without caching): 16 [XICS 2 Edge IPI] = 422 times DBL [Doorbell interrupts] = 13 times Total : 435 IPIs. With patch 6/6 (With caching): 16 [XICS 2 Edge IPI] = 111 times DBL [Doorbell interrupts] = 17 times Total : 128 IPIs. Motivation: === On PSeries LPARs, the data centers planners desire a more accurate view of system utilization per resource such as CPU to plan the system capacity requirements better. Such accuracy can be obtained by reading PURR/SPURR registers for CPU resource utilization. Tools such as lparstat which are used to compute the utilization need to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR counters are already exposed through sysfs. We already account for PURR ticks when we go to idle so that we can update the VPA area. This patchset extends support to account for SPURR ticks when idle, and expose both via per-cpu sysfs files. These patches are required for enhancement to the lparstat utility that compute the CPU utilization based on PURR and SPURR which can be found here : https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4 With the patches, when lparstat is run on a LPAR running CPU-Hogs, = sudo ./src/lparstat -E 1 3 System Configuration type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 1 99.99 0.00 3.35GHz[111%] 110.99 0.00 2 100.00 0.00 3.35GHz[111%] 111.00 0.00 3 100.00 0.00 3.35GHz[111%] 111.00 0.00 = When lparstat is run on an LPAR that is idle, = $ sudo ./src/lparstat -E 1 3 System Configuration type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 1 0.71 99.30 2.18GHz[ 72%] 0.53 71.48 2 0.56 99.44 2.11GHz[ 70%] 0.43 69.57 3 0.54 99.46 2.11GHz[ 70%] 0.43 69.57 ===== Gautham R. Shenoy (6): powerpc: Move idle_loop_prolog()/epilog() functions to header file powerpc/idle: Add accessor function to always read latest idle PURR powerpc/pseries: Account for SPURR ticks on idle CPUs powerpc/sysfs: Show idle_purr and idle_spurr for every CPU Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr Documentation/ABI/testing/sysfs-devices-system-cpu | 39 ++ arch/powerpc/include/asm/idle.h| 89 ++ arch/powerpc/kernel/sysfs.c| 133 +++-- arch/powerpc/platforms/pseries/setup.c | 8 +- drivers/cpuidle/cpuidle-pseries.c | 39 ++ 5 files changed, 267 insertions(+), 41 deletions(-) create mode 100644 arch/powerpc/include/asm/idle.h -- 1.9.4
[PATCH v3 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr
From: "Gautham R. Shenoy" Currently purr, spurr, idle_purr, idle_spurr are exposed for every CPU via the sysfs interface /sys/devices/system/cpu/cpuX/[idle_][s]purr. Each sysfs read currently generates an IPI to obtain the desired value from the target CPU X. Since these aforementioned sysfs files are typically read one after another, we end up generating 4 IPIs per CPU in a short duration. In order to minimize the IPI noise, this patch caches the values of all the four entities whenever one of them is read. If subsequently any of these are read within the next 10ms, the cached value is returned. With this, we will generate at most one IPI every 10ms for every CPU. Test-results: While reading the four sysfs files back-to-back for a given CPU every second for 100 seconds. Without the patch: 16 [XICS 2 Edge IPI] = 422 times DBL [Doorbell interrupts] = 13 times Total : 435 IPIs. With the patch: 16 [XICS 2 Edge IPI] = 111 times DBL [Doorbell interrupts] = 17 times Total : 128 IPIs. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kernel/sysfs.c | 109 1 file changed, 90 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index c9ddb83..db8fc90 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -586,8 +586,6 @@ void ppc_enable_pmcs(void) * SPRs which are not related to PMU. */ #ifdef CONFIG_PPC64 -SYSFS_SPRSETUP(purr, SPRN_PURR); -SYSFS_SPRSETUP(spurr, SPRN_SPURR); SYSFS_SPRSETUP(pir, SPRN_PIR); SYSFS_SPRSETUP(tscr, SPRN_TSCR); @@ -596,8 +594,6 @@ void ppc_enable_pmcs(void) enable write when needed with a separate function. Lets be conservative and default to pseries. */ -static DEVICE_ATTR(spurr, 0400, show_spurr, NULL); -static DEVICE_ATTR(purr, 0400, show_purr, store_purr); static DEVICE_ATTR(pir, 0400, show_pir, NULL); static DEVICE_ATTR(tscr, 0600, show_tscr, store_tscr); #endif /* CONFIG_PPC64 */ @@ -761,39 +757,114 @@ static void create_svm_file(void) } #endif /* CONFIG_PPC_SVM */ -static void read_idle_purr(void *val) +/* + * The duration (in ms) from the last IPI to the target CPU until + * which a cached value of purr, spurr, idle_purr, idle_spurr can be + * reported to the user on a corresponding sysfs file read. Beyond + * this duration, fresh values need to be obtained by sending IPIs to + * the target CPU when the sysfs files are read. + */ +static unsigned long util_stats_staleness_tolerance_ms = 10; +struct util_acct_stats { + u64 latest_purr; + u64 latest_spurr; + u64 latest_idle_purr; + u64 latest_idle_spurr; + unsigned long last_update_jiffies; +}; + +DEFINE_PER_CPU(struct util_acct_stats, util_acct_stats); + +static void update_util_acct_stats(void *ptr) { - u64 *ret = val; + struct util_acct_stats *stats = ptr; - *ret = read_this_idle_purr(); + stats->latest_purr = mfspr(SPRN_PURR); + stats->latest_spurr = mfspr(SPRN_SPURR); + stats->latest_idle_purr = read_this_idle_purr(); + stats->latest_idle_spurr = read_this_idle_spurr(); + stats->last_update_jiffies = jiffies; } -static ssize_t idle_purr_show(struct device *dev, - struct device_attribute *attr, char *buf) +struct util_acct_stats *get_util_stats_ptr(int cpu) +{ + struct util_acct_stats *stats = per_cpu_ptr(&util_acct_stats, cpu); + unsigned long delta_jiffies; + + delta_jiffies = jiffies - stats->last_update_jiffies; + + /* +* If we have a recent enough data, reuse that instead of +* sending an IPI. +*/ + if (jiffies_to_msecs(delta_jiffies) < util_stats_staleness_tolerance_ms) + return stats; + + smp_call_function_single(cpu, update_util_acct_stats, stats, 1); + return stats; +} + +static ssize_t show_purr(struct device *dev, +struct device_attribute *attr, char *buf) { struct cpu *cpu = container_of(dev, struct cpu, dev); - u64 val; + struct util_acct_stats *stats; - smp_call_function_single(cpu->dev.id, read_idle_purr, &val, 1); - return sprintf(buf, "%llx\n", val); + stats = get_util_stats_ptr(cpu->dev.id); + return sprintf(buf, "%llx\n", stats->latest_purr); } -static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL); -static void read_idle_spurr(void *val) +static void write_purr(void *val) { - u64 *ret = val; + mtspr(SPRN_PURR, *(unsigned long *)val); +} - *ret = read_this_idle_spurr(); +static ssize_t __used store_purr(struct device *dev, +struct device_attribute *attr, +const char *buf, size_t count) +{ + struct cpu *cpu = container_of(dev, struct cpu,
Re: [PATCH] cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn
On Mon, Mar 16, 2020 at 07:27:43PM +0530, Pratik Rajesh Sampat wrote: > The patch avoids allocating cpufreq_policy on stack hence fixing frame > size overflow in 'powernv_cpufreq_work_fn' > Thanks for fixing this. > Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur > on unthrottling") > Signed-off-by: Pratik Rajesh Sampat Reviewed-by: Gautham R. Shenoy > --- > drivers/cpufreq/powernv-cpufreq.c | 13 - > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > b/drivers/cpufreq/powernv-cpufreq.c > index 56f4bc0d209e..20ee0661555a 100644 > --- a/drivers/cpufreq/powernv-cpufreq.c > +++ b/drivers/cpufreq/powernv-cpufreq.c > @@ -902,6 +902,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = { > void powernv_cpufreq_work_fn(struct work_struct *work) > { > struct chip *chip = container_of(work, struct chip, throttle); > + struct cpufreq_policy *policy; > unsigned int cpu; > cpumask_t mask; > > @@ -916,12 +917,14 @@ void powernv_cpufreq_work_fn(struct work_struct *work) > chip->restore = false; > for_each_cpu(cpu, &mask) { > int index; > - struct cpufreq_policy policy; > > - cpufreq_get_policy(&policy, cpu); > - index = cpufreq_table_find_index_c(&policy, policy.cur); > - powernv_cpufreq_target_index(&policy, index); > - cpumask_andnot(&mask, &mask, policy.cpus); > + policy = cpufreq_cpu_get(cpu); > + if (!policy) > + continue; > + index = cpufreq_table_find_index_c(policy, policy->cur); > + powernv_cpufreq_target_index(policy, index); > + cpumask_andnot(&mask, &mask, policy->cpus); > + cpufreq_cpu_put(policy); > } > out: > put_online_cpus(); > -- > 2.24.1 >
[PATCH v4 0/6] [PATCH v4 0/6] Track and expose idle PURR and SPURR ticks
From: "Gautham R. Shenoy" Hi, This is the fourth version of the patches to track and expose idle PURR and SPURR ticks. These patches are required by tools such as lparstat to compute system utilization for capacity planning purposes. The previous versions can be found here: v3: https://lkml.org/lkml/2020/3/11/331 v2: https://lkml.org/lkml/2020/2/21/21 v1: https://lore.kernel.org/patchwork/cover/1159341/ They key changes from v3 are: - Fixed the build errors on !CONFIG_PPC64 and !CONFIG_PPC_PSERIES configurations notified by the kbuild bot. Motivation: === On PSeries LPARs, the data centers planners desire a more accurate view of system utilization per resource such as CPU to plan the system capacity requirements better. Such accuracy can be obtained by reading PURR/SPURR registers for CPU resource utilization. Tools such as lparstat which are used to compute the utilization need to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR counters are already exposed through sysfs. We already account for PURR ticks when we go to idle so that we can update the VPA area. This patchset extends support to account for SPURR ticks when idle, and expose both via per-cpu sysfs files. This patch series also introduces a patch (Patch 6/6) to send an IPI in order to read and cache the values of purr, spurr, idle_purr and idle_spurr of the target CPU when any one of them is read via sysfs. These cached values will be presented if any of these sysfs are read within the next 10ms. If these sysfs files are read after 10ms from the earlier IPI, a fresh IPI is issued to read and cache the values again. This minimizes the number of IPIs required to be sent when these values are read back-to-back via the sysfs interface. Without patch 6/6 (Without caching): 16 [XICS 2 Edge IPI] = 422 times DBL [Doorbell interrupts] = 13 times Total : 435 IPIs. With patch 6/6 (With caching): 16 [XICS 2 Edge IPI] = 111 times DBL [Doorbell interrupts] = 17 times Total : 128 IPIs. These patches are required for enhancement to the lparstat utility that compute the CPU utilization based on PURR and SPURR which can be found here : https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4 With the patches, when lparstat is run on a LPAR running CPU-Hogs, = sudo ./src/lparstat -E 1 3 System Configuration type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 99.99 0.00 3.35GHz[111%] 110.99 0.00 100.00 0.00 3.35GHz[111%] 111.00 0.00 100.00 0.00 3.35GHz[111%] 111.00 0.00 With patches, when lparstat is run on and idle LPAR = ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 0.20 99.81 2.17GHz[ 72%] 0.19 71.82 0.42 99.58 2.11GHz[ 70%] 0.31 69.69 0.41 99.59 2.11GHz[ 70%] 0.31 69.69 Gautham R. Shenoy (6): powerpc: Move idle_loop_prolog()/epilog() functions to header file powerpc/idle: Add accessor function to always read latest idle PURR powerpc/pseries: Account for SPURR ticks on idle CPUs powerpc/sysfs: Show idle_purr and idle_spurr for every CPU Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr Documentation/ABI/testing/sysfs-devices-system-cpu | 39 + arch/powerpc/include/asm/idle.h| 93 arch/powerpc/kernel/sysfs.c| 167 - arch/powerpc/platforms/pseries/setup.c | 8 +- drivers/cpuidle/cpuidle-pseries.c | 39 + 5 files changed, 305 insertions(+), 41 deletions(-) create mode 100644 arch/powerpc/include/asm/idle.h -- 1.9.4
[PATCH v4 2/6] powerpc/idle: Add accessor function to always read latest idle PURR
From: "Gautham R. Shenoy" Currently when CPU goes idle, we take a snapshot of PURR via pseries_idle_prolog() which is used at the CPU idle exit to compute the idle PURR cycles via the function pseries_idle_epilog(). Thus, the value of idle PURR cycle thus read before pseries_idle_prolog() and after pseries_idle_epilog() is always correct. However, if we were to read the idle PURR cycles from an interrupt context between pseries_idle_prolog() and pseries_idle_epilog() (this will be done in a future patch), then, the value of the idle PURR thus read will not include the cycles spent in the most recent idle period. This patch addresses the issue by providing accessor function to read the idle PURR such such that it includes the cycles spent in the most recent idle period, if we read it between pseries_idle_prolog() and pseries_idle_epilog(). In order to achieve it, the patch saves the snapshot of PURR in pseries_idle_prolog() in a per-cpu variable, instead of on the stack, so that it can be accessed from an interrupt context. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/idle.h| 47 +++--- arch/powerpc/platforms/pseries/setup.c | 7 +++-- drivers/cpuidle/cpuidle-pseries.c | 15 +-- 3 files changed, 47 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h index 32064a4c..d4bfb6a 100644 --- a/arch/powerpc/include/asm/idle.h +++ b/arch/powerpc/include/asm/idle.h @@ -5,10 +5,27 @@ #include #ifdef CONFIG_PPC_PSERIES -static inline void pseries_idle_prolog(unsigned long *in_purr) +DECLARE_PER_CPU(u64, idle_entry_purr_snap); + +static inline void snapshot_purr_idle_entry(void) +{ + *this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR); +} + +static inline void update_idle_purr_accounting(void) +{ + u64 wait_cycles; + u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap); + + wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); + wait_cycles += mfspr(SPRN_PURR) - in_purr; + get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); +} + +static inline void pseries_idle_prolog(void) { ppc64_runlatch_off(); - *in_purr = mfspr(SPRN_PURR); + snapshot_purr_idle_entry(); /* * Indicate to the HV that we are idle. Now would be * a good time to find other work to dispatch. @@ -16,16 +33,28 @@ static inline void pseries_idle_prolog(unsigned long *in_purr) get_lppaca()->idle = 1; } -static inline void pseries_idle_epilog(unsigned long in_purr) +static inline void pseries_idle_epilog(void) { - u64 wait_cycles; - - wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); - wait_cycles += mfspr(SPRN_PURR) - in_purr; - get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); + update_idle_purr_accounting(); get_lppaca()->idle = 0; - ppc64_runlatch_on(); } + +static inline u64 read_this_idle_purr(void) +{ + /* +* If we are reading from an idle context, update the +* idle-purr cycles corresponding to the last idle period. +* Since the idle context is not yet over, take a fresh +* snapshot of the idle-purr. +*/ + if (unlikely(get_lppaca()->idle == 1)) { + update_idle_purr_accounting(); + snapshot_purr_idle_entry(); + } + + return be64_to_cpu(get_lppaca()->wait_state_cycles); +} + #endif /* CONFIG_PPC_PSERIES */ #endif diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 2f53e6b..4905c96 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -318,10 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void) } machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache); +DEFINE_PER_CPU(u64, idle_entry_purr_snap); static void pseries_lpar_idle(void) { - unsigned long in_purr; - /* * Default handler to go into low thread priority and possibly * low power mode by ceding processor to hypervisor @@ -331,7 +330,7 @@ static void pseries_lpar_idle(void) return; /* Indicate to hypervisor that we are idle. */ - pseries_idle_prolog(&in_purr); + pseries_idle_prolog(); /* * Yield the processor to the hypervisor. We return if @@ -342,7 +341,7 @@ static void pseries_lpar_idle(void) */ cede_processor(); - pseries_idle_epilog(in_purr); + pseries_idle_epilog(); } /* diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 46d5e05..6513ef2 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -36,12 +36,11 @@ static int snooze_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv,
[PATCH v4 3/6] powerpc/pseries: Account for SPURR ticks on idle CPUs
From: "Gautham R. Shenoy" On Pseries LPARs, to calculate utilization, we need to know the [S]PURR ticks when the CPUs were busy or idle. Via pseries_idle_prolog(), pseries_idle_epilog(), we track the idle PURR ticks in the VPA variable "wait_state_cycles". This patch extends the support to account for the idle SPURR ticks. It also provides an accessor function to accurately reads idle SPURR ticks. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/idle.h| 33 + arch/powerpc/platforms/pseries/setup.c | 2 ++ 2 files changed, 35 insertions(+) diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h index d4bfb6a..accd1f5 100644 --- a/arch/powerpc/include/asm/idle.h +++ b/arch/powerpc/include/asm/idle.h @@ -5,13 +5,20 @@ #include #ifdef CONFIG_PPC_PSERIES +DECLARE_PER_CPU(u64, idle_spurr_cycles); DECLARE_PER_CPU(u64, idle_entry_purr_snap); +DECLARE_PER_CPU(u64, idle_entry_spurr_snap); static inline void snapshot_purr_idle_entry(void) { *this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR); } +static inline void snapshot_spurr_idle_entry(void) +{ + *this_cpu_ptr(&idle_entry_spurr_snap) = mfspr(SPRN_SPURR); +} + static inline void update_idle_purr_accounting(void) { u64 wait_cycles; @@ -22,10 +29,19 @@ static inline void update_idle_purr_accounting(void) get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); } +static inline void update_idle_spurr_accounting(void) +{ + u64 *idle_spurr_cycles_ptr = this_cpu_ptr(&idle_spurr_cycles); + u64 in_spurr = *this_cpu_ptr(&idle_entry_spurr_snap); + + *idle_spurr_cycles_ptr += mfspr(SPRN_SPURR) - in_spurr; +} + static inline void pseries_idle_prolog(void) { ppc64_runlatch_off(); snapshot_purr_idle_entry(); + snapshot_spurr_idle_entry(); /* * Indicate to the HV that we are idle. Now would be * a good time to find other work to dispatch. @@ -36,6 +52,7 @@ static inline void pseries_idle_prolog(void) static inline void pseries_idle_epilog(void) { update_idle_purr_accounting(); + update_idle_spurr_accounting(); get_lppaca()->idle = 0; ppc64_runlatch_on(); } @@ -56,5 +73,21 @@ static inline u64 read_this_idle_purr(void) return be64_to_cpu(get_lppaca()->wait_state_cycles); } +static inline u64 read_this_idle_spurr(void) +{ + /* +* If we are reading from an idle context, update the +* idle-spurr cycles corresponding to the last idle period. +* Since the idle context is not yet over, take a fresh +* snapshot of the idle-spurr. +*/ + if (get_lppaca()->idle == 1) { + update_idle_spurr_accounting(); + snapshot_spurr_idle_entry(); + } + + return *this_cpu_ptr(&idle_spurr_cycles); +} + #endif /* CONFIG_PPC_PSERIES */ #endif diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 4905c96..1b55e80 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -318,7 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void) } machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache); +DEFINE_PER_CPU(u64, idle_spurr_cycles); DEFINE_PER_CPU(u64, idle_entry_purr_snap); +DEFINE_PER_CPU(u64, idle_entry_spurr_snap); static void pseries_lpar_idle(void) { /* -- 1.9.4
[PATCH v4 1/6] powerpc: Move idle_loop_prolog()/epilog() functions to header file
From: "Gautham R. Shenoy" Currently prior to entering an idle state on a Linux Guest, the pseries cpuidle driver implement an idle_loop_prolog() and idle_loop_epilog() functions which ensure that idle_purr is correctly computed, and the hypervisor is informed that the CPU cycles have been donated. These prolog and epilog functions are also required in the default idle call, i.e pseries_lpar_idle(). Hence move these accessor functions to a common header file and call them from pseries_lpar_idle(). Since the existing header files such as asm/processor.h have enough clutter, create a new header file asm/idle.h. Finally rename idle_loop_prolog() and idle_loop_epilog() to pseries_idle_prolog() and pseries_idle_epilog() as they are only relavent for on pseries guests. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/idle.h| 31 + arch/powerpc/platforms/pseries/setup.c | 7 +-- drivers/cpuidle/cpuidle-pseries.c | 36 +++--- 3 files changed, 43 insertions(+), 31 deletions(-) create mode 100644 arch/powerpc/include/asm/idle.h diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h new file mode 100644 index 000..32064a4c --- /dev/null +++ b/arch/powerpc/include/asm/idle.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _ASM_POWERPC_IDLE_H +#define _ASM_POWERPC_IDLE_H +#include +#include + +#ifdef CONFIG_PPC_PSERIES +static inline void pseries_idle_prolog(unsigned long *in_purr) +{ + ppc64_runlatch_off(); + *in_purr = mfspr(SPRN_PURR); + /* +* Indicate to the HV that we are idle. Now would be +* a good time to find other work to dispatch. +*/ + get_lppaca()->idle = 1; +} + +static inline void pseries_idle_epilog(unsigned long in_purr) +{ + u64 wait_cycles; + + wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); + wait_cycles += mfspr(SPRN_PURR) - in_purr; + get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); + get_lppaca()->idle = 0; + + ppc64_runlatch_on(); +} +#endif /* CONFIG_PPC_PSERIES */ +#endif diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 0c8421d..2f53e6b 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -68,6 +68,7 @@ #include #include #include +#include #include #include @@ -319,6 +320,8 @@ static int alloc_dispatch_log_kmem_cache(void) static void pseries_lpar_idle(void) { + unsigned long in_purr; + /* * Default handler to go into low thread priority and possibly * low power mode by ceding processor to hypervisor @@ -328,7 +331,7 @@ static void pseries_lpar_idle(void) return; /* Indicate to hypervisor that we are idle. */ - get_lppaca()->idle = 1; + pseries_idle_prolog(&in_purr); /* * Yield the processor to the hypervisor. We return if @@ -339,7 +342,7 @@ static void pseries_lpar_idle(void) */ cede_processor(); - get_lppaca()->idle = 0; + pseries_idle_epilog(in_purr); } /* diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 74c2479..46d5e05 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -19,6 +19,7 @@ #include #include #include +#include #include struct cpuidle_driver pseries_idle_driver = { @@ -31,29 +32,6 @@ struct cpuidle_driver pseries_idle_driver = { static u64 snooze_timeout __read_mostly; static bool snooze_timeout_en __read_mostly; -static inline void idle_loop_prolog(unsigned long *in_purr) -{ - ppc64_runlatch_off(); - *in_purr = mfspr(SPRN_PURR); - /* -* Indicate to the HV that we are idle. Now would be -* a good time to find other work to dispatch. -*/ - get_lppaca()->idle = 1; -} - -static inline void idle_loop_epilog(unsigned long in_purr) -{ - u64 wait_cycles; - - wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); - wait_cycles += mfspr(SPRN_PURR) - in_purr; - get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); - get_lppaca()->idle = 0; - - ppc64_runlatch_on(); -} - static int snooze_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) @@ -63,7 +41,7 @@ static int snooze_loop(struct cpuidle_device *dev, set_thread_flag(TIF_POLLING_NRFLAG); - idle_loop_prolog(&in_purr); + pseries_idle_prolog(&in_purr); local_irq_enable(); snooze_exit_time = get_tb() + snooze_timeout; @@ -87,7 +65,7 @@ static int snooze_loop(struct cpuidle_device *dev, local_irq_disable(); - idle_loop_epilog(in_purr); + pseries_idle_epilog(in_purr);
[PATCH v4 5/6] Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr
From: "Gautham R. Shenoy" Add documentation for the following sysfs interfaces: /sys/devices/system/cpu/cpuX/purr /sys/devices/system/cpu/cpuX/spurr /sys/devices/system/cpu/cpuX/idle_purr /sys/devices/system/cpu/cpuX/idle_spurr Signed-off-by: Gautham R. Shenoy --- Documentation/ABI/testing/sysfs-devices-system-cpu | 39 ++ 1 file changed, 39 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 2e0e3b4..bc07677 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -580,3 +580,42 @@ Description: Secure Virtual Machine If 1, it means the system is using the Protected Execution Facility in POWER9 and newer processors. i.e., it is a Secure Virtual Machine. + +What: /sys/devices/system/cpu/cpuX/purr +Date: Apr 2005 +Contact: Linux for PowerPC mailing list +Description: PURR ticks for this CPU since the system boot. + + The Processor Utilization Resources Register (PURR) is + a 64-bit counter which provides an estimate of the + resources used by the CPU thread. The contents of this + register increases monotonically. This sysfs interface + exposes the number of PURR ticks for cpuX. + +What: /sys/devices/system/cpu/cpuX/spurr +Date: Dec 2006 +Contact: Linux for PowerPC mailing list +Description: SPURR ticks for this CPU since the system boot. + + The Scaled Processor Utilization Resources Register + (SPURR) is a 64-bit counter that provides a frequency + invariant estimate of the resources used by the CPU + thread. The contents of this register increases + monotonically. This sysfs interface exposes the number + of SPURR ticks for cpuX. + +What: /sys/devices/system/cpu/cpuX/idle_purr +Date: Mar 2020 +Contact: Linux for PowerPC mailing list +Description: PURR ticks for cpuX when it was idle. + + This sysfs interface exposes the number of PURR ticks + for cpuX when it was idle. + +What: /sys/devices/system/cpu/cpuX/idle_spurr +Date: Mar 2020 +Contact: Linux for PowerPC mailing list +Description: SPURR ticks for cpuX when it was idle. + + This sysfs interface exposes the number of SPURR ticks + for cpuX when it was idle. -- 1.9.4
[PATCH v4 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr
From: "Gautham R. Shenoy" Currently purr, spurr, idle_purr, idle_spurr are exposed for every CPU via the sysfs interface /sys/devices/system/cpu/cpuX/[idle_][s]purr. Each sysfs read currently generates an IPI to obtain the desired value from the target CPU X. Since these aforementioned sysfs are typically read one after another, we end up generating 4 IPIs per CPU in a short duration. In order to minimize the IPI noise, this patch caches the values of all the four entities whenever one of them is read. If subsequently any of these are read within the next 10ms, the cached value is returned. With this, we will generate at most one IPI every 10ms for every CPU. Test-results: While reading the four sysfs files back-to-back for a given CPU every second for 100 seconds. Without the patch: 16 [XICS 2 Edge IPI] = 422 times DBL [Doorbell interrupts] = 13 times Total : 435 IPIs. With the patch: 16 [XICS 2 Edge IPI] = 111 times DBL [Doorbell interrupts] = 17 times Total : 128 IPIs. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kernel/sysfs.c | 117 1 file changed, 97 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index 571b325..bd92023 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -586,8 +586,6 @@ void ppc_enable_pmcs(void) * SPRs which are not related to PMU. */ #ifdef CONFIG_PPC64 -SYSFS_SPRSETUP(purr, SPRN_PURR); -SYSFS_SPRSETUP(spurr, SPRN_SPURR); SYSFS_SPRSETUP(pir, SPRN_PIR); SYSFS_SPRSETUP(tscr, SPRN_TSCR); @@ -596,8 +594,6 @@ void ppc_enable_pmcs(void) enable write when needed with a separate function. Lets be conservative and default to pseries. */ -static DEVICE_ATTR(spurr, 0400, show_spurr, NULL); -static DEVICE_ATTR(purr, 0400, show_purr, store_purr); static DEVICE_ATTR(pir, 0400, show_pir, NULL); static DEVICE_ATTR(tscr, 0600, show_tscr, store_tscr); #endif /* CONFIG_PPC64 */ @@ -761,22 +757,110 @@ static void create_svm_file(void) } #endif /* CONFIG_PPC_SVM */ +#ifdef CONFIG_PPC64 +/* + * The duration (in ms) from the last IPI to the target CPU until + * which a cached value of purr, spurr, idle_purr, idle_spurr can be + * reported to the user on a corresponding sysfs file read. Beyond + * this duration, fresh values need to be obtained by sending IPIs to + * the target CPU when the sysfs files are read. + */ +static unsigned long util_stats_staleness_tolerance_ms = 10; +struct util_acct_stats { + u64 latest_purr; + u64 latest_spurr; +#ifdef CONFIG_PPC_PSERIES + u64 latest_idle_purr; + u64 latest_idle_spurr; +#endif + unsigned long last_update_jiffies; +}; + +DEFINE_PER_CPU(struct util_acct_stats, util_acct_stats); + +static void update_util_acct_stats(void *ptr) +{ + struct util_acct_stats *stats = ptr; + + stats->latest_purr = mfspr(SPRN_PURR); + stats->latest_spurr = mfspr(SPRN_SPURR); #ifdef CONFIG_PPC_PSERIES -static void read_idle_purr(void *val) + stats->latest_idle_purr = read_this_idle_purr(); + stats->latest_idle_spurr = read_this_idle_spurr(); +#endif + stats->last_update_jiffies = jiffies; +} + +struct util_acct_stats *get_util_stats_ptr(int cpu) +{ + struct util_acct_stats *stats = per_cpu_ptr(&util_acct_stats, cpu); + unsigned long delta_jiffies; + + delta_jiffies = jiffies - stats->last_update_jiffies; + + /* +* If we have a recent enough data, reuse that instead of +* sending an IPI. +*/ + if (jiffies_to_msecs(delta_jiffies) < util_stats_staleness_tolerance_ms) + return stats; + + smp_call_function_single(cpu, update_util_acct_stats, stats, 1); + return stats; +} + +static ssize_t show_purr(struct device *dev, +struct device_attribute *attr, char *buf) { - u64 *ret = val; + struct cpu *cpu = container_of(dev, struct cpu, dev); + struct util_acct_stats *stats; - *ret = read_this_idle_purr(); + stats = get_util_stats_ptr(cpu->dev.id); + return sprintf(buf, "%llx\n", stats->latest_purr); } +static void write_purr(void *val) +{ + mtspr(SPRN_PURR, *(unsigned long *)val); +} + +static ssize_t __used store_purr(struct device *dev, +struct device_attribute *attr, +const char *buf, size_t count) +{ + struct cpu *cpu = container_of(dev, struct cpu, dev); + unsigned long val; + int ret = kstrtoul(buf, 16, &val); + + if (ret != 0) + return -EINVAL; + + smp_call_function_single(cpu->dev.id, write_purr, &val, 1); + return count; +} +static DEVICE_ATTR(purr, 0400, show_purr, store_purr); + +static ssize_t show_spurr(struct device *dev, +
[PATCH v4 4/6] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
From: "Gautham R. Shenoy" On Pseries LPARs, to calculate utilization, we need to know the [S]PURR ticks when the CPUs were busy or idle. The total PURR and SPURR ticks are already exposed via the per-cpu sysfs files "purr" and "spurr". This patch adds support for exposing the idle PURR and SPURR ticks via new per-cpu sysfs files named "idle_purr" and "idle_spurr". Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kernel/sysfs.c | 82 +++-- 1 file changed, 79 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index 479c706..571b325 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include "cacheinfo.h" @@ -760,6 +761,74 @@ static void create_svm_file(void) } #endif /* CONFIG_PPC_SVM */ +#ifdef CONFIG_PPC_PSERIES +static void read_idle_purr(void *val) +{ + u64 *ret = val; + + *ret = read_this_idle_purr(); +} + +static ssize_t idle_purr_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, dev); + u64 val; + + smp_call_function_single(cpu->dev.id, read_idle_purr, &val, 1); + return sprintf(buf, "%llx\n", val); +} +static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL); + +static void create_idle_purr_file(struct device *s) +{ + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_create_file(s, &dev_attr_idle_purr); +} + +static void remove_idle_purr_file(struct device *s) +{ + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_remove_file(s, &dev_attr_idle_purr); +} + +static void read_idle_spurr(void *val) +{ + u64 *ret = val; + + *ret = read_this_idle_spurr(); +} + +static ssize_t idle_spurr_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, dev); + u64 val; + + smp_call_function_single(cpu->dev.id, read_idle_spurr, &val, 1); + return sprintf(buf, "%llx\n", val); +} +static DEVICE_ATTR(idle_spurr, 0400, idle_spurr_show, NULL); + +static void create_idle_spurr_file(struct device *s) +{ + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_create_file(s, &dev_attr_idle_spurr); +} + +static void remove_idle_spurr_file(struct device *s) +{ + if (firmware_has_feature(FW_FEATURE_LPAR)) + device_remove_file(s, &dev_attr_idle_spurr); +} + +#else /* CONFIG_PPC_PSERIES */ +#define create_idle_purr_file(s) +#define remove_idle_purr_file(s) +#define create_idle_spurr_file(s) +#define remove_idle_spurr_file(s) +#endif /* CONFIG_PPC_PSERIES */ + static int register_cpu_online(unsigned int cpu) { struct cpu *c = &per_cpu(cpu_devices, cpu); @@ -823,10 +892,13 @@ static int register_cpu_online(unsigned int cpu) if (!firmware_has_feature(FW_FEATURE_LPAR)) add_write_permission_dev_attr(&dev_attr_purr); device_create_file(s, &dev_attr_purr); + create_idle_purr_file(s); } - if (cpu_has_feature(CPU_FTR_SPURR)) + if (cpu_has_feature(CPU_FTR_SPURR)) { device_create_file(s, &dev_attr_spurr); + create_idle_spurr_file(s); + } if (cpu_has_feature(CPU_FTR_DSCR)) device_create_file(s, &dev_attr_dscr); @@ -910,11 +982,15 @@ static int unregister_cpu_online(unsigned int cpu) device_remove_file(s, &dev_attr_mmcra); #endif /* CONFIG_PMU_SYSFS */ - if (cpu_has_feature(CPU_FTR_PURR)) + if (cpu_has_feature(CPU_FTR_PURR)) { device_remove_file(s, &dev_attr_purr); + remove_idle_purr_file(s); + } - if (cpu_has_feature(CPU_FTR_SPURR)) + if (cpu_has_feature(CPU_FTR_SPURR)) { device_remove_file(s, &dev_attr_spurr); + remove_idle_spurr_file(s); + } if (cpu_has_feature(CPU_FTR_DSCR)) device_remove_file(s, &dev_attr_dscr); -- 1.9.4
[RFC/PATCH 1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.
From: "Gautham R. Shenoy" If a guest executes a stop instruction when the hypervisor has set the PSSCR[ESL|EC] bits, the processor will throw an Hypervisor Facility Unavailable exception. Currently when we receive this exception, we only check if the exeception is generated due to a doorbell instruction, in which case we emulate it. For all other cases, including the case when the guest executes a stop-instruction, the hypervisor sends a PROGILL to the guest program, which results in a guest crash. This patch adds code to handle the case when the hypervisor receives a H_FAC_UNAVAIL exception due to guest executing the stop instruction. The hypervisor increments the pc to the next instruction and resumes the guest as expected by the semantics of the PSSCR[ESL|EC] = 0 stop instruction. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/include/asm/reg.h | 1 + arch/powerpc/kvm/book3s_hv.c | 6 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index da5cab0..2568c18 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -399,6 +399,7 @@ /* HFSCR and FSCR bit numbers are the same */ #define FSCR_SCV_LG12 /* Enable System Call Vectored */ #define FSCR_MSGP_LG 10 /* Enable MSGP */ +#define FSCR_STOP_LG9 /* Enable stop states */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ #define FSCR_TM_LG 5 /* Enable Transactional Memory */ diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 33be4d9..cdb7224 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -1419,7 +1419,11 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu, if (((vcpu->arch.hfscr >> 56) == FSCR_MSGP_LG) && cpu_has_feature(CPU_FTR_ARCH_300)) r = kvmppc_emulate_doorbell_instr(vcpu); - if (r == EMULATE_FAIL) { + else if (((vcpu->arch.hfscr >> 56) == FSCR_STOP_LG) && + cpu_has_feature(CPU_FTR_ARCH_300)) { + kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4); + r = RESUME_GUEST; + } else if (r == EMULATE_FAIL) { kvmppc_core_queue_program(vcpu, SRR1_PROGILL); r = RESUME_GUEST; } -- 1.9.4
[RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
From: "Gautham R. Shenoy" ISA v3.0 allows the guest to execute a stop instruction. For this, the PSSCR[ESL|EC] bits need to be cleared by the hypervisor before scheduling in the guest vCPU. Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits set. This patch changes the behaviour to enter the guest with PSSCR[ESL|EC] bits cleared. This is a RFC patch where we unconditionally clear these bits. Ideally this should be done conditionally on platforms where the guest stop instruction has no Bugs (starting POWER9 DD2.3). Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kvm/book3s_hv.c| 2 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 + 2 files changed, 14 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index cdb7224..36d059a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit, mtspr(SPRN_IC, vcpu->arch.ic); mtspr(SPRN_PID, vcpu->arch.pid); - mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC | + mtspr(SPRN_PSSCR, (vcpu->arch.psscr & ~(PSSCR_EC | PSSCR_ESL)) | (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG)); mtspr(SPRN_HFSCR, vcpu->arch.hfscr); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index dbc2fec..c2daec3 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S) mtspr SPRN_PID, r7 mtspr SPRN_WORT, r8 BEGIN_FTR_SECTION + /* POWER9-only registers */ + ld r5, VCPU_TID(r4) + ld r6, VCPU_PSSCR(r4) + lbz r8, HSTATE_FAKE_SUSPEND(r13) + lis r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */ + andcr6, r6, r7 + rldimi r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG + ld r7, VCPU_HFSCR(r4) + mtspr SPRN_TIDR, r5 + mtspr SPRN_PSSCR, r6 + mtspr SPRN_HFSCR, r7 +FTR_SECTION_ELSE /* POWER8-only registers */ ld r5, VCPU_TCSCR(r4) ld r6, VCPU_ACOP(r4) @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION mtspr SPRN_CSIGR, r7 mtspr SPRN_TACR, r8 nop -FTR_SECTION_ELSE - /* POWER9-only registers */ - ld r5, VCPU_TID(r4) - ld r6, VCPU_PSSCR(r4) - lbz r8, HSTATE_FAKE_SUSPEND(r13) - orisr6, r6, PSSCR_EC@h /* This makes stop trap to HV */ - rldimi r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG - ld r7, VCPU_HFSCR(r4) - mtspr SPRN_TIDR, r5 - mtspr SPRN_PSSCR, r6 - mtspr SPRN_HFSCR, r7 -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300) +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300) 8: ld r5, VCPU_SPRG0(r4) -- 1.9.4
[RFC/PATCH 0/3] Add support for stop instruction inside KVM guest
From: "Gautham R. Shenoy" *** RFC Only. Not intended for inclusion Motivation ~~~ The POWER ISA v3.0 allows stop instruction to be executed from a Guest Kernel (HV=0,PR=0) context. If the hypervisor has cleared PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause the vCPU thread to "pause", thereby donating its cycles to the other threads in the core until the paused thread is woken up by an interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then execution of the "stop" instruction will raise a Hypervisor Facility Unavailable exception. The stop idle state in the guest (henceforth referred to as stop0lite) when enabled * has a very small wakeup latency (1-3us) comparable to that of snooze and considerably better compared the Shared CEDE state (25-30us). Results are provided below for wakeup latency measured by waking up an idle CPU in a given state using a timer as well as using an IPI. == Wakeup Latency measured using a timer (in ns) [Lower is better] == Idle state | Nr samples | Min| Max| Median | Avg | Stddev| == snooze | 60| 787| 1059 | 938 | 937.4 | 42.27 | == stop0lite | 60| 770| 1182 | 948 | 946.4 | 67.41 | == Shared CEDE| 60| 9550| 36694 | 29219 |28564.1|3545.9 | == == Wakeup Latency measured using a timer (in ns) [Lower is better] == Idle state | Nr samples | Min| Max| Median | Avg | Stddev| == snooze | 60| 787| 1059 | 938 | 937.4 | 42.27 | == stop0lite | 60| 770| 1182 | 948 | 946.4 | 67.41 | == Shared CEDE| 60| 9550| 36694 | 29219 |28564.1|3545.9 | == * provides an improved single threaded performance compared to snooze since the idle state completely relinquishes the core cycles. The single threaded performance is observed to be better even when compared to "Shared CEDE", since in the latter case something else can scheduled on the ceded CPU, while "stop0lite" doesn't give up the CPU. On a KVM guest with smp 8,sockets=1,cores=2,threads=4 with vCPUs of a vCore bound to a physical core, we run a single-threaded ebizzy pinned to one of the guest vCPUs while the sibling vCPUs in the core are idling. We enable only one guest idle state at a time to measure the single-threaded performance benefit that the idle state provides by giving up the core resources to the non-idle thread. we obtain ~13% improvement in the throughput compared to that with "snooze" and ~8% improvement in the throughput compared to "Shared CEDE". === | ebizzy records/s : [Higher the better] | === |Idle state | Nr| Min| Max| Median | Avg | Stddev | | |samples | ||| | | === |snooze | 10 | 1378988| 1379358| 1379032|1379067.3|113.47| === |stop0lite | 10 | 1561836| 1562058| 1561906|1561927.5| 81.87| === |Shared CEDE| 10 | 1446584| 1447383| 1447037|1447009.0|244.16| === Is stop0lite a replacement for snooze ? ~~~ Not yet. snooze is a polling state, and can respond much faster to a need_resched() compared to stop0lite which needs an IPI to wakeup from idle state. This can be seen in the results below: With the context_switch2 pipe test, we can see that with stop0lite, the number of context switches are 32.47% lesser than with snooze. This is due to the fact that snooze is a polling state which polls for TIF_NEED_RESCHED. Thus it does not require an interrupt t
[RFC/PATCH 3/3] cpuidle/pseries: Add stop0lite state
From: "Gautham R. Shenoy" The POWER ISA v3.0 allows stop instruction to be executed from a HV=0,PR=0 context. If the PSSCR[ESL|EC] bits are cleared, then the stop instruction thus executed will cause the thread to pause, thereby donating its cycles to the other threads in the core until the paused thread is woken up by an interrupt. In this patch we define a cpuidle state for pseries guests named stop0lite. This has a latency and residency intermediate to that of snooze and CEDE. While snooze has non-existent latency, it consumes the CPU cycles without contributing to anything useful. CEDE on the other hand requires a full VM exit, which can result in some other vCPU being scheduled on this physical CPU thereby delaying the scheduling of the CEDEd vCPU back. In such cases, when the expected idle duration is small (1-20us), the vCPU can go to this stop0lite state which provides a nice intermediate state between snooze and CEDE. Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 74c2479..9c8c18d 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -20,6 +20,7 @@ #include #include #include +#include struct cpuidle_driver pseries_idle_driver = { .name = "pseries_idle", @@ -170,6 +171,26 @@ static int shared_cede_loop(struct cpuidle_device *dev, .enter = &dedicated_cede_loop }, }; + + +static int stop_loop(struct cpuidle_device *dev, +struct cpuidle_driver *drv, +int index) +{ + unsigned long srr1 = 0; + + if (!prep_irq_for_idle_irqsoff()) + return index; + + __ppc64_runlatch_off(); + asm volatile("stop"); + __ppc64_runlatch_on(); + fini_irq_for_idle_irqsoff(); + irq_set_pending_from_srr1(srr1); + + return index; +} + /* * States for shared partition case. */ @@ -180,6 +201,12 @@ static int shared_cede_loop(struct cpuidle_device *dev, .exit_latency = 0, .target_residency = 0, .enter = &snooze_loop }, + { /* stop0_lite */ + .name = "stop0lite", + .desc = "Pauses the CPU", + .exit_latency = 2, + .target_residency=20, + .enter = &stop_loop }, { /* Shared Cede */ .name = "Shared Cede", .desc = "Shared Cede", -- 1.9.4
Re: [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest
On Tue, Mar 31, 2020 at 05:40:55PM +0530, Gautham R. Shenoy wrote: > From: "Gautham R. Shenoy" > > > *** RFC Only. Not intended for inclusion > > Motivation > ~~~ > > The POWER ISA v3.0 allows stop instruction to be executed from a Guest > Kernel (HV=0,PR=0) context. If the hypervisor has cleared > PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause > the vCPU thread to "pause", thereby donating its cycles to the other > threads in the core until the paused thread is woken up by an > interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then > execution of the "stop" instruction will raise a Hypervisor Facility > Unavailable exception. > > The stop idle state in the guest (henceforth referred to as stop0lite) > when enabled > > * has a very small wakeup latency (1-3us) comparable to that of > snooze and considerably better compared the Shared CEDE state > (25-30us). Results are provided below for wakeup latency measured > by waking up an idle CPU in a given state using a timer as well as > using an IPI. > > == > Wakeup Latency measured using a timer (in ns) [Lower is better] > == > Idle state | Nr samples | Min| Max| Median | Avg | Stddev| > == > snooze | 60| 787| 1059 | 938 | 937.4 | 42.27 | > == > stop0lite | 60| 770| 1182 | 948 | 946.4 | 67.41 | > == > Shared CEDE| 60| 9550| 36694 | 29219 |28564.1|3545.9 | > == > Posted two copies of Wakeup latency measured by timer. Here is the wakeup latency measured using an IPI. == Wakeup latency measured using an IPI (in ns) [Lower is better] == Idle state | Nr| Min| Max| Median | Avg | Stddev | |samples | ||| | | -- snooze | 60 | 2089|4228|2259| 2342.31|316.56| -- stop0lite | 60 | 1947|3674|2653| 2610.57|266.73| -- Shared CEDE| 60 |20147| 36305| 21827| 26762.65| 6875.01| == -- Thanks and Regards gautham.
Re: [PATCH v4 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr
Hello Naveen, On Wed, Apr 01, 2020 at 03:28:48PM +0530, Naveen N. Rao wrote: > Gautham R. Shenoy wrote: > >From: "Gautham R. Shenoy" > > [..snip..] > >-static DEVICE_ATTR(spurr, 0400, show_spurr, NULL); > >-static DEVICE_ATTR(purr, 0400, show_purr, store_purr); > > static DEVICE_ATTR(pir, 0400, show_pir, NULL); > > static DEVICE_ATTR(tscr, 0600, show_tscr, store_tscr); > > #endif /* CONFIG_PPC64 */ > >@@ -761,22 +757,110 @@ static void create_svm_file(void) > > } > > #endif /* CONFIG_PPC_SVM */ > > > >+#ifdef CONFIG_PPC64 > >+/* > >+ * The duration (in ms) from the last IPI to the target CPU until > >+ * which a cached value of purr, spurr, idle_purr, idle_spurr can be > >+ * reported to the user on a corresponding sysfs file read. Beyond > >+ * this duration, fresh values need to be obtained by sending IPIs to > >+ * the target CPU when the sysfs files are read. > >+ */ > >+static unsigned long util_stats_staleness_tolerance_ms = 10; > > This is a nice optimization for our use in lparstat, though I have a concern > below. > > >+struct util_acct_stats { > >+u64 latest_purr; > >+u64 latest_spurr; > >+#ifdef CONFIG_PPC_PSERIES > >+u64 latest_idle_purr; > >+u64 latest_idle_spurr; > >+#endif > > You can probably drop the 'latest_' prefix. Sure. > > >+unsigned long last_update_jiffies; > >+}; > >+ > >+DEFINE_PER_CPU(struct util_acct_stats, util_acct_stats); > > Per snowpatch, this should be static, and so should get_util_stats_ptr() > below: > https://openpower.xyz/job/snowpatch/job/snowpatch-linux-sparse/16601//artifact/linux/report.txt Ok, will fix this in v5. > > >+ > >+static void update_util_acct_stats(void *ptr) > >+{ > >+struct util_acct_stats *stats = ptr; > >+ > >+stats->latest_purr = mfspr(SPRN_PURR); > >+stats->latest_spurr = mfspr(SPRN_SPURR); > > #ifdef CONFIG_PPC_PSERIES > >-static void read_idle_purr(void *val) > >+stats->latest_idle_purr = read_this_idle_purr(); > >+stats->latest_idle_spurr = read_this_idle_spurr(); > >+#endif > >+stats->last_update_jiffies = jiffies; > >+} > >+ > >+struct util_acct_stats *get_util_stats_ptr(int cpu) > >+{ > >+struct util_acct_stats *stats = per_cpu_ptr(&util_acct_stats, cpu); > >+unsigned long delta_jiffies; > >+ > >+delta_jiffies = jiffies - stats->last_update_jiffies; > >+ > >+/* > >+ * If we have a recent enough data, reuse that instead of > >+ * sending an IPI. > >+ */ > >+if (jiffies_to_msecs(delta_jiffies) < util_stats_staleness_tolerance_ms) > >+return stats; > >+ > >+smp_call_function_single(cpu, update_util_acct_stats, stats, 1); > >+return stats; > >+} > >+ > >+static ssize_t show_purr(struct device *dev, > >+ struct device_attribute *attr, char *buf) > > { > >-u64 *ret = val; > >+struct cpu *cpu = container_of(dev, struct cpu, dev); > >+struct util_acct_stats *stats; > > > >-*ret = read_this_idle_purr(); > >+stats = get_util_stats_ptr(cpu->dev.id); > >+return sprintf(buf, "%llx\n", stats->latest_purr); > > This alters the behavior of the current sysfs purr file. I am not sure if it > is reasonable to return the same PURR value across a 10ms window. It does reduce it to 10ms window. I am not sure if anyone samples PURR etc faster than that rate. I measured how much time it takes to read the purr, spurr, idle_purr, idle_spurr files back-to-back. It takes not more than 150us. From lparstat will these values be read back-to-back ? If so, we can reduce the staleness_tolerance to something like 500us and still avoid extra IPIs. If not, what is the maximum delay between the first sysfs file read and the last sysfs file read ? > > I wonder if we should introduce a sysctl interface to control thresholding. > It can default to 0, which disables thresholding so that the existing > behavior continues. Applications (lparstat) can optionally set it to suit > their use. We would be introducing 3 new sysfs interfaces that way instead of two. /sys/devices/system/cpu/purr_spurr_staleness /sys/devices/system/cpu/cpuX/idle_purr /sys/devices/system/cpu/cpuX/idle_spurr I don't have a problem with this. Nathan, Michael, thoughts on this? The alternative is to have a procfs interface, something like /proc/powerpc/resource_util_stats which gives a listing similar to /proc/stat, i.e CPUX Even in this case, the values can be obtained in one-shot with a single IPI and be printed in the row corresponding to the CPU. > > - Naveen > -- Thanks and Regards gautham.
Re: [PATCH v4 2/6] powerpc/idle: Add accessor function to always read latest idle PURR
On Wed, Apr 01, 2020 at 03:12:53PM +0530, Naveen N. Rao wrote: > Hi Gautham, > > Gautham R. Shenoy wrote: > >From: "Gautham R. Shenoy" > > > >Currently when CPU goes idle, we take a snapshot of PURR via > >pseries_idle_prolog() which is used at the CPU idle exit to compute > >the idle PURR cycles via the function pseries_idle_epilog(). Thus, > >the value of idle PURR cycle thus read before pseries_idle_prolog() and > >after pseries_idle_epilog() is always correct. > > > >However, if we were to read the idle PURR cycles from an interrupt > >context between pseries_idle_prolog() and pseries_idle_epilog() (this will > >be done in a future patch), then, the value of the idle PURR thus read > >will not include the cycles spent in the most recent idle period. > > > >This patch addresses the issue by providing accessor function to read > >the idle PURR such such that it includes the cycles spent in the most > >recent idle period, if we read it between pseries_idle_prolog() and > >pseries_idle_epilog(). In order to achieve it, the patch saves the > >snapshot of PURR in pseries_idle_prolog() in a per-cpu variable, > >instead of on the stack, so that it can be accessed from an interrupt > >context. > > > >Signed-off-by: Gautham R. Shenoy > >--- > > arch/powerpc/include/asm/idle.h| 47 > > +++--- > > arch/powerpc/platforms/pseries/setup.c | 7 +++-- > > drivers/cpuidle/cpuidle-pseries.c | 15 +-- > > 3 files changed, 47 insertions(+), 22 deletions(-) > > > >diff --git a/arch/powerpc/include/asm/idle.h > >b/arch/powerpc/include/asm/idle.h > >index 32064a4c..d4bfb6a 100644 > >--- a/arch/powerpc/include/asm/idle.h > >+++ b/arch/powerpc/include/asm/idle.h > >@@ -5,10 +5,27 @@ > > #include > > > > #ifdef CONFIG_PPC_PSERIES > >-static inline void pseries_idle_prolog(unsigned long *in_purr) > >+DECLARE_PER_CPU(u64, idle_entry_purr_snap); > >+ > >+static inline void snapshot_purr_idle_entry(void) > >+{ > >+*this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR); > >+} > >+ > >+static inline void update_idle_purr_accounting(void) > >+{ > >+u64 wait_cycles; > >+u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap); > >+ > >+wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); > >+wait_cycles += mfspr(SPRN_PURR) - in_purr; > >+get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); > >+} > >+ > >+static inline void pseries_idle_prolog(void) > > { > > ppc64_runlatch_off(); > >-*in_purr = mfspr(SPRN_PURR); > >+snapshot_purr_idle_entry(); > > /* > > * Indicate to the HV that we are idle. Now would be > > * a good time to find other work to dispatch. > >@@ -16,16 +33,28 @@ static inline void pseries_idle_prolog(unsigned long > >*in_purr) > > get_lppaca()->idle = 1; > > } > > > >-static inline void pseries_idle_epilog(unsigned long in_purr) > >+static inline void pseries_idle_epilog(void) > > { > >-u64 wait_cycles; > >- > >-wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles); > >-wait_cycles += mfspr(SPRN_PURR) - in_purr; > >-get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); > >+update_idle_purr_accounting(); > > get_lppaca()->idle = 0; > >- > > ppc64_runlatch_on(); > > } > >+ > >+static inline u64 read_this_idle_purr(void) > >+{ > >+/* > >+ * If we are reading from an idle context, update the > >+ * idle-purr cycles corresponding to the last idle period. > >+ * Since the idle context is not yet over, take a fresh > >+ * snapshot of the idle-purr. > >+ */ > >+if (unlikely(get_lppaca()->idle == 1)) { > >+update_idle_purr_accounting(); > >+snapshot_purr_idle_entry(); > >+} > >+ > >+return be64_to_cpu(get_lppaca()->wait_state_cycles); > >+} > >+ > > I think this and read_this_idle_spurr() from the next patch should be moved > to Patch 4/6, where they are actually used. The reason I included this function in this patch was to justify why we were introducing snapshotting the purr values in a global per-cpu variable instead of on a stack variable. The reason being that someone might want to read the PURR value from an interrupt context which had woken up the CPU from idle. At this point, since epilog() function wasn't called, the idle PURR count corresponding to
Re: [PATCH v4 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr
Hi Naveen, On Thu, Apr 02, 2020 at 01:04:34PM +0530, Naveen N. Rao wrote: [..snip..] > > > >It does reduce it to 10ms window. I am not sure if anyone samples PURR > >etc faster than that rate. > > > >I measured how much time it takes to read the purr, spurr, idle_purr, > >idle_spurr files back-to-back. It takes not more than 150us. From > >lparstat will these values be read back-to-back ? If so, we can reduce > >the staleness_tolerance to something like 500us and still avoid extra > >IPIs. If not, what is the maximum delay between the first sysfs file > >read and the last sysfs file read ? > > Oh, for lparstat usage, this is perfectly fine. > > I meant that there could be other users of [s]purr who might care. I don't > know of one, but since this is an existing sysfs interface, I wanted to > point out that the behavior might change. Fair point. Perhaps this should be documented in the Documentation, if we are going to continue with this patch. > > > > >> > >>I wonder if we should introduce a sysctl interface to control thresholding. > >>It can default to 0, which disables thresholding so that the existing > >>behavior continues. Applications (lparstat) can optionally set it to suit > >>their use. > > > >We would be introducing 3 new sysfs interfaces that way instead of > >two. > > > >/sys/devices/system/cpu/purr_spurr_staleness > >/sys/devices/system/cpu/cpuX/idle_purr > >/sys/devices/system/cpu/cpuX/idle_spurr > > > >I don't have a problem with this. Nathan, Michael, thoughts on this? > > > > > >The alternative is to have a procfs interface, something like > >/proc/powerpc/resource_util_stats > > > >which gives a listing similar to /proc/stat, i.e > > > > CPUX > > > >Even in this case, the values can be obtained in one-shot with a > >single IPI and be printed in the row corresponding to the CPU. > > Right -- and that would be optimal requiring a single system call, at the > cost of using a legacy interface. > > The other option would be to drop this patch and to just go with patches 1-5 > introducing the new sysfs interfaces for idle_[s]purr. It isn't entirely > clear how often this would be used, or its actual impact. We can perhaps > consider this optimization if and when this causes problems... I am ok with that. We can revisit the problem if IPI noise becomes noticable. However, if Nathan or Michael feel that this problem is better solved now, than leaving it for the future, we will have to take a call on what the interface is going to be. > > > Thanks, > Naveen > -- Thanks and Regards gautham.
Re: [PATCH v4 2/6] powerpc/idle: Add accessor function to always read latest idle PURR
On Fri, Apr 03, 2020 at 04:04:56PM +0530, Naveen N. Rao wrote: > Gautham R Shenoy wrote: > >On Wed, Apr 01, 2020 at 03:12:53PM +0530, Naveen N. Rao wrote: > >>Hi Gautham, > >> > >>Gautham R. Shenoy wrote: > >>>From: "Gautham R. Shenoy" > >>> > >>>+ > >>>+static inline u64 read_this_idle_purr(void) > >>>+{ > >>>+ /* > >>>+ * If we are reading from an idle context, update the > >>>+ * idle-purr cycles corresponding to the last idle period. > >>>+ * Since the idle context is not yet over, take a fresh > >>>+ * snapshot of the idle-purr. > >>>+ */ > >>>+ if (unlikely(get_lppaca()->idle == 1)) { > >>>+ update_idle_purr_accounting(); > >>>+ snapshot_purr_idle_entry(); > >>>+ } > >>>+ > >>>+ return be64_to_cpu(get_lppaca()->wait_state_cycles); > >>>+} > >>>+ > >> > >>I think this and read_this_idle_spurr() from the next patch should be moved > >>to Patch 4/6, where they are actually used. > > > >The reason I included this function in this patch was to justify why > >we were introducing snapshotting the purr values in a global per-cpu > >variable instead of on a stack variable. The reason being that someone > >might want to read the PURR value from an interrupt context which had > >woken up the CPU from idle. At this point, since epilog() function > >wasn't called, the idle PURR count corresponding to this latest idle > >period would have been accumulated in lppaca->wait_cycles. Thus, this > >helper function safely reads the value by > > 1) First updating the lppaca->wait_cycles with the latest idle_purr > > count. > > 2) Take a fresh snapshot, since the time from now to the epilog() > > call is also counted under idle CPU. So the PURR cycle increment > > during this short period should also be accumulated in > > lppaca->wait_cycles. > > > > > >prolog() > >|snapshot PURR > >| > >| > >| > >Idle > >| > >| <- Interrupt . Read idle PURR update idle PURR; > >| snapshot PURR; > >| Read idle PURR. | > >epilog() > > update idle PURR > > > > Yes, I understand. It makes sense. > > > > >However, if you feel that moving this function to Patch 4 where it is > >actually used makes it more readable, I can do that. > > My suggestion was from a bisectability standpoint though. This is a fairly > simple function, but it is generally recommended to ensure that newly added > code gets exercized in the patch that it is introduced in: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/5.Posting.rst#n119 > Fair point. Will move those functions to Patch 4. > > Regards, > Naveen >
Re: [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote: > Gautham R. Shenoy's on March 31, 2020 10:10 pm: > > From: "Gautham R. Shenoy" > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before > > scheduling in the guest vCPU. > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits > > set. This patch changes the behaviour to enter the guest with > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we > > unconditionally clear these bits. Ideally this should be done > > conditionally on platforms where the guest stop instruction has no > > Bugs (starting POWER9 DD2.3). > > How will guests know that they can use this facility safely after your > series? You need both DD2.3 and a patched KVM. Yes, this is something that isn't addressed in this series (mentioned in the cover letter), which is a POC demonstrating that the stop0lite state in guest works. However, to answer your question, this is the scheme that I had in mind : OPAL: On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest" Hypervisor Kernel: 1. If "idle-stop-guest" dt-cpu-feature is discovered, then we set bool enable_guest_stop = true; 2. During KVM guest entry, clear PSSCR[ESL|EC] iff enable_guest_stop == true. 3. In kvm_vm_ioctl_check_extension(), for a new capability KVM_CAP_STOP, return true iff enable_guest_top == true. QEMU: Check with the hypervisor if KVM_CAP_STOP is present. If so, indicate the presence to the guest via device tree. Guest Kernel: Check for the presence of guest stop state support in device-tree. If available, enable the stop0lite in the cpuidle driver. We still have a challenge of migrating a guest which started on a hypervisor supporting guest stop state to a hypervisor without it. The target hypervisor should atleast have Patch 1 of this series, so that we don't crash the guest. > > > > > Signed-off-by: Gautham R. Shenoy > > --- > > arch/powerpc/kvm/book3s_hv.c| 2 +- > > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 + > > 2 files changed, 14 insertions(+), 13 deletions(-) > > > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > > index cdb7224..36d059a 100644 > > --- a/arch/powerpc/kvm/book3s_hv.c > > +++ b/arch/powerpc/kvm/book3s_hv.c > > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu > > *vcpu, u64 time_limit, > > mtspr(SPRN_IC, vcpu->arch.ic); > > mtspr(SPRN_PID, vcpu->arch.pid); > > > > - mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC | > > + mtspr(SPRN_PSSCR, (vcpu->arch.psscr & ~(PSSCR_EC | PSSCR_ESL)) | > > (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG)); > > > > mtspr(SPRN_HFSCR, vcpu->arch.hfscr); > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > > index dbc2fec..c2daec3 100644 > > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S) > > mtspr SPRN_PID, r7 > > mtspr SPRN_WORT, r8 > > BEGIN_FTR_SECTION > > + /* POWER9-only registers */ > > + ld r5, VCPU_TID(r4) > > + ld r6, VCPU_PSSCR(r4) > > + lbz r8, HSTATE_FAKE_SUSPEND(r13) > > + lis r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */ > > + andcr6, r6, r7 > > + rldimi r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG > > + ld r7, VCPU_HFSCR(r4) > > + mtspr SPRN_TIDR, r5 > > + mtspr SPRN_PSSCR, r6 > > + mtspr SPRN_HFSCR, r7 > > +FTR_SECTION_ELSE > > Why did you move these around? Just because the POWER9 section became > larger than the other? Yes. > > That's a real wart in the instruction patching implementation, I think > we can fix it by padding with nops in the macros. > > Can you just add the additional required nops to the top branch without > changing them around for this patch, so it's easier to see what's going > on? The end result will be the same after patching. Actually changing > these around can have a slight unintended consequence in that code that > runs before features were patched will execute the IF code. Not a > problem here, but another reason why the instruction patching > restriction is annoying. Sure, I will repost this patch with additional nops instead of moving them around. &
[PATCH v2 0/1] pseries/hotplug: Change the default behaviour of cede_offline
From: "Gautham R. Shenoy" This is the v2 of the fix to change the default behaviour of cede_offline. The previous version can be found here: https://lkml.org/lkml/2019/9/12/222 The main change from v1 is that the patch2 to create a sysfs file to report and control the value of cede_offline_enabled has been dropped. Problem Description: Currently on Pseries Linux Guests, the offlined CPU can be put to one of the following two states: - Long term processor cede (also called extended cede) - Returned to the Hypervisor via RTAS "stop-self" call. This is controlled by the kernel boot parameter "cede_offline=on/off". By default the offlined CPUs enter extended cede. The PHYP hypervisor considers CPUs in extended cede to be "active" since the CPUs are still under the control fo the Linux Guests. Hence, when we change the SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will continue to count the values for offlined CPUs in extended cede as if they are online. One of the expectations with PURR is that the for an interval of time, the sum of the PURR increments across the online CPUs of a core should equal the number of timebase ticks for that interval. This is currently not the case. In the following data (Generated using https://github.com/gautshen/misc/blob/master/purr_tb.py): SD-PURR = Sum of PURR increments on online CPUs of that core in 1 second SMT=off === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0]51200 69883784 core01 [ 8]51200 88782536 core02 [ 16]51200 94296824 core03 [ 24]51200 80951968 SMT=2 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1] 51200 136147792 core01 [ 8,9] 51200 128636784 core02 [ 16,17] 51200 135426488 core03 [ 24,25] 51200 153027520 SMT=4 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1,2,3] 51200 258331616 core01 [ 8,9,10,11]51200 274220072 core02 [ 16,17,18,19] 51200 260013736 core03 [ 24,25,26,27] 51200 260079672 SMT=on === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1,2,3,4,5,6,7] 51200 512941248 core01 [ 8,9,10,11,12,13,14,15]51200 512936544 core02 [ 16,17,18,19,20,21,22,23] 51200 512931544 core03 [ 24,25,26,27,28,29,30,31] 51200 512923800 This patchset addresses this issue by ensuring that by default, the offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call by changing the default value of "cede_offline_enabled" to false. With the patches, we see that the observed value of the sum of the PURR increments across the the online threads of a core in 1-second matches the number of tb-ticks in 1-second. SMT=off === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0]51200512527568 core01 [ 8]51200512556128 core02 [ 16]51200512590016 core03 [ 24]51200512589440 SMT=2 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1] 51200 512635328 core01 [ 8,9] 51200 512610416 core02 [ 16,17] 51200 512639360 core03 [ 24,25] 51200 512638720 SMT=4 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1,2,3] 51200 512757328 core01 [ 8,9,10,11]51200 512727920 core02 [ 16,17,18,19] 51200 512754712 core03 [ 24,25,26,27] 51200 512739040 SMT=on == Core SD-PURR SD-PURR
[PATCH v2 1/1] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"
From: "Gautham R. Shenoy" Currently on PSeries Linux guests, the offlined CPU can be put to one of the following two states: - Long term processor cede (also called extended cede) - Returned to the hypervisor via RTAS "stop-self" call. This is controlled by the kernel boot parameter "cede_offline=on/off". By default the offlined CPUs enter extended cede. The PHYP hypervisor considers CPUs in extended cede to be "active" since they are still under the control fo the Linux guests. Hence, when we change the SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will continue to count the values for offlined CPUs in extended cede as if they are online. This breaks the accounting in tools such as lparstat. To fix this, ensure that by default the offlined CPUs are returned to the hypervisor via RTAS "stop-self" call by changing the default value of "cede_offline_enabled" to false. Fixes: commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") Signed-off-by: Gautham R. Shenoy --- Documentation/core-api/cpu_hotplug.rst | 2 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index 4a50ab7..5319593 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -53,7 +53,7 @@ Command Line Switches ``cede_offline={"off","on"}`` Use this option to disable/enable putting offlined processors to an extended ``H_CEDE`` state on supported pseries platforms. If nothing is specified, - ``cede_offline`` is set to "on". + ``cede_offline`` is set to "off". This option is limited to the PowerPC architecture. diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index bbda646..f9d0366 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; -static bool cede_offline_enabled __read_mostly = true; +/* + * Determines whether the offlined CPUs should be put to a long term + * processor cede (called extended cede) for power-saving + * purposes. The CPUs in extended cede are still with the Linux Guest + * and are not returned to the Hypervisor. + * + * By default, the offlined CPUs are returned to the hypervisor via + * RTAS "stop-self". This behaviour can be changed by passing the + * kernel commandline parameter "cede_offline=on". + */ +static bool cede_offline_enabled __read_mostly; /* * Enable/disable cede_offline when available. -- 1.9.4