from:"Gautham R Shenoy"

[PATCH] book3s_hv_rmhandlers:Pass the correct trap argument to kvmhv_commence_exit

2015-05-21 Thread Gautham R. Shenoy

In guest_exit_cont we call kvmhv_commence_exit which expects the trap
number as the argument. However r3 doesn't contain the trap number at
this point and as a result we would be calling the function with a
spurious trap number.

Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
r12 contains the trap number

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 4d70df2..f0d7c54 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1170,6 +1170,7 @@ mc_cont:
bl  kvmhv_accumulate_time
 #endif
 
+   mr  r3, r12
/* Increment exit count, poke other threads to exit */
bl  kvmhv_commence_exit
nop
-- 
1.9.3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc: Add an inline function to update HID0

2015-08-04 Thread Gautham R. Shenoy

Section 3.7 of Version 1.2 of the Power8 Processor User's Manual
prescribes that updates to HID0 be preceded by a SYNC instruction and
followed by an ISYNC instruction (Page 91).

Create a function name update_hid0() which follows this recipe and
invoke it from the static split core path.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/kvm_ppc.h   | 11 +++
 arch/powerpc/platforms/powernv/subcore.c |  4 ++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index c6ef05b..325f1d6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -685,4 +685,15 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu 
*vcpu, int ra, int rb)
 
 extern void xics_wake_cpu(int cpu);
 
+static inline void update_hid0(unsigned long hid0)
+{
+   /*
+*  The HID0 update should at the very least be preceded by a
+*  a SYNC instruction followed by an ISYNC instruction
+*/
+   mb();
+   mtspr(SPRN_HID0, hid0);
+   isync();
+}
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/platforms/powernv/subcore.c 
b/arch/powerpc/platforms/powernv/subcore.c
index f60f80a..37e77bf 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -190,7 +190,7 @@ static void unsplit_core(void)
 
hid0 = mfspr(SPRN_HID0);
hid0 &= ~HID0_POWER8_DYNLPARDIS;
-   mtspr(SPRN_HID0, hid0);
+   update_hid0(hid0);
update_hid_in_slw(hid0);
 
while (mfspr(SPRN_HID0) & mask)
@@ -227,7 +227,7 @@ static void split_core(int new_mode)
/* Write new mode */
hid0  = mfspr(SPRN_HID0);
hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value;
-   mtspr(SPRN_HID0, hid0);
+   update_hid0(hid0);
update_hid_in_slw(hid0);
 
/* Wait for it to happen */
-- 
1.9.3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc: Add an inline function to update HID0

2015-08-04 Thread Gautham R Shenoy

Hi Michael,

On Tue, Aug 04, 2015 at 08:08:58PM +1000, Michael Ellerman wrote:
> On Tue, 2015-04-08 at 08:30:58 UTC, "Gautham R. Shenoy" wrote:
> > Section 3.7 of Version 1.2 of the Power8 Processor User's Manual
> > prescribes that updates to HID0 be preceded by a SYNC instruction and
> > followed by an ISYNC instruction (Page 91).
> > 
> > Create a function name update_hid0() which follows this recipe and
> > invoke it from the static split core path.
> > 
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> >  arch/powerpc/include/asm/kvm_ppc.h   | 11 +++
> 
> Why is it in there? It's not KVM related per se.

Ok. Will fix this.
> 
> Where should it go? I think reg.h would be best, ideally near the definition
> for HID0, though that's probably not possible because of ASSEMBLY 
> requirements.
> So at the bottom of reg.h ?
> 
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> > b/arch/powerpc/include/asm/kvm_ppc.h
> > index c6ef05b..325f1d6 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -685,4 +685,15 @@ static inline ulong kvmppc_get_ea_indexed(struct 
> > kvm_vcpu *vcpu, int ra, int rb)
> >  
> >  extern void xics_wake_cpu(int cpu);
> >  
> > +static inline void update_hid0(unsigned long hid0)
> > +{
> > +   /*
> > +*  The HID0 update should at the very least be preceded by a
> > +*  a SYNC instruction followed by an ISYNC instruction
> > +*/
> > +   mb();
> > +   mtspr(SPRN_HID0, hid0);
> > +   isync();
> 
> That's going to turn into three separate inline asm blocks, which is maybe a
> bit unfortunate. Have you checked the generated code is what we want, ie. just
> sync, mtspr, isync ?
> 

Yes, the objdump of subcore.o shows exactly these three instructions:
 7c 00 04 ac sync
 7c 70 fb a6 mtspr   1008,r3
 4c 00 01 2c isync

> cheers
>
--
Thanks and Regards
gautham. 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2] powerpc: Add an inline function to update HID0

2015-08-04 Thread Gautham R. Shenoy

Section 3.7 of Version 1.2 of the Power8 Processor User's Manual
prescribes that updates to HID0 be preceded by a SYNC instruction and
followed by an ISYNC instruction (Page 91).

Create an inline function name update_hid0() which follows this recipe
and invoke it from the static split core path.

Signed-off-by: Gautham R. Shenoy 
---
[v1--> v2: Moved defn of update_hid0 to reg.h from kvm_ppc.h]

 arch/powerpc/include/asm/reg.h   | 13 +
 arch/powerpc/platforms/powernv/subcore.c |  4 ++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index af56b5c..beb98ac 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -12,6 +12,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 /* Pickup Book E specific registers. */
 #if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
@@ -1281,6 +1283,17 @@ struct pt_regs;
 
 extern void ppc_save_regs(struct pt_regs *regs);
 
+static inline void update_hid0(unsigned long hid0)
+{
+   /*
+*  The HID0 update should at the very least be preceded by a
+*  a SYNC instruction followed by an ISYNC instruction
+*/
+   mb();
+   mtspr(SPRN_HID0, hid0);
+   isync();
+}
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_REG_H */
diff --git a/arch/powerpc/platforms/powernv/subcore.c 
b/arch/powerpc/platforms/powernv/subcore.c
index f60f80a..37e77bf 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -190,7 +190,7 @@ static void unsplit_core(void)
 
hid0 = mfspr(SPRN_HID0);
hid0 &= ~HID0_POWER8_DYNLPARDIS;
-   mtspr(SPRN_HID0, hid0);
+   update_hid0(hid0);
update_hid_in_slw(hid0);
 
while (mfspr(SPRN_HID0) & mask)
@@ -227,7 +227,7 @@ static void split_core(int new_mode)
/* Write new mode */
hid0  = mfspr(SPRN_HID0);
hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value;
-   mtspr(SPRN_HID0, hid0);
+   update_hid0(hid0);
update_hid_in_slw(hid0);
 
/* Wait for it to happen */
-- 
1.9.3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc: Add an inline function to update HID0

2015-08-04 Thread Gautham R Shenoy

Hi Segher,

Thanks for the suggestions. I will rename the function to
update_power8_hid0() and use asm volatile.


On Tue, Aug 04, 2015 at 09:30:57PM -0500, Segher Boessenkool wrote:
> On Tue, Aug 04, 2015 at 08:08:58PM +1000, Michael Ellerman wrote:
> > > +static inline void update_hid0(unsigned long hid0)
> > > +{
> > > + /*
> > > +  *  The HID0 update should at the very least be preceded by a
> > > +  *  a SYNC instruction followed by an ISYNC instruction
> > > +  */
> > > + mb();
> > > + mtspr(SPRN_HID0, hid0);
> > > + isync();
> > 
> > That's going to turn into three separate inline asm blocks, which is maybe a
> > bit unfortunate. Have you checked the generated code is what we want, ie. 
> > just
> > sync, mtspr, isync ?
> 
> The "mb()" is not such a great name anyway: you don't want a memory
> barrier, you want an actual sync instruction ("sync 0", "hwsync",
> whatever the currently preferred spelling is).
> 
> The function name should also say this is for POWER8 (the required
> sequences are different for some other processors; and some others
> might not even _have_ a HID0, or not at 1008).  power8_write_hid0
> or such?
> 
> For writing it as one asm, why not just
> 
>   asm volatile("sync ; mtspr %0,%1 ; isync" : : "i"(SPRN_HID0), "r"(hid0));
> 
> instead of the stringify stuff?
> 
> 
> Segher
> 

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3] powerpc: Add an inline function to update POWER8 HID0

2015-08-05 Thread Gautham R. Shenoy

Section 3.7 of Version 1.2 of the Power8 Processor User's Manual
prescribes that updates to HID0 be preceded by a SYNC instruction and
followed by an ISYNC instruction (Page 91).

Create an inline function name update_power8_hid0() which follows this
recipe and invoke it from the static split core path.

Signed-off-by: Gautham R. Shenoy 
---
[v1 --> v2: Moved defn of update_hid0 to reg.h from kvm_ppc.h]
[v2 --> v3: Renamed to update_power8_hid0 and used asm volatile]

 arch/powerpc/include/asm/reg.h   | 9 +
 arch/powerpc/platforms/powernv/subcore.c | 4 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index af56b5c..1245d99 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1281,6 +1281,15 @@ struct pt_regs;
 
 extern void ppc_save_regs(struct pt_regs *regs);
 
+static inline void update_power8_hid0(unsigned long hid0)
+{
+   /*
+*  The HID0 update on Power8 should at the very least be
+*  preceded by a a SYNC instruction followed by an ISYNC
+*  instruction
+*/
+   asm volatile("sync; mtspr %0,%1; isync":: "i"(SPRN_HID0), "r"(hid0));
+}
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_REG_H */
diff --git a/arch/powerpc/platforms/powernv/subcore.c 
b/arch/powerpc/platforms/powernv/subcore.c
index f60f80a..503a73f 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -190,7 +190,7 @@ static void unsplit_core(void)
 
hid0 = mfspr(SPRN_HID0);
hid0 &= ~HID0_POWER8_DYNLPARDIS;
-   mtspr(SPRN_HID0, hid0);
+   update_power8_hid0(hid0);
update_hid_in_slw(hid0);
 
while (mfspr(SPRN_HID0) & mask)
@@ -227,7 +227,7 @@ static void split_core(int new_mode)
/* Write new mode */
hid0  = mfspr(SPRN_HID0);
hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value;
-   mtspr(SPRN_HID0, hid0);
+   update_power8_hid0(hid0);
update_hid_in_slw(hid0);
 
/* Wait for it to happen */
-- 
1.9.3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] book3s_hv_rmhandlers:Pass the correct trap argument to kvmhv_commence_exit

2015-08-16 Thread Gautham R Shenoy

Hi Sam,

On Fri, Aug 14, 2015 at 03:07:28PM +1000, Sam Bobroff wrote:
> On Thu, May 21, 2015 at 01:57:04PM +0530, Gautham R. Shenoy wrote:
> > In guest_exit_cont we call kvmhv_commence_exit which expects the trap
> > number as the argument. However r3 doesn't contain the trap number at
> > this point and as a result we would be calling the function with a
> > spurious trap number.
> > 
> > Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
> > r12 contains the trap number
> > 
> > Signed-off-by: Gautham R. Shenoy 
> 
> Hi Gautham,
> 
> I agree with your logic: r3 is quite clearly corrupted in that path. So:
> 
> Reviewed-by: Sam Bobroff 
> 
> Just one comment: Do you have a case of this causing some visible problem due
> to the corrupted trap number? (I'll test the patch if you do.)
> 

Actually no! I found this bug while reviewing the code for some other issue.

> Cheers,
> Sam.

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc/cpufreq: Add pr_warn() on OPAL firmware failures

2014-08-04 Thread Gautham R Shenoy

On Sun, Aug 03, 2014 at 02:54:05PM +0530, Vaidyanathan Srinivasan wrote:
> @@ -131,7 +136,12 @@ static unsigned int pstate_id_to_freq(int pstate_id)
>   int i;
> 
>   i = powernv_pstate_info.max - pstate_id;
> - BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
> + if (i >= powernv_pstate_info.nr_pstates || i < 0) {
> + pr_warn("PState id %d outside of PState table, "
> + "reporting nominal id %d instead\n",
> + pstate_id, powernv_pstate_info.nominal);
> + i = powernv_pstate_info.max - powernv_pstate_info.nominal;

As of now the default loglevel corresponds to KERN_WARNING so this
warning should get printed anyway. However, don't you think it would
be better if we make it a pr_err( ) since it's a platform error that's
causing the pstate_id to go out of bounds ?

Otherwise it looks ok.

Acked-by: Gautham R. Shenoy 

> + }
> 
>   return powernv_freqs[i].frequency;
>  }


--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 3/4] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.

2014-03-03 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Move the code that computes the cpumask corresponding to the
thread-siblings of a cpu to a new method named
powernv_cpu_to_core_mask() so that it can be used by other places in
the code.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index c59eb26..f0dae6f 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -166,6 +166,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = {
 
 /* Helper routines */
 
+/**
+ * Sets the bits corresponding to the thread-siblings of cpu in its core
+ * in 'cpus'.
+ */
+static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus)
+{
+   int base, i;
+
+   base = cpu_first_thread_sibling(cpu);
+
+   for (i = 0; i < threads_per_core; i++) {
+   cpumask_set_cpu(base + i, cpus);
+   }
+
+   return;
+}
+
 /* Access helpers to power mgt SPR */
 
 static inline unsigned long get_pmspr(unsigned long sprn)
@@ -231,13 +248,10 @@ static int powernv_set_freq(cpumask_var_t cpus, unsigned 
int new_index)
 
 static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 {
-   int base, i;
+   int i;
 
 #ifdef CONFIG_SMP
-   base = cpu_first_thread_sibling(policy->cpu);
-
-   for (i = 0; i < threads_per_core; i++)
-   cpumask_set_cpu(base + i, policy->cpus);
+   powernv_cpu_to_core_mask(policy->cpu, policy->cpus);
 #endif
policy->cpuinfo.transition_latency = 25000;
 
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/4] powernv:cpufreq: Create pstate_id_to_freq() helper

2014-03-03 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Create a helper routine that can return the cpu-frequency for the
corresponding pstate_id.

Also, cache the values of the pstate_max, pstate_min and
pstate_nominal and nr_pstates in a static structure so that they can
be reused in the future to perform any validations.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 345501e..d0a8dee 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -37,6 +37,15 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
 #define POWERNV_MAX_PSTATES256
 
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+struct powernv_pstate_info {
+   int pstate_min_id;
+   int pstate_max_id;
+   int pstate_nominal_id;
+   int nr_pstates;
+};
+
+static struct powernv_pstate_info powernv_pstate_info;
+
 
 /*
  * Initialize the freq table based on data obtained
@@ -106,9 +115,28 @@ static int init_powernv_pstates(void)
powernv_freqs[i].driver_data = 0;
powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
 
+   powernv_pstate_info.pstate_min_id = pstate_min;
+   powernv_pstate_info.pstate_max_id = pstate_max;
+   powernv_pstate_info.pstate_nominal_id = pstate_nominal;
+   powernv_pstate_info.nr_pstates = nr_pstates;
+
return 0;
 }
 
+/**
+ * Returns the cpu frequency corresponding to the pstate_id.
+ */
+static unsigned int pstate_id_to_freq(int pstate_id)
+{
+   int i;
+
+   i = powernv_pstate_info.pstate_max_id - pstate_id;
+
+   BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
+   WARN_ON(powernv_freqs[i].driver_data != pstate_id);
+   return powernv_freqs[i].frequency;
+}
+
 static struct freq_attr *powernv_cpu_freq_attr[] = {
&cpufreq_freq_attr_scaling_available_freqs,
NULL,
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/4] powernv:cpufreq: Export nominal frequency via sysfs.

2014-03-03 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Create a driver attribute named cpuinfo_nominal_frequency which
creates a sysfs read-only file named cpuinfo_nominal_frequency. Export
the frequency corresponding to the nominal_pstate through this
interface.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index d0a8dee..c59eb26 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -137,8 +137,30 @@ static unsigned int pstate_id_to_freq(int pstate_id)
return powernv_freqs[i].frequency;
 }
 
+/**
+ * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by
+ * the firmware
+ */
+static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy,
+   char *buf)
+{
+   int nominal_freq;
+   nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id);
+   return sprintf(buf, "%u\n", nominal_freq);
+}
+
+
+struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = {
+   .attr = { .name = "cpuinfo_nominal_freq",
+ .mode = 0444,
+   },
+   .show = show_cpuinfo_nominal_freq,
+};
+
+
 static struct freq_attr *powernv_cpu_freq_attr[] = {
&cpufreq_freq_attr_scaling_available_freqs,
+   &cpufreq_freq_attr_cpuinfo_nominal_freq,
NULL,
 };
 
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/4] powernv:cpufreq: Export nominal and current frequency via sysfs

2014-03-03 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

On IBM POWERNV platforms[1], presently we do not report the current
operating frequency of the processor through the sysfs interface
"cpuinfo_cur_freq" since the cpu frequency driver[1] has not
implemented the ->get(unsigned int cpu) method. Fix this by
implementing the ->get(unsigned int cpu) method which will report the
frequency corresponding to the pstate_id on PMSR on "cpu".

Also, export the nominal frequency of the processor through the sysfs
interface "cpuinfo_nominal_freq" by defining it as a driver attribute.

The patch series depends on the cpu-frequency driver patch-series[1].

[1] http://comments.gmane.org/gmane.linux.ports.ppc.embedded/67905

Thanks and Regards
gautham.

Summary:
==
Gautham R. Shenoy (4):
  powernv:cpufreq: Create pstate_id_to_freq() helper
  powernv:cpufreq: Export nominal frequency via sysfs.
  powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
  powernv:cpufreq: Implement the driver->get() method

 drivers/cpufreq/powernv-cpufreq.c | 115 --
 1 file changed, 110 insertions(+), 5 deletions(-)

-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 4/4] powernv:cpufreq: Implement the driver->get() method

2014-03-03 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

The current frequency of a cpu is reported through the sysfs file
cpuinfo_cur_freq. This requires the driver to implement a
"->get(unsigned int cpu)" method which will return the current
operating frequency.

Implement a function named powernv_cpufreq_get() which corresponds to
the required ->get() method.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 41 +++
 1 file changed, 41 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index f0dae6f..5f43e4f 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -218,6 +218,46 @@ static inline void set_pmspr(unsigned long sprn, unsigned 
long val)
BUG();
 }
 
+/**
+ * Computes the current frequency on this cpu
+ * and stores the result in *ret_freq.
+ */
+static void powernv_get_cpu_freq(void *ret_freq)
+{
+   unsigned long pmspr_val;
+   long pstate_id;
+   int *cur_freq, freq;
+
+   cur_freq = (int *)ret_freq;
+   pmspr_val = get_pmspr(SPRN_PMSR);
+   pstate_id = pmspr_val;
+   pstate_id = pstate_id >> 56;
+   WARN_ON(pstate_id > 0);
+   freq = pstate_id_to_freq(pstate_id);
+   pr_debug("cpu %d pmsr %lx pstate_id %ld frequency %d \n",
+   smp_processor_id(), pmspr_val, pstate_id, freq);
+   *cur_freq = freq;
+}
+
+/**
+ * Returns the cpu frequency as reported by the firmware for 'cpu'.
+ * This value is reported through the sysfs file cpuinfo_cur_freq.
+ */
+unsigned int powernv_cpufreq_get(unsigned int cpu)
+{
+   int ret_freq;
+   cpumask_var_t sibling_mask;
+
+   if (!zalloc_cpumask_var(&sibling_mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   powernv_cpu_to_core_mask(cpu, sibling_mask);
+   smp_call_function_any(sibling_mask, powernv_get_cpu_freq, &ret_freq, 1);
+
+   free_cpumask_var(sibling_mask);
+   return ret_freq;
+}
+
 static void set_pstate(void *pstate)
 {
unsigned long val;
@@ -310,6 +350,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
 static struct cpufreq_driver powernv_cpufreq_driver = {
.verify = powernv_cpufreq_verify,
.target = powernv_cpufreq_target,
+   .get= powernv_cpufreq_get,
.init   = powernv_cpufreq_cpu_init,
.exit   = powernv_cpufreq_cpu_exit,
.name   = "powernv-cpufreq",
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 14/52] powerpc, sysfs: Fix CPU hotplug callback registration

2014-03-06 Thread Gautham R Shenoy

Hello Ben,

On Fri, Mar 07, 2014 at 01:57:31PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2014-02-14 at 13:22 +0530, Srivatsa S. Bhat wrote:
> > Subsystems that want to register CPU hotplug callbacks, as well as perform
> > initialization for the CPUs that are already online, often do it as shown
> > below:
> > 
> > get_online_cpus();
> > 
> > for_each_online_cpu(cpu)
> > init_cpu(cpu);
> > 
> > register_cpu_notifier(&foobar_cpu_notifier);
> > 
> > put_online_cpus();
> 
> This patch breaks a good half of my test configs with:
> 
> /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c: In function 
> 'topology_init':
> /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:979:2: error: 
> implicit declaration of function 'cpu_notifier_register_begin' 
> [-Werror=implicit-function-declaration]
> /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:1004:2: error: 
> implicit declaration of function '__register_cpu_notifier' 
> [-Werror=implicit-function-declaration]
> /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:1006:2: error: 
> implicit declaration of function 'cpu_notifier_register_done' 
> [-Werror=implicit-function-declaration]
> cc1: all warnings being treated as errors
> make[2]: *** [arch/powerpc/kernel/sysfs.o] Error 1
> make[2]: *** Waiting for unfinished jobs

This patch depends on
"[PATCH v2 02/52] CPU hotplug: Provide lockless versions of callback 
registration functions" 
of the series (Can be found here: https://lkml.org/lkml/2014/2/14/59).
This particular patch defines 'cpu_notifier_register_begin',
'__register_cpu_notifier', and 'cpu_notifier_register_done' in the
cpu-hotplug core. Are you seeing the build breakage with this patch on
?

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 0/6] powernv:cpufreq: Dynamic cpu-frequency scaling

2014-03-10 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

This is the v2 of the consolidated patchset consisting
patches for enabling cpufreq on IBM POWERNV platforms
along with some enhancements. 

The v1 of these patches have been previously 
submitted on linuxppc-dev [1][2].

- This patchset contains code for the platform driver to support CPU
  frequency scaling on IBM POWERNV platforms.

- In addition to the standard control and status files exposed by the
  cpufreq core, the patchset exposes the nominal frequency through the
  file named "cpuinfo_nominal_freq".

The patchset is based against commit c3bebc71c4bcdafa24b506adf0c1de3c1f77e2e0
of the mainline tree.

[1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-February/115244.html
[2]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-March/115703.html

Gautham R. Shenoy (3):
  powernv:cpufreq: Create pstate_id_to_freq() helper
  powernv:cpufreq: Export nominal frequency via sysfs.
  powernv:cpufreq: Implement the driver->get() method

Srivatsa S. Bhat (2):
  powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
  powernv,cpufreq:Add per-core locking to serialize frequency
transitions

Vaidyanathan Srinivasan (1):
  powernv: cpufreq driver for powernv platform

 arch/powerpc/include/asm/reg.h |   4 +
 arch/powerpc/platforms/powernv/Kconfig |   1 +
 drivers/cpufreq/Kconfig|   1 +
 drivers/cpufreq/Kconfig.powerpc|  13 ++
 drivers/cpufreq/Makefile   |   1 +
 drivers/cpufreq/powernv-cpufreq.c  | 397 +
 6 files changed, 417 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 5/6] powernv:cpufreq: Export nominal frequency via sysfs.

2014-03-10 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Create a driver attribute named cpuinfo_nominal_freq which
creates a sysfs read-only file named cpuinfo_nominal_freq. Export
the frequency corresponding to the nominal_pstate through this
interface.

Nominal frequency is the highest non-turbo frequency for the
platform.  This is generally used for setting governor policies from
user space for optimal energy efficiency.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 0ecd163..183bbc4 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -142,8 +142,30 @@ static unsigned int pstate_id_to_freq(int pstate_id)
return powernv_freqs[i].frequency;
 }
 
+/**
+ * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by
+ * the firmware
+ */
+static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy,
+   char *buf)
+{
+   int nominal_freq;
+   nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id);
+   return sprintf(buf, "%u\n", nominal_freq);
+}
+
+
+struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = {
+   .attr = { .name = "cpuinfo_nominal_freq",
+ .mode = 0444,
+   },
+   .show = show_cpuinfo_nominal_freq,
+};
+
+
 static struct freq_attr *powernv_cpu_freq_attr[] = {
&cpufreq_freq_attr_scaling_available_freqs,
+   &cpufreq_freq_attr_cpuinfo_nominal_freq,
NULL,
 };
 
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 1/6] powernv: cpufreq driver for powernv platform

2014-03-10 Thread Gautham R. Shenoy

From: Vaidyanathan Srinivasan 

Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.

Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Anton Blanchard 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/reg.h |   4 +
 arch/powerpc/platforms/powernv/Kconfig |   1 +
 drivers/cpufreq/Kconfig|   1 +
 drivers/cpufreq/Kconfig.powerpc|  13 ++
 drivers/cpufreq/Makefile   |   1 +
 drivers/cpufreq/powernv-cpufreq.c  | 277 +
 6 files changed, 297 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
+#define SPRN_PMICR 0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR  0x355   /* Power Management Status Reg */
+#define SPRN_PMCR  0x374   /* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG8   /* Enable Target Address Register */
 #define FSCR_EBB_LG7   /* Enable Event Based Branching */
diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 895e8a2..1fe12b1 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -11,6 +11,7 @@ config PPC_POWERNV
select PPC_UDBG_16550
select PPC_SCOM
select ARCH_RANDOM
+   select CPU_FREQ
default y
 
 config PPC_POWERNV_RTAS
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 4b029c0..4ba1632 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS
 choice
prompt "Default CPUFreq governor"
default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || 
ARM_SA1110_CPUFREQ
+   default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ
default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
help
  This option sets which CPUFreq governor shall be loaded at
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..93f8689 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,16 @@ config PPC_PASEMI_CPUFREQ
help
  This adds the support for frequency switching on PA Semi
  PWRficient processors.
+
+config POWERNV_CPUFREQ
+   tristate "CPU frequency scaling for IBM POWERNV platform"
+   depends on PPC_POWERNV
+   select CPU_FREQ_GOV_PERFORMANCE
+   select CPU_FREQ_GOV_POWERSAVE
+   select CPU_FREQ_GOV_USERSPACE
+   select CPU_FREQ_GOV_ONDEMAND
+   select CPU_FREQ_GOV_CONSERVATIVE
+   default y
+   help
+This adds support for CPU frequency switching on IBM POWERNV
+platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)  += pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)   += pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)  += powernv-cpufreq.o
 
 
##
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 000..ab1551f
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,277 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Author: Vaidyanathan Srinivasan 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR

[PATCH v2 3/6] powernv, cpufreq:Add per-core locking to serialize frequency transitions

2014-03-10 Thread Gautham R. Shenoy

From: "Srivatsa S. Bhat" 

On POWER systems, the CPU frequency is controlled at a core-level and
hence we need to serialize so that only one of the threads in the core
switches the core's frequency at a time.

Using a global mutex lock would needlessly serialize _all_ frequency
transitions in the system (across all cores). So introduce per-core
locking to enable finer-grained synchronization and thereby enhance
the speed and responsiveness of the cpufreq driver to varying workload
demands.

The design of per-core locking is very simple and straight-forward: we
first define a Per-CPU lock and use the ones that belongs to the first
thread sibling of the core.

cpu_first_thread_sibling() macro is used to find the *common* lock for
all thread siblings belonging to a core.

Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 4cad727..4c2e8ca 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include 
 #include 
 
-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-Core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)\
+   mutex_lock(&per_cpu(freq_switch_lock,\
+   cpu_first_thread_sibling(cpu)));
+#define unlock_core_freq(cpu)  \
+   mutex_unlock(&per_cpu(freq_switch_lock,\
+   cpu_first_thread_sibling(cpu)));
 
 #define POWERNV_MAX_PSTATES256
 
@@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
freqs.new = powernv_freqs[new_index].frequency;
freqs.cpu = policy->cpu;
 
-   mutex_lock(&freq_switch_mutex);
+   lock_core_freq(policy->cpu);
cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
 
pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -245,7 +252,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
rc = powernv_set_freq(policy->cpus, new_index);
 
cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-   mutex_unlock(&freq_switch_mutex);
+   unlock_core_freq(policy->cpu);
 
return rc;
 }
@@ -262,7 +269,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int __init powernv_cpufreq_init(void)
 {
-   int rc = 0;
+   int cpu, rc = 0;
 
/* Discover pstates from device tree and init */
 
@@ -272,6 +279,10 @@ static int __init powernv_cpufreq_init(void)
pr_info("powernv-cpufreq disabled\n");
return rc;
}
+   /* Init per-core mutex */
+   for_each_possible_cpu(cpu) {
+   mutex_init(&per_cpu(freq_switch_lock, cpu));
+   }
 
rc = cpufreq_register_driver(&powernv_cpufreq_driver);
return rc;
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 2/6] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.

2014-03-10 Thread Gautham R. Shenoy

From: "Srivatsa S. Bhat" 

Create a helper method that computes the cpumask corresponding to the
thread-siblings of a cpu. Use this for initializing the policy->cpus
mask for a given cpu.

(Original code written by Srivatsa S. Bhat. Gautham moved this to a
helper function!)

Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index ab1551f..4cad727 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -115,6 +115,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = {
 
 /* Helper routines */
 
+/**
+ * Sets the bits corresponding to the thread-siblings of cpu in its core
+ * in 'cpus'.
+ */
+static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus)
+{
+   int base, i;
+
+   base = cpu_first_thread_sibling(cpu);
+
+   for (i = 0; i < threads_per_core; i++) {
+   cpumask_set_cpu(base + i, cpus);
+   }
+
+   return;
+}
+
 /* Access helpers to power mgt SPR */
 
 static inline unsigned long get_pmspr(unsigned long sprn)
@@ -180,13 +197,8 @@ static int powernv_set_freq(cpumask_var_t cpus, unsigned 
int new_index)
 
 static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 {
-   int base, i;
-
 #ifdef CONFIG_SMP
-   base = cpu_first_thread_sibling(policy->cpu);
-
-   for (i = 0; i < threads_per_core; i++)
-   cpumask_set_cpu(base + i, policy->cpus);
+   powernv_cpu_to_core_mask(policy->cpu, policy->cpus);
 #endif
policy->cpuinfo.transition_latency = 25000;
 
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 6/6] powernv:cpufreq: Implement the driver->get() method

2014-03-10 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

The current frequency of a cpu is reported through the sysfs file
cpuinfo_cur_freq. This requires the driver to implement a
"->get(unsigned int cpu)" method which will return the current
operating frequency.

Implement a function named powernv_cpufreq_get() which reads the local
pstate from the PMSR and returns the corresponding frequency.

Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get().

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 48 +++
 1 file changed, 48 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 183bbc4..6f3b6e1 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -223,6 +223,53 @@ static inline void set_pmspr(unsigned long sprn, unsigned 
long val)
BUG();
 }
 
+/*
+ * Computes the current frequency on this cpu
+ * and stores the result in *ret_freq.
+ */
+static void powernv_read_cpu_freq(void *ret_freq)
+{
+   unsigned long pmspr_val;
+   s8 local_pstate_id;
+   int *cur_freq, freq, pstate_id;
+
+   cur_freq = (int *)ret_freq;
+   pmspr_val = get_pmspr(SPRN_PMSR);
+
+   /* The local pstate id corresponds bits 48..55 in the PMSR.
+ * Note: Watch out for the sign! */
+   local_pstate_id = (pmspr_val >> 48) & 0xFF;
+   pstate_id = local_pstate_id;
+
+   freq = pstate_id_to_freq(pstate_id);
+   pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n",
+   smp_processor_id(), pmspr_val, pstate_id, freq);
+   *cur_freq = freq;
+}
+
+/*
+ * Returns the cpu frequency as reported by the firmware for 'cpu'.
+ * This value is reported through the sysfs file cpuinfo_cur_freq.
+ */
+unsigned int powernv_cpufreq_get(unsigned int cpu)
+{
+   int ret_freq;
+   cpumask_var_t sibling_mask;
+
+   if (unlikely(!zalloc_cpumask_var(&sibling_mask, GFP_KERNEL))) {
+   smp_call_function_single(cpu, powernv_read_cpu_freq,
+   &ret_freq, 1);
+   return ret_freq;
+   }
+
+   powernv_cpu_to_core_mask(cpu, sibling_mask);
+   smp_call_function_any(sibling_mask, powernv_read_cpu_freq,
+   &ret_freq, 1);
+
+   free_cpumask_var(sibling_mask);
+   return ret_freq;
+}
+
 static void set_pstate(void *pstate)
 {
unsigned long val;
@@ -309,6 +356,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
 static struct cpufreq_driver powernv_cpufreq_driver = {
.verify = powernv_cpufreq_verify,
.target = powernv_cpufreq_target,
+   .get= powernv_cpufreq_get,
.init   = powernv_cpufreq_cpu_init,
.exit   = powernv_cpufreq_cpu_exit,
.name   = "powernv-cpufreq",
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 4/6] powernv:cpufreq: Create pstate_id_to_freq() helper

2014-03-10 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Create a helper routine that can return the cpu-frequency for the
corresponding pstate_id.

Also, cache the values of the pstate_max, pstate_min and
pstate_nominal and nr_pstates in a static structure so that they can
be reused in the future to perform any validations.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 4c2e8ca..0ecd163 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
 static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
 
+struct powernv_pstate_info {
+   int pstate_min_id;
+   int pstate_max_id;
+   int pstate_nominal_id;
+   int nr_pstates;
+};
+static struct powernv_pstate_info powernv_pstate_info;
+
 /*
  * Initialize the freq table based on data obtained
  * from the firmware passed via device-tree
@@ -112,9 +120,28 @@ static int init_powernv_pstates(void)
for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
 
+   powernv_pstate_info.pstate_min_id = pstate_min;
+   powernv_pstate_info.pstate_max_id = pstate_max;
+   powernv_pstate_info.pstate_nominal_id = pstate_nominal;
+   powernv_pstate_info.nr_pstates = nr_pstates;
+
return 0;
 }
 
+/**
+ * Returns the cpu frequency corresponding to the pstate_id.
+ */
+static unsigned int pstate_id_to_freq(int pstate_id)
+{
+   int i;
+
+   i = powernv_pstate_info.pstate_max_id - pstate_id;
+
+   BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
+   WARN_ON(powernv_pstate_ids[i] != pstate_id);
+   return powernv_freqs[i].frequency;
+}
+
 static struct freq_attr *powernv_cpu_freq_attr[] = {
&cpufreq_freq_attr_scaling_available_freqs,
NULL,
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RFC PATCH 1/2] powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo

2014-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently, the code in setup-common.c for powerpc assumes that all
clock rates are same in a smp system. This value is cached in the
variable named ppc_proc_freq and is the value that is reported in
/proc/cpuinfo.

However on the PowerNV platform, the clock rate is same only across
the threads of the same core. Hence the value that is reported in
/proc/cpuinfo is incorrect on PowerNV platforms. We need a better way
to query and report the correct value of the processor clock in
/proc/cpuinfo.

The patch achieves this by creating a machdep_call named
get_proc_freq() which is expected to returns the frequency in Hz. The
code in show_cpuinfo() can invoke this method to display the correct
clock rate on platforms that have implemented this method. On the
other powerpc platforms it can use the value cached in ppc_proc_freq.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/machdep.h |  2 ++
 arch/powerpc/kernel/setup-common.c | 16 
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index ad3025d..eb1c6d4 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -113,6 +113,8 @@ struct machdep_calls {
/* Optional, may be NULL. */
void(*show_cpuinfo)(struct seq_file *m);
void(*show_percpuinfo)(struct seq_file *m, int i);
+   /* Returns the current operating frequency of "cpu" in Hz */
+   unsigned long   (*get_proc_freq)(unsigned int cpu);
 
void(*init_IRQ)(void);
 
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index bc76cc6..bdd3045 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -209,6 +209,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 {
unsigned long cpu_id = (unsigned long)v - 1;
unsigned int pvr;
+   unsigned long proc_freq;
unsigned short maj;
unsigned short min;
 
@@ -260,12 +261,19 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 #endif /* CONFIG_TAU */
 
/*
-* Assume here that all clock rates are the same in a
-* smp system.  -- Cort
+* Platforms that have variable clock rates, should implement
+* the method ppc_md.get_proc_freq() that reports the clock
+* rate of a given cpu. The rest can use ppc_proc_freq to
+* report the clock rate that is same across all cpus.
 */
-   if (ppc_proc_freq)
+   if (ppc_md.get_proc_freq)
+   proc_freq = ppc_md.get_proc_freq(cpu_id);
+   else
+   proc_freq = ppc_proc_freq;
+
+   if (proc_freq)
seq_printf(m, "clock\t\t: %lu.%06luMHz\n",
-  ppc_proc_freq / 100, ppc_proc_freq % 100);
+  proc_freq / 100, proc_freq % 100);
 
if (ppc_md.show_percpuinfo != NULL)
ppc_md.show_percpuinfo(m, cpu_id);
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RFC PATCH 2/2] powerpc: powernv: Implement ppc_md.get_proc_freq()

2014-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Implement a method named pnv_get_proc_freq(unsigned int cpu) which
returns the current clock rate on the 'cpu' in Hz to be reported in
/proc/cpuinfo. This method uses the value reported by cpufreq when
such a value is sane. Otherwise it falls back to old way of reporting
the clockrate, i.e. ppc_proc_freq.

Set the ppc_md.get_proc_freq() hook to pnv_get_proc_freq() on the
PowerNV platform.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/powernv/setup.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index 110f4fb..423e36d 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -235,6 +236,25 @@ void powernv_idle(void)
}
 }
 
+/*
+ * Returns the cpu frequency for 'cpu' in Hz. This is used by
+ * /proc/cpuinfo
+ */
+unsigned long pnv_get_proc_freq(unsigned int cpu)
+{
+   unsigned long ret_freq;
+
+   ret_freq = cpufreq_quick_get(cpu) * 1000ul;
+
+   /*
+* If the backend cpufreq driver does not exist,
+ * then fallback to old way of reporting the clockrate.
+*/
+   if (!ret_freq)
+   ret_freq = ppc_proc_freq;
+   return ret_freq;
+}
+
 define_machine(powernv) {
.name   = "PowerNV",
.probe  = pnv_probe,
@@ -242,6 +262,7 @@ define_machine(powernv) {
.setup_arch = pnv_setup_arch,
.init_IRQ   = pnv_init_IRQ,
.show_cpuinfo   = pnv_show_cpuinfo,
+   .get_proc_freq  = pnv_get_proc_freq,
.progress   = pnv_progress,
.machine_shutdown   = pnv_shutdown,
.power_save = powernv_idle,
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RFC PATCH 0/2] powernv: Show the correct clock value in /proc/cpuinfo

2014-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

Currently, the code in setup-common.c for powerpc assumes that all
clock rates are same in a smp system. This value is cached in the
variable named ppc_proc_freq and is the value that is reported in
/proc/cpuinfo.

However on the PowerNV platform, the clock rate is same only across
the threads of the same core. Hence the value that is reported in
/proc/cpuinfo is incorrect on PowerNV platforms.

This patch-series fixes this problem by having /proc/cpuinfo report
the value returned by cpufreq_quick_get(cpu) whenever the cpufreq
backend driver is available and fallback to the old way of reporting
the clock rate in its absence.

These patches depend on the patches to enable dynamic cpufrequency
scaling on PowerNV that can be found here:
http://linuxppc.10917.n7.nabble.com/PATCH-v2-0-6-powernv-cpufreq-Dynamic-cpu-frequency-scaling-td80641.html

Gautham R. Shenoy (2):
  powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo
  powerpc: powernv: Implement ppc_md.get_proc_freq()

 arch/powerpc/include/asm/machdep.h |  2 ++
 arch/powerpc/kernel/setup-common.c | 16 
 arch/powerpc/platforms/powernv/setup.c | 21 +
 3 files changed, 35 insertions(+), 4 deletions(-)

-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 2/6] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.

2014-03-19 Thread Gautham R Shenoy

On Wed, Mar 19, 2014 at 12:05:20PM +0530, Srivatsa S. Bhat wrote:
> On 03/19/2014 05:07 AM, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-03-10 at 16:40 +0530, Gautham R. Shenoy wrote:
> >> From: "Srivatsa S. Bhat" 
> >>
> >> Create a helper method that computes the cpumask corresponding to the
> >> thread-siblings of a cpu. Use this for initializing the policy->cpus
> >> mask for a given cpu.
> >>
> >> (Original code written by Srivatsa S. Bhat. Gautham moved this to a
> >> helper function!)
> > 
> > How does that differ from the existing entry in the sibling map  which
> > you can obtain with the helper cpu_sibling_mask() ? (You probably want
> > to *copy* the mask of course but I don't see the need of re-creating
> > one from scratch).
> >
> 
> The intent here was to have a sibling mask that is invariant of CPU
> hotplug. cpu_sibling_mask is updated on every hotplug operation, and this
> would break cpufreq, since policy->cpus has to be hotplug invariant.
> 
> This should have been noted in the changelog of patch 1 as well as this
> patch. (The earlier (internal) versions of this patchset had the
> description in the changelogs, but somehow it got dropped accidentally).
> I'll work with Gautham and ensure that we have this important info
> included in the changelog. Thanks for pointing it out!

I reused that part of the code because I was unaware of
cpu_sibling_mask()!

For implementing powernv_cpufreq_get(), cpu_sibling_mask() suffices
since we are using this mask as a parameter to
smp_call_function_any(), and hence are concerned only about the online
siblings. 

I shall fix the changelog to reflect Srivatsa's reason for having a
mask that does not vary with cpu-hotplug.

> 
> Regards,
> Srivatsa S. Bhat
> 
> 
> > Also, this should have been CCed to the cpufreq mailing list and maybe
> > the relevant maintainer too.

Ok. Will do it in the subsequent version. 

Thanks for the review!

> > 
> > Cheers,
> > Ben.
> > 
> >> Signed-off-by: Srivatsa S. Bhat 
> >> Signed-off-by: Gautham R. Shenoy 
> >> ---
> >>  drivers/cpufreq/powernv-cpufreq.c | 24 ++--
> >>  1 file changed, 18 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> >> b/drivers/cpufreq/powernv-cpufreq.c
> >> index ab1551f..4cad727 100644
> >> --- a/drivers/cpufreq/powernv-cpufreq.c
> >> +++ b/drivers/cpufreq/powernv-cpufreq.c
> >> @@ -115,6 +115,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = {
> >>  
> >>  /* Helper routines */
> >>  
> >> +/**
> >> + * Sets the bits corresponding to the thread-siblings of cpu in its core
> >> + * in 'cpus'.
> >> + */
> >> +static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus)
> >> +{
> >> +  int base, i;
> >> +
> >> +  base = cpu_first_thread_sibling(cpu);
> >> +
> >> +  for (i = 0; i < threads_per_core; i++) {
> >> +  cpumask_set_cpu(base + i, cpus);
> >> +  }
> >> +
> >> +  return;
> >> +}
> >> +
> >>  /* Access helpers to power mgt SPR */
> >>  
> >>  static inline unsigned long get_pmspr(unsigned long sprn)
> >> @@ -180,13 +197,8 @@ static int powernv_set_freq(cpumask_var_t cpus, 
> >> unsigned int new_index)
> >>  
> >>  static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
> >>  {
> >> -  int base, i;
> >> -
> >>  #ifdef CONFIG_SMP
> >> -  base = cpu_first_thread_sibling(policy->cpu);
> >> -
> >> -  for (i = 0; i < threads_per_core; i++)
> >> -  cpumask_set_cpu(base + i, policy->cpus);
> >> +  powernv_cpu_to_core_mask(policy->cpu, policy->cpus);
> >>  #endif
> >>policy->cpuinfo.transition_latency = 25000;
> >>  
> > 
> > 
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 0/5] powernv: Enable Dynamic Frequency

2014-03-20 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

This is the v3 of the consolidated patchset consisting
patches for enabling cpufreq on IBM POWERNV platforms
along with some enhancements. I have incorporated the review 
for v2 (which can be found at [1]).

- This patchset contains code for the platform driver to support CPU
  frequency scaling on IBM POWERNV platforms.

- In addition to the standard control and status files exposed by the
  cpufreq core, the patchset exposes the nominal frequency through the
  file named "cpuinfo_nominal_freq".


The changes from from v2:
  - Updated the changelog for "powernv: cpufreq driver for powernv platform"
to record the fact that policy->cpus must be populated with a sibling mask
that includes even the offlined siblings.

  - Dropped the patch named 
"powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper"
 since the driver->get() method can use cpu_sibling_mask().
  
  -  Updated "powernv:cpufreq: Implement the driver->get() method" to 
 use cpu_sibling_mask() in powernv_cpufreq_get()

The patchset is based against commit 4907cdca7210c5895311bddcf05a4c85b67d8566
of the mainline tree.

[1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-March/115861.html


Gautham R. Shenoy (3):
  powernv:cpufreq: Create pstate_id_to_freq() helper
  powernv:cpufreq: Export nominal frequency via sysfs.
  powernv:cpufreq: Implement the driver->get() method

Srivatsa S. Bhat (1):
  powernv,cpufreq:Add per-core locking to serialize frequency
transitions

Vaidyanathan Srinivasan (1):
  powernv: cpufreq driver for powernv platform

 arch/powerpc/include/asm/reg.h |   4 +
 arch/powerpc/platforms/powernv/Kconfig |   1 +
 drivers/cpufreq/Kconfig|   1 +
 drivers/cpufreq/Kconfig.powerpc|  13 ++
 drivers/cpufreq/Makefile   |   1 +
 drivers/cpufreq/powernv-cpufreq.c  | 375 +
 6 files changed, 395 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 1/5] powernv: cpufreq driver for powernv platform

2014-03-20 Thread Gautham R. Shenoy

From: Vaidyanathan Srinivasan 

Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.

The policy->cpus needs to be populated in a hotplug-invariant manner
instead of using cpu_sibling_mask() which varies with cpu-hotplug. This
is because the cpufreq core code copies this content into policy->related_cpus
mask which is should not vary on cpu-hotplug.

[Change log updated by e...@linux.vnet.ibm.com]

Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Anton Blanchard 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/reg.h |   4 +
 arch/powerpc/platforms/powernv/Kconfig |   1 +
 drivers/cpufreq/Kconfig|   1 +
 drivers/cpufreq/Kconfig.powerpc|  13 ++
 drivers/cpufreq/Makefile   |   1 +
 drivers/cpufreq/powernv-cpufreq.c  | 277 +
 6 files changed, 297 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
+#define SPRN_PMICR 0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR  0x355   /* Power Management Status Reg */
+#define SPRN_PMCR  0x374   /* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG8   /* Enable Target Address Register */
 #define FSCR_EBB_LG7   /* Enable Event Based Branching */
diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 895e8a2..1fe12b1 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -11,6 +11,7 @@ config PPC_POWERNV
select PPC_UDBG_16550
select PPC_SCOM
select ARCH_RANDOM
+   select CPU_FREQ
default y
 
 config PPC_POWERNV_RTAS
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 4b029c0..4ba1632 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS
 choice
prompt "Default CPUFreq governor"
default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || 
ARM_SA1110_CPUFREQ
+   default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ
default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
help
  This option sets which CPUFreq governor shall be loaded at
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..93f8689 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,16 @@ config PPC_PASEMI_CPUFREQ
help
  This adds the support for frequency switching on PA Semi
  PWRficient processors.
+
+config POWERNV_CPUFREQ
+   tristate "CPU frequency scaling for IBM POWERNV platform"
+   depends on PPC_POWERNV
+   select CPU_FREQ_GOV_PERFORMANCE
+   select CPU_FREQ_GOV_POWERSAVE
+   select CPU_FREQ_GOV_USERSPACE
+   select CPU_FREQ_GOV_ONDEMAND
+   select CPU_FREQ_GOV_CONSERVATIVE
+   default y
+   help
+This adds support for CPU frequency switching on IBM POWERNV
+platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)  += pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)   += pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)  += powernv-cpufreq.o
 
 
##
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 000..ab1551f
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,277 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Author: Vaidyanathan Srinivasan 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms

[PATCH v3 2/5] powernv, cpufreq:Add per-core locking to serialize frequency transitions

2014-03-20 Thread Gautham R. Shenoy

From: "Srivatsa S. Bhat" 

On POWER systems, the CPU frequency is controlled at a core-level and
hence we need to serialize so that only one of the threads in the core
switches the core's frequency at a time.

Using a global mutex lock would needlessly serialize _all_ frequency
transitions in the system (across all cores). So introduce per-core
locking to enable finer-grained synchronization and thereby enhance
the speed and responsiveness of the cpufreq driver to varying workload
demands.

The design of per-core locking is very simple and straight-forward: we
first define a Per-CPU lock and use the ones that belongs to the first
thread sibling of the core.

cpu_first_thread_sibling() macro is used to find the *common* lock for
all thread siblings belonging to a core.

Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index ab1551f..66dae0d 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include 
 #include 
 
-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-Core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)\
+   mutex_lock(&per_cpu(freq_switch_lock,\
+   cpu_first_thread_sibling(cpu)));
+#define unlock_core_freq(cpu)  \
+   mutex_unlock(&per_cpu(freq_switch_lock,\
+   cpu_first_thread_sibling(cpu)));
 
 #define POWERNV_MAX_PSTATES256
 
@@ -221,7 +228,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
freqs.new = powernv_freqs[new_index].frequency;
freqs.cpu = policy->cpu;
 
-   mutex_lock(&freq_switch_mutex);
+   lock_core_freq(policy->cpu);
cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
 
pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
rc = powernv_set_freq(policy->cpus, new_index);
 
cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-   mutex_unlock(&freq_switch_mutex);
+   unlock_core_freq(policy->cpu);
 
return rc;
 }
@@ -250,7 +257,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int __init powernv_cpufreq_init(void)
 {
-   int rc = 0;
+   int cpu, rc = 0;
 
/* Discover pstates from device tree and init */
 
@@ -260,6 +267,10 @@ static int __init powernv_cpufreq_init(void)
pr_info("powernv-cpufreq disabled\n");
return rc;
}
+   /* Init per-core mutex */
+   for_each_possible_cpu(cpu) {
+   mutex_init(&per_cpu(freq_switch_lock, cpu));
+   }
 
rc = cpufreq_register_driver(&powernv_cpufreq_driver);
return rc;
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 3/5] powernv:cpufreq: Create pstate_id_to_freq() helper

2014-03-20 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Create a helper routine that can return the cpu-frequency for the
corresponding pstate_id.

Also, cache the values of the pstate_max, pstate_min and
pstate_nominal and nr_pstates in a static structure so that they can
be reused in the future to perform any validations.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 66dae0d..e7b0292 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
 static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
 
+struct powernv_pstate_info {
+   int pstate_min_id;
+   int pstate_max_id;
+   int pstate_nominal_id;
+   int nr_pstates;
+};
+static struct powernv_pstate_info powernv_pstate_info;
+
 /*
  * Initialize the freq table based on data obtained
  * from the firmware passed via device-tree
@@ -112,9 +120,28 @@ static int init_powernv_pstates(void)
for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
 
+   powernv_pstate_info.pstate_min_id = pstate_min;
+   powernv_pstate_info.pstate_max_id = pstate_max;
+   powernv_pstate_info.pstate_nominal_id = pstate_nominal;
+   powernv_pstate_info.nr_pstates = nr_pstates;
+
return 0;
 }
 
+/**
+ * Returns the cpu frequency corresponding to the pstate_id.
+ */
+static unsigned int pstate_id_to_freq(int pstate_id)
+{
+   int i;
+
+   i = powernv_pstate_info.pstate_max_id - pstate_id;
+
+   BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
+   WARN_ON(powernv_pstate_ids[i] != pstate_id);
+   return powernv_freqs[i].frequency;
+}
+
 static struct freq_attr *powernv_cpu_freq_attr[] = {
&cpufreq_freq_attr_scaling_available_freqs,
NULL,
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 4/5] powernv:cpufreq: Export nominal frequency via sysfs.

2014-03-20 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Create a driver attribute named cpuinfo_nominal_freq which
creates a sysfs read-only file named cpuinfo_nominal_freq. Export
the frequency corresponding to the nominal_pstate through this
interface.

Nominal frequency is the highest non-turbo frequency for the
platform.  This is generally used for setting governor policies from
user space for optimal energy efficiency.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index e7b0292..46bee8a 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -142,8 +142,30 @@ static unsigned int pstate_id_to_freq(int pstate_id)
return powernv_freqs[i].frequency;
 }
 
+/**
+ * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by
+ * the firmware
+ */
+static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy,
+   char *buf)
+{
+   int nominal_freq;
+   nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id);
+   return sprintf(buf, "%u\n", nominal_freq);
+}
+
+
+struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = {
+   .attr = { .name = "cpuinfo_nominal_freq",
+ .mode = 0444,
+   },
+   .show = show_cpuinfo_nominal_freq,
+};
+
+
 static struct freq_attr *powernv_cpu_freq_attr[] = {
&cpufreq_freq_attr_scaling_available_freqs,
+   &cpufreq_freq_attr_cpuinfo_nominal_freq,
NULL,
 };
 
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method

2014-03-20 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

The current frequency of a cpu is reported through the sysfs file
cpuinfo_cur_freq. This requires the driver to implement a
"->get(unsigned int cpu)" method which will return the current
operating frequency.

Implement a function named powernv_cpufreq_get() which reads the local
pstate from the PMSR and returns the corresponding frequency.

Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get().

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 46bee8a..ef6ed8c 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, unsigned 
long val)
BUG();
 }
 
+/*
+ * Computes the current frequency on this cpu
+ * and stores the result in *ret_freq.
+ */
+static void powernv_read_cpu_freq(void *ret_freq)
+{
+   unsigned long pmspr_val;
+   s8 local_pstate_id;
+   int *cur_freq, freq, pstate_id;
+
+   cur_freq = (int *)ret_freq;
+   pmspr_val = get_pmspr(SPRN_PMSR);
+
+   /* The local pstate id corresponds bits 48..55 in the PMSR.
+ * Note: Watch out for the sign! */
+   local_pstate_id = (pmspr_val >> 48) & 0xFF;
+   pstate_id = local_pstate_id;
+
+   freq = pstate_id_to_freq(pstate_id);
+   pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n",
+   smp_processor_id(), pmspr_val, pstate_id, freq);
+   *cur_freq = freq;
+}
+
+/*
+ * Returns the cpu frequency as reported by the firmware for 'cpu'.
+ * This value is reported through the sysfs file cpuinfo_cur_freq.
+ */
+unsigned int powernv_cpufreq_get(unsigned int cpu)
+{
+   int ret_freq;
+
+   smp_call_function_any(cpu_sibling_mask(cpu), powernv_read_cpu_freq,
+   &ret_freq, 1);
+   return ret_freq;
+}
+
 static void set_pstate(void *pstate)
 {
unsigned long val;
@@ -297,6 +334,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
 static struct cpufreq_driver powernv_cpufreq_driver = {
.verify = powernv_cpufreq_verify,
.target = powernv_cpufreq_target,
+   .get= powernv_cpufreq_get,
.init   = powernv_cpufreq_cpu_init,
.exit   = powernv_cpufreq_cpu_exit,
.name   = "powernv-cpufreq",
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 2/5] powernv, cpufreq:Add per-core locking to serialize frequency transitions

2014-03-20 Thread Gautham R Shenoy

On Thu, Mar 20, 2014 at 05:40:57PM +0530, Gautham R. Shenoy wrote:
> From: "Srivatsa S. Bhat" 
> 
> On POWER systems, the CPU frequency is controlled at a core-level and
> hence we need to serialize so that only one of the threads in the core
> switches the core's frequency at a time.
> 
> Using a global mutex lock would needlessly serialize _all_ frequency
> transitions in the system (across all cores). So introduce per-core
> locking to enable finer-grained synchronization and thereby enhance
> the speed and responsiveness of the cpufreq driver to varying workload
> demands.
> 
> The design of per-core locking is very simple and straight-forward: we
> first define a Per-CPU lock and use the ones that belongs to the first
> thread sibling of the core.
> 
> cpu_first_thread_sibling() macro is used to find the *common* lock for
> all thread siblings belonging to a core.
> 

Forgot to add the following line.

Reviewed-by: Preeti U Murthy 
> Signed-off-by: Srivatsa S. Bhat 
> Signed-off-by: Vaidyanathan Srinivasan 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 21 -
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index ab1551f..66dae0d 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -24,8 +24,15 @@
>  #include 
>  #include 
> 
> -/* FIXME: Make this per-core */
> -static DEFINE_MUTEX(freq_switch_mutex);
> +/* Per-Core locking for frequency transitions */
> +static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
> +
> +#define lock_core_freq(cpu)  \
> + mutex_lock(&per_cpu(freq_switch_lock,\
> + cpu_first_thread_sibling(cpu)));
> +#define unlock_core_freq(cpu)\
> + mutex_unlock(&per_cpu(freq_switch_lock,\
> + cpu_first_thread_sibling(cpu)));
> 
>  #define POWERNV_MAX_PSTATES  256
> 
> @@ -221,7 +228,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
> *policy,
>   freqs.new = powernv_freqs[new_index].frequency;
>   freqs.cpu = policy->cpu;
> 
> - mutex_lock(&freq_switch_mutex);
> + lock_core_freq(policy->cpu);
>   cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
> 
>   pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
> @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
> *policy,
>   rc = powernv_set_freq(policy->cpus, new_index);
> 
>   cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
> - mutex_unlock(&freq_switch_mutex);
> + unlock_core_freq(policy->cpu);
> 
>   return rc;
>  }
> @@ -250,7 +257,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
> 
>  static int __init powernv_cpufreq_init(void)
>  {
> - int rc = 0;
> + int cpu, rc = 0;
> 
>   /* Discover pstates from device tree and init */
> 
> @@ -260,6 +267,10 @@ static int __init powernv_cpufreq_init(void)
>   pr_info("powernv-cpufreq disabled\n");
>   return rc;
>   }
> + /* Init per-core mutex */
> + for_each_possible_cpu(cpu) {
> + mutex_init(&per_cpu(freq_switch_lock, cpu));
> + }
> 
>   rc = cpufreq_register_driver(&powernv_cpufreq_driver);
>   return rc;
> -- 
> 1.8.3.1
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 3/5] powernv:cpufreq: Create pstate_id_to_freq() helper

2014-03-20 Thread Gautham R Shenoy

On Thu, Mar 20, 2014 at 05:40:58PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> Create a helper routine that can return the cpu-frequency for the
> corresponding pstate_id.
> 
> Also, cache the values of the pstate_max, pstate_min and
> pstate_nominal and nr_pstates in a static structure so that they can
> be reused in the future to perform any validations.
> 

Forgot to add the following line:

  Reviewed-by: Preeti U Murthy 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 66dae0d..e7b0292 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>  static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
> 
> +struct powernv_pstate_info {
> + int pstate_min_id;
> + int pstate_max_id;
> + int pstate_nominal_id;
> + int nr_pstates;
> +};
> +static struct powernv_pstate_info powernv_pstate_info;
> +
>  /*
>   * Initialize the freq table based on data obtained
>   * from the firmware passed via device-tree
> @@ -112,9 +120,28 @@ static int init_powernv_pstates(void)
>   for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
>   pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
> 
> + powernv_pstate_info.pstate_min_id = pstate_min;
> + powernv_pstate_info.pstate_max_id = pstate_max;
> + powernv_pstate_info.pstate_nominal_id = pstate_nominal;
> + powernv_pstate_info.nr_pstates = nr_pstates;
> +
>   return 0;
>  }
> 
> +/**
> + * Returns the cpu frequency corresponding to the pstate_id.
> + */
> +static unsigned int pstate_id_to_freq(int pstate_id)
> +{
> + int i;
> +
> + i = powernv_pstate_info.pstate_max_id - pstate_id;
> +
> + BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
> + WARN_ON(powernv_pstate_ids[i] != pstate_id);
> + return powernv_freqs[i].frequency;
> +}
> +
>  static struct freq_attr *powernv_cpu_freq_attr[] = {
>   &cpufreq_freq_attr_scaling_available_freqs,
>   NULL,
> -- 
> 1.8.3.1
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method

2014-03-20 Thread Gautham R Shenoy

On Thu, Mar 20, 2014 at 05:41:00PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> The current frequency of a cpu is reported through the sysfs file
> cpuinfo_cur_freq. This requires the driver to implement a
> "->get(unsigned int cpu)" method which will return the current
> operating frequency.
> 
> Implement a function named powernv_cpufreq_get() which reads the local
> pstate from the PMSR and returns the corresponding frequency.
> 
> Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get().

Forgot to add 

  Reviewed-by: Preeti U Murthy 
> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 38 ++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 46bee8a..ef6ed8c 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, 
> unsigned long val)
>   BUG();
>  }
> 
> +/*
> + * Computes the current frequency on this cpu
> + * and stores the result in *ret_freq.
> + */
> +static void powernv_read_cpu_freq(void *ret_freq)
> +{
> + unsigned long pmspr_val;
> + s8 local_pstate_id;
> + int *cur_freq, freq, pstate_id;
> +
> + cur_freq = (int *)ret_freq;
> + pmspr_val = get_pmspr(SPRN_PMSR);
> +
> + /* The local pstate id corresponds bits 48..55 in the PMSR.
> + * Note: Watch out for the sign! */
> + local_pstate_id = (pmspr_val >> 48) & 0xFF;
> + pstate_id = local_pstate_id;
> +
> + freq = pstate_id_to_freq(pstate_id);
> + pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n",
> + smp_processor_id(), pmspr_val, pstate_id, freq);
> + *cur_freq = freq;
> +}
> +
> +/*
> + * Returns the cpu frequency as reported by the firmware for 'cpu'.
> + * This value is reported through the sysfs file cpuinfo_cur_freq.
> + */
> +unsigned int powernv_cpufreq_get(unsigned int cpu)
> +{
> + int ret_freq;
> +
> + smp_call_function_any(cpu_sibling_mask(cpu), powernv_read_cpu_freq,
> + &ret_freq, 1);
> + return ret_freq;
> +}
> +
>  static void set_pstate(void *pstate)
>  {
>   unsigned long val;
> @@ -297,6 +334,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
> *policy,
>  static struct cpufreq_driver powernv_cpufreq_driver = {
>   .verify = powernv_cpufreq_verify,
>   .target = powernv_cpufreq_target,
> + .get= powernv_cpufreq_get,
>   .init   = powernv_cpufreq_cpu_init,
>   .exit   = powernv_cpufreq_cpu_exit,
>   .name   = "powernv-cpufreq",
> -- 
> 1.8.3.1
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 0/5] powernv: Enable Dynamic Frequency

2014-03-21 Thread Gautham R Shenoy

Hi Viresh,

On Fri, Mar 21, 2014 at 01:16:03PM +0530, Viresh Kumar wrote:
> On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy
>  wrote:
> > This is the v3 of the consolidated patchset consisting
> > patches for enabling cpufreq on IBM POWERNV platforms
> > along with some enhancements.
> 
> This is the first time I saw them. Looks like you never Cc'd linux-pm list.
> Also, would be better to keep Maintainers in --to field so that they can
> review them as soon as possible. Otherwise they might stay on lists for
> a long time..

Yes, I hadn't Cc'ed linux-pm list last time around. Sorry about
that. Will keep this in mind the next time around. Also, your
suggestion regarding keeping the Maintainers in --to field is well
taken.

--
Thanks and Regards
gautham.
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 4/5] powernv:cpufreq: Export nominal frequency via sysfs.

2014-03-21 Thread Gautham R Shenoy

On Fri, Mar 21, 2014 at 02:17:28PM +0530, Viresh Kumar wrote:
> > +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy,
> > +   char *buf)
> > +{
> > +   int nominal_freq;
> > +   nominal_freq = 
> > pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id);
> > +   return sprintf(buf, "%u\n", nominal_freq);
> 
> return sprintf(buf, "%u\n",
> pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id));

Sure. Will fix this. 
> 
> ??
> 
> > +}
> > +
> > +
> 
> remove extra blank line.
> 
> > +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = {
> > +   .attr = { .name = "cpuinfo_nominal_freq",
> > + .mode = 0444,
> > +   },
> 
> Align {}

Probably the use of ATTR_RO(cpuinfo_nominal_freq) and renaming
show_cpuinfo_nominal_freq to cpuinfo_nominal_freq_show() would be even
better. What do you think ?

> 
> > +   .show = show_cpuinfo_nominal_freq,
> > +};
> > +
> > +
> >  static struct freq_attr *powernv_cpu_freq_attr[] = {
> > &cpufreq_freq_attr_scaling_available_freqs,
> > +   &cpufreq_freq_attr_cpuinfo_nominal_freq,
> > NULL,
> >  };

This needs to be rewritten to include the entries of cpufreq_generic_attr.
>

Thanks for the review.

--
Regards
gautham. 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform

2014-03-21 Thread Gautham R Shenoy

Hi Viresh,

On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote:
> On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy
>  wrote:
> > From: Vaidyanathan Srinivasan 
> 
> Hi Vaidy,
> 
> > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> > index 4b029c0..4ba1632 100644
> > --- a/drivers/cpufreq/Kconfig
> > +++ b/drivers/cpufreq/Kconfig
> > @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS
> >  choice
> > prompt "Default CPUFreq governor"
> > default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || 
> > ARM_SA1110_CPUFREQ
> > +   default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ
> 
> Probably we should remove SA1100's entry as well from here. This is
> not the right way of doing it. Imagine 100 platforms having entries here.
> If you want it, then select it from your platforms Kconfig.

Sure. Will move these bits and the other governor related bits to the
Powernv Kconfig.

> > diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> > b/drivers/cpufreq/powernv-cpufreq.c
> > new file mode 100644
> > index 000..ab1551f
> > --- /dev/null
> > +
> > +#define pr_fmt(fmt)"powernv-cpufreq: " fmt
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> 
> That's it? Sure?
> 
> Even if things compile for you, you must explicitly include all the
> files on which
> you depend.

Ok. 

> 
> > +
> > +   WARN_ON(len_ids != len_freqs);
> > +   nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
> > +   WARN_ON(!nr_pstates);
> 
> Why do you want to continue here?

Good point. We might be better off exiting at this point. 

> 
> > +   pr_debug("NR PStates %d\n", nr_pstates);
> > +   for (i = 0; i < nr_pstates; i++) {
> > +   u32 id = be32_to_cpu(pstate_ids[i]);
> > +   u32 freq = be32_to_cpu(pstate_freqs[i]);
> > +
> > +   pr_debug("PState id %d freq %d MHz\n", id, freq);
> > +   powernv_freqs[i].driver_data = i;
> 
> I don't think you are using this field at all and this is the field you can
> use for driver_data and so you can get rid of powernv_pstate_ids[ ].

Using driver_data to record powernv_pstate_ids won't work since
powernv_pstate_ids can be negative. So a pstate_id -3 can be confused
with CPUFREQ_BOOST_FREQ thereby not displaying the frequency
corresponding to pstate id -3. So for now I think we will be retaining
powernv_pstate_ids.

> 
> > +   powernv_freqs[i].frequency = freq * 1000; /* kHz */
> > +   powernv_pstate_ids[i] = id;
> > +   }
> > +   /* End of list marker entry */
> > +   powernv_freqs[i].driver_data = 0;
> 
> Not required.

Ok.
> 
> > +   powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
> > +
> > +   /* Print frequency table */
> > +   for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
> > +   pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
> 
> You have already printed this table earlier..

Fair enough.

> 
> > +
> > +   return 0;
> > +}
> > +
> > +static struct freq_attr *powernv_cpu_freq_attr[] = {
> > +   &cpufreq_freq_attr_scaling_available_freqs,
> > +   NULL,
> > +};
> 
> Can use this instead: cpufreq_generic_attr?

In this patch yes. But later patch introduces an additional attribute
for displaying the nominal frequency. Will handle that part in a clean
way in the next version.

> 
> > +/* Helper routines */
> > +
> > +/* Access helpers to power mgt SPR */
> > +
> > +static inline unsigned long get_pmspr(unsigned long sprn)
> 
> Looks big enough not be inlined?

It is called from one function. It has been defined separately for
readability. 

> 
> > +{
> > +   switch (sprn) {
> > +   case SPRN_PMCR:
> > +   return mfspr(SPRN_PMCR);
> > +
> > +   case SPRN_PMICR:
> > +   return mfspr(SPRN_PMICR);
> > +
> > +   case SPRN_PMSR:
> > +   return mfspr(SPRN_PMSR);
> > +   }
> > +   BUG();
> > +}
> > +
> > +static inline void set_pmspr(unsigned long sprn, unsigned long val)
> > +{
> 
> Same here..

Same reason as above.

> 
> > +   switch (sprn) {
> > +   case SPRN_PMCR:
> > +   mtspr(SPRN_PMCR, val);
> > +   return;
> > +
> > +   case SPRN_PMICR:
> > +   mtspr(SPRN_PMICR, val);
> > +   return;

Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method

2014-03-21 Thread Gautham R Shenoy

On Fri, Mar 21, 2014 at 03:01:29PM +0530, Viresh Kumar wrote:
> On Thu, Mar 20, 2014 at 5:41 PM, Gautham R. Shenoy
>  wrote:
> > From: "Gautham R. Shenoy" 
> >
> > The current frequency of a cpu is reported through the sysfs file
> > cpuinfo_cur_freq. This requires the driver to implement a
> > "->get(unsigned int cpu)" method which will return the current
> > operating frequency.
> >
> > Implement a function named powernv_cpufreq_get() which reads the local
> > pstate from the PMSR and returns the corresponding frequency.
> >
> > Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get().
> >
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> >  drivers/cpufreq/powernv-cpufreq.c | 38 
> > ++
> >  1 file changed, 38 insertions(+)
> 
> Please merge these fixups with the first patch which is creating the driver.
> I understand that a different guy has created this patch and so wanted
> to have a patch on his name but its really difficult to review this
way.

Heh! Well, that wasn't the reason why this was sent out as a separate
patch, but never mind. Though I don't understand why it would be
difficult to review the patch though.

> Better add your signed-off in the first patch instead. Sending such changes
> once the driver is mainlined looks fine.

Sure, this makes sense.

> 
> > diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> > b/drivers/cpufreq/powernv-cpufreq.c
> > index 46bee8a..ef6ed8c 100644
> > --- a/drivers/cpufreq/powernv-cpufreq.c
> > +++ b/drivers/cpufreq/powernv-cpufreq.c
> > @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, 
> > unsigned long val)
> > BUG();
> >  }
> >
> > +/*
> > + * Computes the current frequency on this cpu
> > + * and stores the result in *ret_freq.
> > + */
> > +static void powernv_read_cpu_freq(void *ret_freq)
> > +{
> > +   unsigned long pmspr_val;
> > +   s8 local_pstate_id;
> > +   int *cur_freq, freq, pstate_id;
> > +
> > +   cur_freq = (int *)ret_freq;
> 
> You don't need cur_freq variable at all..

I don't like it either. But the compiler complains without this hack
:-(

> 
> > +   pmspr_val = get_pmspr(SPRN_PMSR);
> > +
> > +   /* The local pstate id corresponds bits 48..55 in the PMSR.
> > + * Note: Watch out for the sign! */
> > +   local_pstate_id = (pmspr_val >> 48) & 0xFF;
> > +   pstate_id = local_pstate_id;
> 
> similarly local_pstate_id

well, I am interested in the bits 48..55 of pmspr_val. That's the
pstate_id which can be negative. So I'll like to keep
local_pstate_id. 
> 
> > +
> > +   freq = pstate_id_to_freq(pstate_id);
> > +   pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n",
> > +   smp_processor_id(), pmspr_val, pstate_id, freq);
> > +   *cur_freq = freq;
> 
> Move above print here after a blank line. Also remove freq variable as
> well and use *cur_freq directly.. And then you can rename it to freq
as well.

We can get rid of freq and use *cur_freq in its place.


Thanks for the detailed review.

--
Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method

2014-03-21 Thread Gautham R Shenoy

On Fri, Mar 21, 2014 at 12:01:07PM +, David Laight wrote:
> From: Viresh Kumar
>  
> > On 21 March 2014 16:34, Gautham R Shenoy  wrote:
> > > Heh! Well, that wasn't the reason why this was sent out as a separate
> > > patch, but never mind. Though I don't understand why it would be
> > > difficult to review the patch though.
> > 
> > Because the initial driver wasn't complete earlier. There were 2-3 patches
> > after the first one which are changing what the first patch has added.
> > Nothing else :)
> > 
> > >> > +static void powernv_read_cpu_freq(void *ret_freq)
> > >> > +{
> > >> > +   unsigned long pmspr_val;
> > >> > +   s8 local_pstate_id;
> > >> > +   int *cur_freq, freq, pstate_id;
> > >> > +
> > >> > +   cur_freq = (int *)ret_freq;
> > >>
> > >> You don't need cur_freq variable at all..
> > >
> > > I don't like it either. But the compiler complains without this hack
> > > :-(
> > 
> > Why would the compiler warn for doing this?:
> > 
> > *(int *)ret_freq = freq;
> 
> Because it is very likely to be wrong.
> In general casts of pointers to integer types are dangerous.
> In this case why not make the function return the value?

Because this function is called via an smp_call_function(). And we
need a way of returning the value to the caller.

> 
>   David
> 
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method

2014-03-21 Thread Gautham R Shenoy

On Fri, Mar 21, 2014 at 05:15:34PM +0530, Viresh Kumar wrote:
> On 21 March 2014 16:34, Gautham R Shenoy  wrote:
> > Heh! Well, that wasn't the reason why this was sent out as a separate
> > patch, but never mind. Though I don't understand why it would be
> > difficult to review the patch though.
> 
> Because the initial driver wasn't complete earlier. There were 2-3 patches
> after the first one which are changing what the first patch has added.
> Nothing else :)
> 
> >> > +static void powernv_read_cpu_freq(void *ret_freq)
> >> > +{
> >> > +   unsigned long pmspr_val;
> >> > +   s8 local_pstate_id;
> >> > +   int *cur_freq, freq, pstate_id;
> >> > +
> >> > +   cur_freq = (int *)ret_freq;
> >>
> >> You don't need cur_freq variable at all..
> >
> > I don't like it either. But the compiler complains without this hack
> > :-(
> 
> Why would the compiler warn for doing this?:
> 
> *(int *)ret_freq = freq;

The compiler doesn't complain if I do this.
> 
> >> > +   pmspr_val = get_pmspr(SPRN_PMSR);
> >> > +
> >> > +   /* The local pstate id corresponds bits 48..55 in the PMSR.
> >> > + * Note: Watch out for the sign! */
> >> > +   local_pstate_id = (pmspr_val >> 48) & 0xFF;
> >> > +   pstate_id = local_pstate_id;
> >>
> >> similarly local_pstate_id
> >
> > well, I am interested in the bits 48..55 of pmspr_val. That's the
> > pstate_id which can be negative. So I'll like to keep
> > local_pstate_id.
> 
> Can you please explain at bit level how getting the value in a s8
> first and then into a s32 will help you here? Instead of getting it
> directly into s32.

Consider the case when pmspr = 0x00fe;

We are interested in extracting the value 'fe'. And ensure that when
we store this value into an int, we get the sign extension right.

So the following doesn't work: 

   pstate_id = (pmspr_val >> 48) & 0x;

Since we get pstate_id = 0x00fe, which is not the same as
0xfffe.

Of course, we could do the following, but I wasn't sure if
two levels of type casting helped on the readability front.

   pstate_id = (int) ((s8)((pmspr_val >> 48) & 0xFF));

--
Thanks and Regards
gautham.
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform

2014-03-21 Thread Gautham R Shenoy

Hi Viresh,

On Fri, Mar 21, 2014 at 04:24:27PM +0530, Viresh Kumar wrote:
> On 21 March 2014 16:13, Gautham R Shenoy  wrote:
> > On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote:
> 
> >> > +   pr_debug("PState id %d freq %d MHz\n", id, freq);
> >> > +   powernv_freqs[i].driver_data = i;
> >>
> >> I don't think you are using this field at all and this is the field you can
> >> use for driver_data and so you can get rid of powernv_pstate_ids[ ].
> >
> > Using driver_data to record powernv_pstate_ids won't work since
> > powernv_pstate_ids can be negative. So a pstate_id -3 can be confused
> > with CPUFREQ_BOOST_FREQ thereby not displaying the frequency
> > corresponding to pstate id -3. So for now I think we will be retaining
> > powernv_pstate_ids.
> 
> As I said earlier, this field is only used by cpufreq drivers and cpufreq core
> isn't supposed to touch it. CPUFREQ_BOOST_FREQ and other macros
> are there for .frequency field and not this one.
> 

Ok, I had based my code on linus's git tree. I checked the 'pm-cpufreq'
branch of Rafael's 'linux-pm' tree and freq_table.c contains the
following code snippet in show_available_frequencies:

if (show_boost ^ (table[i].driver_data == CPUFREQ_BOOST_FREQ))
continue;

I suspect we're talking about different code bases. So could you
please tell me which one should I be looking at ?

> 
> >> > +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
> >> > +{
> >> > +   int base, i;
> >> > +
> >> > +#ifdef CONFIG_SMP
> >>
> >> What will break if you don't have this ifdef here? Without that as well
> >> below code should work?
> 
> Missed this one?
> 
> >> > +   base = cpu_first_thread_sibling(policy->cpu);
> >> > +
> >> > +   for (i = 0; i < threads_per_core; i++)
> >> > +   cpumask_set_cpu(base + i, policy->cpus);
> >> > +#endif
> >> > +   policy->cpuinfo.transition_latency = 25000;
> >> > +
> >> > +   policy->cur = powernv_freqs[0].frequency;
> >> > +   cpufreq_frequency_table_get_attr(powernv_freqs, policy->cpu);
> >>
> >> This doesn't exist anymore.
> >
> > Didn't get this comment!
> 
> cpufreq_frequency_table_get_attr() routine doesn't exist anymore.
> 
> >> > +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> >> > +{
> >> > +   cpufreq_frequency_table_put_attr(policy->cpu);
> >> > +   return 0;
> >> > +}
> >>
> >> You don't need this..
> >
> > Why not ?
> 
> Because cpufreq_frequency_table_get_attr() and
> cpufreq_frequency_table_put_attr() don't exist anymore :)

Well, it does in the codebases that I was looking at. May be I've been
looking at the wrong place.

> 
> >> > +module_init(powernv_cpufreq_init);
> >> > +module_exit(powernv_cpufreq_exit);
> >>
> >> Place these right after the routines without any blank lines in
> > between.
> >
> > Is this the new convention ?
> 
> Don't know I have been following this since long time, probably from the
> time I started with Mainline stuff.. I have seen many maintainers advising 
> this
> as it make code more readable, specially if these routines are quite big..
> 
> Probably it isn't mentioned in coding guidelines but people follow it most of
> the times.

Ok. I was not aware of this. Good to know :-)
> 

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method

2014-03-21 Thread Gautham R Shenoy

On Fri, Mar 21, 2014 at 06:42:50PM +0530, Viresh Kumar wrote:
> On 21 March 2014 18:34, Gautham R Shenoy  wrote:
> > Consider the case when pmspr = 0x00fe;
> >
> > We are interested in extracting the value 'fe'. And ensure that when
> > we store this value into an int, we get the sign extension right.
> >
> > So the following doesn't work:
> >
> >pstate_id = (pmspr_val >> 48) & 0x;
> 
> What about pstate_id = (pmspr_val >> 48) & 0xFF; ??

Nope. 

Suppose the pmspr_val is contained in the register 
%rax, and pstate_id corresponds to the address -20(%rbp) then:

pstate_id = (pmspr_val >> 48) & 0xFF;

would corrspond to 

shrq$48, %rax// Left shift by 48 bits
andl$255, %eax   // Mask the lower 32 bits of %rax with 
0x00FF
movl%eax, -20(%rbp)  // Store the lower 32 bits of %rax
into pstate_id

On the other hand, 

   pstate_id = (int)((s8)((pmspr_val >> 48) & 0xFF));

would correspond to:

shrq$48, %rax  // Left shift by 48 bits.
movsbl  %al, %eax  // Move the lower 8 bits of %rax to %eax
  with sign-extension. 
movl%eax, -20(%rbp) // store the result in pstate_id;

So with this, we are getting the sign extension due to the use of movsbl.

And if local_pstate_id corresponds to the address -1(%rbp) then:

   local_pstate_id = (pmspr_val >> 48) & 0xFF;
   pstate_id = local_pstate_id;

would correspond to:

shrq$48, %rax// Left shift by 48 bits
movb%al, -1(%rbp)//copy the lower 8 bits to local_pstate_id
movsbl  -1(%rbp), %eax   //move the contents of
   local_pstate_id to %eax with sign extension.
movl%eax, -20(%rbp)  // Store the result in pstate_id

Hope this helps :-)

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform

2014-03-21 Thread Gautham R Shenoy

On Fri, Mar 21, 2014 at 04:24:27PM +0530, Viresh Kumar wrote:
> >> > +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
> >> > +{
> >> > +   int base, i;
> >> > +
> >> > +   base = cpu_first_thread_sibling(policy->cpu);
> >> > +
> >> > +   for (i = 0; i < threads_per_core; i++)
> >> > +   cpumask_set_cpu(base + i, policy->cpus);
> >> > +   policy->cpuinfo.transition_latency = 25000;
> >> > +
> >> > +   policy->cur = powernv_freqs[0].frequency;
> >> > +   cpufreq_frequency_table_get_attr(powernv_freqs, policy->cpu);
> >>
> >> This doesn't exist anymore.

Ok, I guess the right thing to do at this point is call

   cpufreq_table_validate_and_show(policy, powernv_freqs);

Will fix the code to take care of this.
--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method

2014-03-22 Thread Gautham R Shenoy

Hi Ben,

On Sat, Mar 22, 2014 at 09:56:30AM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2014-03-21 at 16:34 +0530, Gautham R Shenoy wrote:
> 
> > > >
> > > > +/*
> > > > + * Computes the current frequency on this cpu
> > > > + * and stores the result in *ret_freq.
> > > > + */
> > > > +static void powernv_read_cpu_freq(void *ret_freq)
> > > > +{
> > > > +   unsigned long pmspr_val;
> > > > +   s8 local_pstate_id;
> > > > +   int *cur_freq, freq, pstate_id;
> > > > +
> > > > +   cur_freq = (int *)ret_freq;
> > > 
> > > You don't need cur_freq variable at all..
> > 
> > I don't like it either. But the compiler complains without this hack
> > :-(
> 
> Casting integers into void * is a recipe for disaster... what is that
> supposed to be about ?

Like I mentioned elsewhere on this thread, we're calling
powernv_read_cpu_freq via an smp_call_function(). We use this to
obtain the frequency on the cpu where powernv_read_cpu_freq
executes and return it to the caller of smp_call_function.

> We lose all type checking and get exposed
> to endian issues etc... the day somebody uses a different type on both
> sides.
> 
Yes, I understand the problem now. I'll think of a safer way to pass
the return value.

> Also is "freq" a frequency ? In this case an int isn't big enough.

freq is the frequency stored in the cpufreq_table. The value is in
kHz. So, int should be big enough.

> Cheers,
> Ben.
> 
>

--
Thanks and Regards
gautham. 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v4] powernv: Dynamic Frequency Scaling Enablement

2014-03-26 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

This is the v4 of the patchset to enable Dynamic Frequency Scaling
on IBM PowerNV Platforms. I have incorporated the review comments
from the previous version (can be found at [1]). 

In this version, 

  * Multiple patches from the previous version have been into a single
patch, since the higher numbered patches implemented some helper
functions and the driver methods which should have been a part of
the first patch to begin with.

  * Use the generic helpers provided by the cpufreq core available in
the latest linux-next tree.

  * Fix the code to avoid casting integer pointers to void.

The patch is based on top of the commit:06ed26d1de59ce7cbbe68378b7e470be169750e5
of the linux-next tree.


[1]: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg76675.html

Vaidyanathan Srinivasan (1):
  powernv, cpufreq: cpufreq driver for powernv platform

 arch/powerpc/configs/pseries_defconfig|   1 +
 arch/powerpc/configs/pseries_le_defconfig |   1 +
 arch/powerpc/include/asm/reg.h|   4 +
 arch/powerpc/platforms/powernv/Kconfig|   6 +
 drivers/cpufreq/Kconfig.powerpc   |   8 +
 drivers/cpufreq/Makefile  |   1 +
 drivers/cpufreq/powernv-cpufreq.c | 372 ++
 7 files changed, 393 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform

2014-03-26 Thread Gautham R. Shenoy

From: Vaidyanathan Srinivasan 

Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.

The code was originally written by sva...@linux.vnet.ibm.com. Over
time it was modified to accomodate bug-fixes as well as updates to the
the cpu-freq core. Relevant portions of the change logs corresponding
to those modifications are noted below:

 * The policy->cpus needs to be populated in a hotplug-invariant
   manner instead of using cpu_sibling_mask() which varies with
   cpu-hotplug. This is because the cpufreq core code copies this
   content into policy->related_cpus mask which is should not vary on
   cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com]

 * On POWER systems, the CPU frequency is controlled at a core-level
   and hence we need to serialize so that only one of the threads in
   the core switches the core's frequency at a time. Introduce
   per-core locking to enable finer-grained synchronization and
   thereby enhance the speed and responsiveness of the cpufreq driver
   to varying workload demands.

 The design of per-core locking is very simple and
   straight-forward: we first define a Per-CPU lock and use the ones
   that belongs to the first thread sibling of the core.

 cpu_first_thread_sibling() macro is used to find the *common*
   lock for all thread siblings belonging to a core. [Authored by
   srivatsa.b...@linux.vnet.ibm.com]

 * Create a helper routine that can return the cpu-frequency for the
   corresponding pstate_id. Also, cache the values of the pstate_max,
   pstate_min and pstate_nominal and nr_pstates in a static structure
   so that they can be reused in the future to perform any
   validations. [Authored by e...@linux.vnet.ibm.com]

 * Create a driver attribute named cpuinfo_nominal_freq which creates
   a sysfs read-only file named cpuinfo_nominal_freq. Export the
   frequency corresponding to the nominal_pstate through this
   interface.

 Nominal frequency is the highest non-turbo frequency for the
   platform.  This is generally used for setting governor policies
   from user space for optimal energy efficiency. [Authored by
   e...@linux.vnet.ibm.com]

 * Implement a powernv_cpufreq_get(unsigned int cpu) method which will
   return the current operating frequency. Export this via the sysfs
   interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to
   powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com]

[Change log updated by e...@linux.vnet.ibm.com]

Reviewed-by: Preeti U Murthy 
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Anton Blanchard 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/configs/pseries_defconfig|   1 +
 arch/powerpc/configs/pseries_le_defconfig |   1 +
 arch/powerpc/include/asm/reg.h|   4 +
 arch/powerpc/platforms/powernv/Kconfig|   6 +
 drivers/cpufreq/Kconfig.powerpc   |   8 +
 drivers/cpufreq/Makefile  |   1 +
 drivers/cpufreq/powernv-cpufreq.c | 372 ++
 7 files changed, 393 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index e9a8b4e..a285d44 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -353,3 +353,4 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
 CONFIG_VIRTUALIZATION=y
 CONFIG_KVM_BOOK3S_64=m
 CONFIG_KVM_BOOK3S_64_HV=y
+CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
diff --git a/arch/powerpc/configs/pseries_le_defconfig 
b/arch/powerpc/configs/pseries_le_defconfig
index 62771e0..47e6161 100644
--- a/arch/powerpc/configs/pseries_le_defconfig
+++ b/arch/powerpc/configs/pseries_le_defconfig
@@ -350,3 +350,4 @@ CONFIG_CRYPTO_LZO=m
 # CONFIG_CRYPTO_ANSI_CPRNG is not set
 CONFIG_CRYPTO_DEV_NX=y
 CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
+CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
+#define SPRN_PMICR 0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR  0x355   /* Power Management Status Reg */
+#define SPRN_PMCR  0x374   /* Power Management Control Regist

Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform

2014-03-27 Thread Gautham R Shenoy

On Thu, Mar 27, 2014 at 12:09:53PM +0530, Viresh Kumar wrote:
> Cc'ing Rafael.
>

Thanks. It was a lapse on my part by not Cc'ing Rafael in the first
place.
 
> On 26 March 2014 22:25, Gautham R. Shenoy  wrote:
> > From: Vaidyanathan Srinivasan 
> >
> > Backend driver to dynamically set voltage and frequency on
> > IBM POWER non-virtualized platforms.  Power management SPRs
> > are used to set the required PState.
> >
> > This driver works in conjunction with cpufreq governors
> > like 'ondemand' to provide a demand based frequency and
> > voltage setting on IBM POWER non-virtualized platforms.
> >
> > PState table is obtained from OPAL v3 firmware through device
> > tree.
> >
> > powernv_cpufreq back-end driver would parse the relevant device-tree
> > nodes and initialise the cpufreq subsystem on powernv platform.
> >
> > The code was originally written by sva...@linux.vnet.ibm.com. Over
> > time it was modified to accomodate bug-fixes as well as updates to the
> > the cpu-freq core. Relevant portions of the change logs corresponding
> > to those modifications are noted below:
> >
> >  * The policy->cpus needs to be populated in a hotplug-invariant
> >manner instead of using cpu_sibling_mask() which varies with
> >cpu-hotplug. This is because the cpufreq core code copies this
> >content into policy->related_cpus mask which is should not vary on
> 
> s/is /

Good catch! Will fix this

[...]
> 
> >  * On POWER systems, the CPU frequency is controlled at a core-level
> >and hence we need to serialize so that only one of the threads in
> >the core switches the core's frequency at a time. Introduce
> >per-core locking to enable finer-grained synchronization and
> >thereby enhance the speed and responsiveness of the cpufreq driver
> >to varying workload demands.
> >
> >  The design of per-core locking is very simple and
> >straight-forward: we first define a Per-CPU lock and use the ones
> >that belongs to the first thread sibling of the core.
> >
> >  cpu_first_thread_sibling() macro is used to find the *common*
> >lock for all thread siblings belonging to a core. [Authored by
> >srivatsa.b...@linux.vnet.ibm.com]
> 
> We don't need that after serialization patch of core is accepted. And it
> should be accepted soon, in one form or other.

As of now, I prefer this patch be based on code that is in the -next
tree. I'll get rid of the per-core locking once the serialization
patch of the core is accepted.

[...]
> > --- a/arch/powerpc/configs/pseries_defconfig
> > +++ b/arch/powerpc/configs/pseries_defconfig
> > @@ -353,3 +353,4 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
> >  CONFIG_VIRTUALIZATION=y
> >  CONFIG_KVM_BOOK3S_64=m
> >  CONFIG_KVM_BOOK3S_64_HV=y
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> > diff --git a/arch/powerpc/configs/pseries_le_defconfig 
> > b/arch/powerpc/configs/pseries_le_defconfig
> > index 62771e0..47e6161 100644
> > --- a/arch/powerpc/configs/pseries_le_defconfig
> > +++ b/arch/powerpc/configs/pseries_le_defconfig
> > @@ -350,3 +350,4 @@ CONFIG_CRYPTO_LZO=m
> >  # CONFIG_CRYPTO_ANSI_CPRNG is not set
> >  CONFIG_CRYPTO_DEV_NX=y
> >  CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> 
> don't know how Rafael want this, but probably this part could have gone
> through ppc tree..
> 

That would mean multiple patches :-)

> > +
> > +#define pr_fmt(fmt)"powernv-cpufreq: " fmt
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> 
> I thought I gave a comment on missing headers here?

Ok, so these seem to be the missing ones.

#include 
#include 
#include 
#include 
#include 

#include 

Have I missed anything else ?


> > +static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
> > +static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
> 
> I though we don't need this anymore? Either Rafael will take my patch as
> is for the BOOST fixup or we will end up creating .isboost field in
> struct cpufreq_frequency_table and so you could have used .driver_data here.
> 

I mentioned in my reply to your BOOST fixup patch that we would like
pstate_ids to be of the type "signed int", while the .driver_data is
an "unsigned int". If we end up using .driver_data, we would have to
be careful and not use it directly but reassign it to a signed int (or
typecast it) before using it.

Not just that, the pr_debugs in the core which are printed during
fr

Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform

2014-03-27 Thread Gautham R Shenoy

On Thu, Mar 27, 2014 at 03:29:36PM +0530, Viresh Kumar wrote:
> On 27 March 2014 15:00, Gautham R Shenoy  wrote:
> > As of now, I prefer this patch be based on code that is in the -next
> > tree. I'll get rid of the per-core locking once the serialization
> > patch of the core is accepted.
> 
> Okay.. Then divide this patch into two parts, second one doing all
> the serialization stuff and I will review only the first one from V5 :)
> 
> Maybe mention that in Patch2 as well, that once serialization patches
> are accepted, its not required.

Sure. This would be a good idea.

> 
> >> don't know how Rafael want this, but probably this part could have gone
> >> through ppc tree..
> >
> > That would mean multiple patches :-)
> 
> I wasn't opposed to multiple patches at all, but multiple patches for
> basic functionality of same driver file. Otherwise separating things
> into patches is the best receipe.

Fair enough!

> 
> >> > +#define pr_fmt(fmt)"powernv-cpufreq: " fmt
> >> > +
> >> > +#include 
> >> > +#include 
> >> > +#include 
> >> > +#include 
> >> > +#include 
> >>
> >> I thought I gave a comment on missing headers here?
> >
> > Ok, so these seem to be the missing ones.
> >
> > #include 
> > #include 
> > #include 
> 
> This may shift to Patch 2 ..

Ok.

> 
> > #include 
> > #include 
> >
> > #include 
> >
> > Have I missed anything else ?
> 
> Not sure :) .. I will see if I can find anymore..
>

Cool.
 
> >> > +static struct cpufreq_frequency_table 
> >> > powernv_freqs[POWERNV_MAX_PSTATES+1];
> >> > +static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
> >>
> >> I though we don't need this anymore? Either Rafael will take my patch as
> >> is for the BOOST fixup or we will end up creating .isboost field in
> >> struct cpufreq_frequency_table and so you could have used .driver_data 
> >> here.
> >>
> >
> > I mentioned in my reply to your BOOST fixup patch that we would like
> > pstate_ids to be of the type "signed int", while the .driver_data is
> > an "unsigned int". If we end up using .driver_data, we would have to
> > be careful and not use it directly but reassign it to a signed int (or
> > typecast it) before using it.
> 
> Probably many people would want it to be unsigned it, so that they
> can put some strange hex values there..
> 
> Which part of your code conflicts with this right now? I can see only
> this line in powernv_set_freq()

I am not denying the advantage of storing pstate_ids in .driver_data
since we will then have all the information in one place.
(That said, in the future if we want to export additional
information, say pertaining to the voltage, we will have to keep a
separate array anyway :-) )

However, as of now, I am wary about reusing a variable provided by
the core, whose type is fixed (which is not the type I am interested
in) and whose interpretation is ambiguous (since it is also being used
as a BOOST indicator).

When we have a .is_boost field, or a flags field in the cpufreq_table,
and once the .driver_data field is completely opaque, I would be
comfortable making use of .driver_data to store the pstate_ids. So
I'll send out those patches once the fixes are present in the
core. For now, let this code stay as it is. I think we should be ok
with the additional overhead of 1kb as of now.

> 
> freq_data.pstate_id = powernv_pstate_ids[new_index];
> 
> And I don't think we will have side effects here.
> 
> > Not just that, the pr_debugs in the core which are printed during
> > frequency transitions will then end up printing large positive values
> > (since it will interpret negative pstate_ids as unsigned int) in the
> > place of pstate_ids, which would not be particularly useful.
> 
> I have checked it again, those prints shouldn't have used that field.
> Will fix them.
> 
> And so you can use that field in your code and I will get core fixed
> for you..

Like I said earlier, once the core is fixed, I'll be comfortable using
the .driver_data field for the purpose I need. Not until then :-)

> 
> >> Why do you need to get these from DT? And not find that yourself here 
> >> instead?
> >
> > Well, I think we can obtain a more accurate value from the DT which
> > would have already computed it. But I'll let vaidy answer this.
> 
> Can we simply refer to frequency values here for finding min and max
> pstates? If yes, then probably this is the better place as there can be
> h

[PATCH v5 0/3] powernv,cpufreq: Dynamic Frequency Scaling support

2014-03-28 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

This is v5 of the patchset to enable dynamic frequency scaling on IBM
PowerNV platforms. This patchset does address all the review comments
obtained for v4 (which can be found at [1]).

Changes from v4:

* Created a separate patch to select the CPUFreq related Config
  options for PowerNV

* Dropped the per-core locking hunks in
  drivers/cpufreq/powernv-cpufreq.c since the CPUFreq core takes
  care of the for us after the following commit which is present in
  linux-next:

commit 12478cf0c55e5969f740bb38a24b1a0104ae18d8   
Author: Srivatsa S. Bhat 
Date:   Mon Mar 24 13:35:44 2014 +0530

cpufreq: Make sure frequency transitions are serialized

* [PATCH v5 3/3] gets rid of the powernv_pstate_ids[] array that
  was being used to record the pstate ids. After the following
  patch it is safe to use cpufreq_frequency_table.driver_data
  since it is opaque to the cpufreq core: 

From: Viresh Kumar 
Date: 2014-03-28 13:53:47
url: http://marc.info/?l=linux-pm&m=139601416804702&w=2

cpufreq: create another field .flags in cpufreq_frequency_table

The patchset is based on the commit
201544be8c37dffbf069bb5fc9edb5674f8c1754 of the linux-next tree.

While all the patches in the patchset apply cleanly on linux-next,
[PATCH v5 3/3] requires the Viresh's patch that was mentioned
above. Otherwise, the frequency corresponding to pstate id -3 will be
omited while reporting the "scaling_available_frequencies" in sysfs.

[1]: http://marc.info/?l=linux-pm&m=139585297620612&w=2

Gautham R. Shenoy (2):
  powernv, cpufreq: Select CPUFreq related Kconfig options for powernv
  powernv,cpufreq: Use cpufreq_frequency_table.driver_data to store
pstate ids

Vaidyanathan Srinivasan (1):
  powernv, cpufreq: cpufreq driver for powernv platform

 arch/powerpc/configs/pseries_defconfig|   1 +
 arch/powerpc/configs/pseries_le_defconfig |   1 +
 arch/powerpc/include/asm/reg.h|   4 +
 arch/powerpc/platforms/powernv/Kconfig|   6 +
 drivers/cpufreq/Kconfig.powerpc   |   8 +
 drivers/cpufreq/Makefile  |   1 +
 drivers/cpufreq/powernv-cpufreq.c | 341 ++
 7 files changed, 362 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v5 2/3] powernv, cpufreq: cpufreq driver for powernv platform

2014-03-28 Thread Gautham R. Shenoy

From: Vaidyanathan Srinivasan 

Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.

The code was originally written by sva...@linux.vnet.ibm.com. Over
time it was modified to accomodate bug-fixes as well as updates to the
the cpu-freq core. Relevant portions of the change logs corresponding
to those modifications are noted below:

 * The policy->cpus needs to be populated in a hotplug-invariant
   manner instead of using cpu_sibling_mask() which varies with
   cpu-hotplug. This is because the cpufreq core code copies this
   content into policy->related_cpus mask which should not vary on
   cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com]

 * Create a helper routine that can return the cpu-frequency for the
   corresponding pstate_id. Also, cache the values of the pstate_max,
   pstate_min and pstate_nominal and nr_pstates in a static structure
   so that they can be reused in the future to perform any
   validations. [Authored by e...@linux.vnet.ibm.com]

 * Create a driver attribute named cpuinfo_nominal_freq which creates
   a sysfs read-only file named cpuinfo_nominal_freq. Export the
   frequency corresponding to the nominal_pstate through this
   interface.

 Nominal frequency is the highest non-turbo frequency for the
   platform.  This is generally used for setting governor policies
   from user space for optimal energy efficiency. [Authored by
   e...@linux.vnet.ibm.com]

 * Implement a powernv_cpufreq_get(unsigned int cpu) method which will
   return the current operating frequency. Export this via the sysfs
   interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to
   powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com]

[Change log updated by e...@linux.vnet.ibm.com]

Reviewed-by: Preeti U Murthy 
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Anton Blanchard 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/reg.h|   4 +
 drivers/cpufreq/Kconfig.powerpc   |   8 +
 drivers/cpufreq/Makefile  |   1 +
 drivers/cpufreq/powernv-cpufreq.c | 342 ++
 4 files changed, 355 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1a36b8e..2189f8f 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
+#define SPRN_PMICR 0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR  0x355   /* Power Management Status Reg */
+#define SPRN_PMCR  0x374   /* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG8   /* Enable Target Address Register */
 #define FSCR_EBB_LG7   /* Enable Event Based Branching */
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..72564b7 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,11 @@ config PPC_PASEMI_CPUFREQ
help
  This adds the support for frequency switching on PA Semi
  PWRficient processors.
+
+config POWERNV_CPUFREQ
+   tristate "CPU frequency scaling for IBM POWERNV platform"
+   depends on PPC_POWERNV
+   default y
+   help
+This adds support for CPU frequency switching on IBM POWERNV
+platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)  += pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)   += pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)  += powernv-cpufreq.o
 
 
##
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 000..e1e5197
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,342 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Autho

[PATCH v5 1/3] powernv, cpufreq: Select CPUFreq related Kconfig options for powernv

2014-03-28 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Enable CPUFreq for PowerNV. Select "performance", "powersave",
"userspace" and "ondemand" governors. Choose "ondemand" to be the
default governor.

Signed-off-by: Gautham R. Shenoy 
Signed-off-by: Srivatsa S. Bhat 
---
 arch/powerpc/configs/pseries_defconfig| 1 +
 arch/powerpc/configs/pseries_le_defconfig | 1 +
 arch/powerpc/platforms/powernv/Kconfig| 6 ++
 3 files changed, 8 insertions(+)

diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index 9ea8342b..a905063 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -306,3 +306,4 @@ CONFIG_KVM_BOOK3S_64=m
 CONFIG_KVM_BOOK3S_64_HV=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
+CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
diff --git a/arch/powerpc/configs/pseries_le_defconfig 
b/arch/powerpc/configs/pseries_le_defconfig
index 3c84f9d..58e3dbf 100644
--- a/arch/powerpc/configs/pseries_le_defconfig
+++ b/arch/powerpc/configs/pseries_le_defconfig
@@ -301,3 +301,4 @@ CONFIG_CRYPTO_LZO=m
 # CONFIG_CRYPTO_ANSI_CPRNG is not set
 CONFIG_CRYPTO_DEV_NX=y
 CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
+CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 895e8a2..c252ee9 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -11,6 +11,12 @@ config PPC_POWERNV
select PPC_UDBG_16550
select PPC_SCOM
select ARCH_RANDOM
+   select CPU_FREQ
+   select CPU_FREQ_GOV_PERFORMANCE
+   select CPU_FREQ_GOV_POWERSAVE
+   select CPU_FREQ_GOV_USERSPACE
+   select CPU_FREQ_GOV_ONDEMAND
+   select CPU_FREQ_GOV_CONSERVATIVE
default y
 
 config PPC_POWERNV_RTAS
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v5 3/3] powernv, cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids

2014-03-28 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

The .driver_data field in the cpufreq_frequency_table was supposed to
be private to the drivers. However at some later point, it was being
used to indicate if the particular frequency in the table is the
BOOST_FREQUENCY. After patches [1] and [2], the .driver_data is once
again private to the driver. Thus we can safely use
cpufreq_frequency_table.driver_data to store pstate_ids instead of
having to maintain a separate array powernv_pstate_ids[] for this
purpose.

[1]:
  Subject: cpufreq: don't print value of .driver_data from core
  From   : Viresh Kumar 
  url: http://marc.info/?l=linux-pm&m=139601421504709&w=2

[2]:
  Subject: cpufreq: create another field .flags in cpufreq_frequency_table
  From   : Viresh Kumar 
  url: http://marc.info/?l=linux-pm&m=139601416804702&w=2

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index e1e5197..9edccc6 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -33,7 +33,6 @@
 #define POWERNV_MAX_PSTATES256
 
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
-static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
 
 /*
  * Note: The set of pstates consists of contiguous integers, the
@@ -112,7 +111,7 @@ static int init_powernv_pstates(void)
 
pr_debug("PState id %d freq %d MHz\n", id, freq);
powernv_freqs[i].frequency = freq * 1000; /* kHz */
-   powernv_pstate_ids[i] = id;
+   powernv_freqs[i].driver_data = id;
}
/* End of list marker entry */
powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
@@ -283,7 +282,7 @@ static int powernv_cpufreq_target_index(struct 
cpufreq_policy *policy,
 {
struct powernv_smp_call_data freq_data;
 
-   freq_data.pstate_id = powernv_pstate_ids[new_index];
+   freq_data.pstate_id = powernv_freqs[new_index].driver_data;
 
/*
 * Use smp_call_function to send IPI and execute the
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH REPOST v5 2/3] powernv, cpufreq: cpufreq driver for powernv platform

2014-04-01 Thread Gautham R. Shenoy

From: Vaidyanathan Srinivasan 

Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.

The code was originally written by sva...@linux.vnet.ibm.com. Over
time it was modified to accomodate bug-fixes as well as updates to the
the cpu-freq core. Relevant portions of the change logs corresponding
to those modifications are noted below:

 * The policy->cpus needs to be populated in a hotplug-invariant
   manner instead of using cpu_sibling_mask() which varies with
   cpu-hotplug. This is because the cpufreq core code copies this
   content into policy->related_cpus mask which should not vary on
   cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com]

 * Create a helper routine that can return the cpu-frequency for the
   corresponding pstate_id. Also, cache the values of the pstate_max,
   pstate_min and pstate_nominal and nr_pstates in a static structure
   so that they can be reused in the future to perform any
   validations. [Authored by e...@linux.vnet.ibm.com]

 * Create a driver attribute named cpuinfo_nominal_freq which creates
   a sysfs read-only file named cpuinfo_nominal_freq. Export the
   frequency corresponding to the nominal_pstate through this
   interface.

 Nominal frequency is the highest non-turbo frequency for the
   platform.  This is generally used for setting governor policies
   from user space for optimal energy efficiency. [Authored by
   e...@linux.vnet.ibm.com]

 * Implement a powernv_cpufreq_get(unsigned int cpu) method which will
   return the current operating frequency. Export this via the sysfs
   interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to
   powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com]

[Change log updated by e...@linux.vnet.ibm.com]

Reviewed-by: Preeti U Murthy 
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Anton Blanchard 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/reg.h|   4 +
 drivers/cpufreq/Kconfig.powerpc   |   8 +
 drivers/cpufreq/Makefile  |   1 +
 drivers/cpufreq/powernv-cpufreq.c | 342 ++
 4 files changed, 355 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1a36b8e..2189f8f 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
+#define SPRN_PMICR 0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR  0x355   /* Power Management Status Reg */
+#define SPRN_PMCR  0x374   /* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG8   /* Enable Target Address Register */
 #define FSCR_EBB_LG7   /* Enable Event Based Branching */
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..72564b7 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,11 @@ config PPC_PASEMI_CPUFREQ
help
  This adds the support for frequency switching on PA Semi
  PWRficient processors.
+
+config POWERNV_CPUFREQ
+   tristate "CPU frequency scaling for IBM POWERNV platform"
+   depends on PPC_POWERNV
+   default y
+   help
+This adds support for CPU frequency switching on IBM POWERNV
+platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)  += pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)   += pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)  += powernv-cpufreq.o
 
 
##
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 000..e1e5197
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,342 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Autho

[PATCH REPOST v5 3/3] powernv, cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids

2014-04-01 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

The .driver_data field in the cpufreq_frequency_table was supposed to
be private to the drivers. However at some later point, it was being
used to indicate if the particular frequency in the table is the
BOOST_FREQUENCY. After patches [1] and [2], the .driver_data is once
again private to the driver. Thus we can safely use
cpufreq_frequency_table.driver_data to store pstate_ids instead of
having to maintain a separate array powernv_pstate_ids[] for this
purpose.

[1]:
  Subject: cpufreq: don't print value of .driver_data from core
  From   : Viresh Kumar 
  url: http://marc.info/?l=linux-pm&m=139601421504709&w=2

[2]:
  Subject: cpufreq: create another field .flags in cpufreq_frequency_table
  From   : Viresh Kumar 
  url: http://marc.info/?l=linux-pm&m=139601416804702&w=2

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpufreq/powernv-cpufreq.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index e1e5197..9edccc6 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -33,7 +33,6 @@
 #define POWERNV_MAX_PSTATES256
 
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
-static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
 
 /*
  * Note: The set of pstates consists of contiguous integers, the
@@ -112,7 +111,7 @@ static int init_powernv_pstates(void)
 
pr_debug("PState id %d freq %d MHz\n", id, freq);
powernv_freqs[i].frequency = freq * 1000; /* kHz */
-   powernv_pstate_ids[i] = id;
+   powernv_freqs[i].driver_data = id;
}
/* End of list marker entry */
powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
@@ -283,7 +282,7 @@ static int powernv_cpufreq_target_index(struct 
cpufreq_policy *policy,
 {
struct powernv_smp_call_data freq_data;
 
-   freq_data.pstate_id = powernv_pstate_ids[new_index];
+   freq_data.pstate_id = powernv_freqs[new_index].driver_data;
 
/*
 * Use smp_call_function to send IPI and execute the
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH REPOST v5 1/3] powernv, cpufreq: Select CPUFreq related Kconfig options for powernv

2014-04-01 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Enable CPUFreq for PowerNV. Select "performance", "powersave",
"userspace" and "ondemand" governors. Choose "ondemand" to be the
default governor.

Signed-off-by: Gautham R. Shenoy 
Signed-off-by: Srivatsa S. Bhat 
---
 arch/powerpc/configs/pseries_defconfig| 1 +
 arch/powerpc/configs/pseries_le_defconfig | 1 +
 arch/powerpc/platforms/powernv/Kconfig| 6 ++
 3 files changed, 8 insertions(+)

diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index 9ea8342b..a905063 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -306,3 +306,4 @@ CONFIG_KVM_BOOK3S_64=m
 CONFIG_KVM_BOOK3S_64_HV=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
+CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
diff --git a/arch/powerpc/configs/pseries_le_defconfig 
b/arch/powerpc/configs/pseries_le_defconfig
index 3c84f9d..58e3dbf 100644
--- a/arch/powerpc/configs/pseries_le_defconfig
+++ b/arch/powerpc/configs/pseries_le_defconfig
@@ -301,3 +301,4 @@ CONFIG_CRYPTO_LZO=m
 # CONFIG_CRYPTO_ANSI_CPRNG is not set
 CONFIG_CRYPTO_DEV_NX=y
 CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
+CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 895e8a2..c252ee9 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -11,6 +11,12 @@ config PPC_POWERNV
select PPC_UDBG_16550
select PPC_SCOM
select ARCH_RANDOM
+   select CPU_FREQ
+   select CPU_FREQ_GOV_PERFORMANCE
+   select CPU_FREQ_GOV_POWERSAVE
+   select CPU_FREQ_GOV_USERSPACE
+   select CPU_FREQ_GOV_ONDEMAND
+   select CPU_FREQ_GOV_CONSERVATIVE
default y
 
 config PPC_POWERNV_RTAS
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] book3s_hv: Handle H_DOORBELL on the guest exit path

2015-10-14 Thread Gautham R. Shenoy

Currently a CPU running a guest can receive a H_DOORBELL in the
following two cases:
1) When the CPU is napping due to CEDE or there not being a guest
vcpu.
2) The CPU is running the guest vcpu.

Case 1), the doorbell message is not cleared since we were waking up
from nap. Hence when the EE bit gets set on transition from guest to
host, the H_DOORBELL interrupt is delivered to the host and the
corresponding handler is invoked.

However in Case 2), the message gets cleared by the action of taking
the H_DOORBELL interrupt. Since the CPU was running a guest, instead
of invoking the doorbell handler, the code invokes the second-level
interrupt handler to switch the context from the guest to the host. At
this point the setting of the EE bit doesn't result in the CPU getting
the doorbell interrupt since it has already been delivered once. So,
the handler for this doorbell is never invoked!

This causes softlockups if the missed DOORBELL was as IPI sent from a
sibling subcore CPU.

This patch fixes it by explitly invoking the doorbell handler on the
exit path if the exit reason is H_DOORBELL similar to the way an
EXTERNAL interrupt is handled. Since this will also handle Case 1), we
can unconditionally clear the doorbell message in
kvmppc_check_wake_reason.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b98889e..106c7f9 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -150,6 +150,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
cmpwi   cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
beq 11f
+   cmpwi   r12, BOOK3S_INTERRUPT_H_DOORBELL
+   beq 15f /* Invoke the H_DOORBELL handler */
cmpwi   cr2, r12, BOOK3S_INTERRUPT_HMI
beq cr2, 14f/* HMI check */
 
@@ -174,6 +176,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
mtspr   SPRN_HSRR1, r7
b   hmi_exception_after_realmode
 
+15:mtspr SPRN_HSRR0, r8
+   mtspr SPRN_HSRR1, r7
+   ba0xe80
+
 kvmppc_primary_no_guest:
/* We handle this much like a ceded vcpu */
/* put the HDEC into the DEC, since HDEC interrupts don't wake us */
@@ -2436,14 +2442,19 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
/* hypervisor doorbell */
 3: li  r12, BOOK3S_INTERRUPT_H_DOORBELL
+
+   /*
+* Clear the doorbell as we will invoke the handler
+* explicitly in the guest exit path.
+*/
+   lis r6, (PPC_DBELL_SERVER << (63-36))@h
+   PPC_MSGCLR(6)
/* see if it's a host IPI */
li  r3, 1
lbz r0, HSTATE_HOST_IPI(r13)
cmpwi   r0, 0
bnelr
-   /* if not, clear it and return -1 */
-   lis r6, (PPC_DBELL_SERVER << (63-36))@h
-   PPC_MSGCLR(6)
+   /* if not, return -1 */
li  r3, -1
blr
 
-- 
1.9.3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RESEND v4 3/4] cpufreq: powernv: Add a trace print for the throttle event

2016-01-12 Thread Gautham R Shenoy

Hi Shilpa,

Just saw this resend!

On Tue, Jan 12, 2016 at 04:24:26AM -0600, Shilpasri G Bhat wrote:
> Record the throttle event with a trace print replacing the printk,
> except for events like throttling below nominal and occ reset
> event which print a warning message.
> 
> Signed-off-by: Shilpasri G Bhat 
> ---

[..snip..]

> 
> -static void powernv_cpufreq_throttle_check(void *data)
> +static void powernv_cpufreq_check_pmax(void)
   ^^^
This function only contains code moved from
powernv_cpufreq_throttle_check with pr_crit/pr_warns replaced by
trace_powernv_throttle. Furthermore, it is not called from any other
place. Given that the original function was ~60 lines do we really
need to split it into two separate functions ? If yes, could it be an
inline function ?

>  {
>   unsigned int cpu = smp_processor_id();
>   unsigned int chip_id = pir_to_chip_id(hard_smp_processor_id());
> - unsigned long pmsr;
>   int pmsr_pmax, i;
> 
> - pmsr = get_pmspr(SPRN_PMSR);
> + pmsr_pmax = (s8)PMSR_MAX(get_pmspr(SPRN_PMSR));
> 
>   for (i = 0; i < nr_chips; i++)
>   if (chips[i].id == chip_id)
>   break;
> 
> - /* Check for Pmax Capping */
> - pmsr_pmax = (s8)PMSR_MAX(pmsr);
>   if (pmsr_pmax != powernv_pstate_info.max) {
>   if (chips[i].throttled)
> - goto next;
> + return;
> +
>   chips[i].throttled = true;
>   if (pmsr_pmax < powernv_pstate_info.nominal)
> - pr_crit("CPU %d on Chip %u has Pmax reduced below 
> nominal frequency (%d < %d)\n",
> - cpu, chips[i].id, pmsr_pmax,
> - powernv_pstate_info.nominal);
> - else
> - pr_info("CPU %d on Chip %u has Pmax reduced below turbo 
> frequency (%d < %d)\n",
> - cpu, chips[i].id, pmsr_pmax,
> - powernv_pstate_info.max);
> + pr_warn_once("CPU %d on Chip %u has Pmax reduced below 
> nominal frequency (%d < %d)\n",
> +  cpu, chips[i].id, pmsr_pmax,
> +  powernv_pstate_info.nominal);
> +
> + trace_powernv_throttle(chips[i].id,
> +throttle_reason[chips[i].throt_reason],
> +pmsr_pmax);
>   } else if (chips[i].throttled) {
>   chips[i].throttled = false;
> - pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
> - chips[i].id, pmsr_pmax);
> + trace_powernv_throttle(chips[i].id,
> +throttle_reason[chips[i].throt_reason],
> +pmsr_pmax);
>  }
> +}
> +
> +static void powernv_cpufreq_throttle_check(void *data)
> +{
> + unsigned long pmsr;
> +
> + pmsr = get_pmspr(SPRN_PMSR);
> +
> + /* Check for Pmax Capping */
> + powernv_cpufreq_check_pmax();
  
If you want to retain this function, you could pass pmsr as an
argument instead of computing it afresh in
powernv_cpufreq_check_pmax()

>   /* Check if Psafe_mode_active is set in PMSR. */
> -next:
>   if (pmsr & PMSR_PSAFE_ENABLE) {
>   throttled = true;
>   pr_info("Pstate set to safe frequency\n");

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RESEND v4 4/4] cpufreq: powernv: Add sysfs attributes to show throttle stats

2016-01-12 Thread Gautham R Shenoy

Hi Shilpa,

On Tue, Jan 12, 2016 at 04:24:27AM -0600, Shilpasri G Bhat wrote:

> +static inline int get_chip_index(struct kobject *kobj)

Probably have "get_chip_index(int id)". See the reason below.
> +{
> + int i, id;
> +
> + i = kstrtoint(kobj->name + 4, 0, &id);
> + if (i)
> + return i;
> +
> + for (i = 0; i < nr_chips; i++)
> + if (chips[i].id == id)
> + return i;

This pattern to obtain a chip index from the chip id is repeated in
multiple place inside this file. Might be worthwhile to move this to a
helper function, i.e get_chip_index(id)!

> + return -EINVAL;
> +}
> +
> +static ssize_t throttle_freq_show(struct kobject *kobj,
> +   struct kobj_attribute *attr, char *buf)
> +{
> + int i, count = 0, id;
> + 
We obtain the id from kobj here and then obtain the index from
id via the function below.


> + id = get_chip_index(kobj);


> + if (id < 0)
> + return id;
> +
> + for (i = 0; i < powernv_pstate_info.nr_pstates; i++)
> + count += sprintf(&buf[count], "%d %d\n",
> +powernv_freqs[i].frequency,
> +chips[id].pstate_stat[i]);
> +
> + return count;
> +}
> +
> +static struct kobj_attribute attr_throttle_frequencies =
> +__ATTR(throttle_frequencies, 0444, throttle_freq_show, NULL);
> +
--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v5 5/5] cpufreq: powernv: Add sysfs attributes to show throttle stats

2016-01-21 Thread Gautham R Shenoy

On Thu, Jan 21, 2016 at 03:08:59PM +0530, Shilpasri G Bhat wrote:
> Signed-off-by: Shilpasri G Bhat 
Reviewed-by: Gautham R. Shenoy 

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v6 2/5] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path

2016-01-24 Thread Gautham R Shenoy

Hello Balbir,

On Sat, Jan 23, 2016 at 07:59:20PM +1100, Balbir Singh wrote:
> On Fri, 22 Jan 2016 12:49:02 +0530
> Shilpasri G Bhat  wrote:
> 
> > cpu_to_chip_id() does a DT walk through to find out the chip id by
> > taking a contended device tree lock. This adds an unnecessary overhead
> > in a hot path. So instead of calling cpu_to_chip_id() everytime cache
> > the chip ids for all cores in the array 'core_to_chip_map' and use it
> > in the hotpath.
> > 
> > Reported-by: Anton Blanchard 
> > Signed-off-by: Shilpasri G Bhat 
> > Reviewed-by: Gautham R. Shenoy 
> 
> snip
> 
> Does the core_to_chip_map need to be updated/refreshed on/after/ a
> cpu (core) hotplug? I presume id's don't change

No, the id's don't change on cpu/core hotplug. The core_to_chip_map is
initialized in init_chip_info() where we use for_each_possible_cpu().

Thanks for the review!
> 
> Balbir
> 
--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 1/6] cpufreq: powernv: Free 'chips' on module exit

2016-01-28 Thread Gautham R Shenoy

On Thu, Jan 28, 2016 at 12:55:36PM +0530, Shilpasri G Bhat wrote:
> This will free the dynamically allocated memory of'chips' on
> module exit.
> 
> Signed-off-by: Shilpasri G Bhat 

Reviewed-by: Gautham R. Shenoy 

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/6] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path

2016-01-28 Thread Gautham R Shenoy

On Thu, Jan 28, 2016 at 12:55:38PM +0530, Shilpasri G Bhat wrote:
> cpu_to_chip_id() does a DT walk through to find out the chip id by
> taking a contended device tree lock. This adds an unnecessary overhead
> in a hot path. So instead of calling cpu_to_chip_id() everytime cache
> the chip ids for all cores in the array 'core_to_chip_map' and use it
> in the hotpath.
> 
> Reported-by: Anton Blanchard 
> Signed-off-by: Shilpasri G Bhat 

Reviewed-by: Gautham R. Shenoy 

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 5/6] cpufreq: powernv: Replace pr_info with trace print for throttle event

2016-01-28 Thread Gautham R Shenoy

On Thu, Jan 28, 2016 at 12:55:40PM +0530, Shilpasri G Bhat wrote:
> Currently we use printk message to notify the throttle event. But this
> can flood the console if the cpu is throttled frequently. So replace the
> printk with the tracepoint to notify the throttle event. And also events
> like throttle below nominal frequency and OCC_RESET are reduced to
> pr_warn/pr_warn_once as pointed by MFG to not mark them as critical
> messages. This patch adds 'throttle_reason' to struct chip to store the
> throttle reason.
> 
> Signed-off-by: Shilpasri G Bhat 

Reviewed-by: Gautham R. Shenoy 

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 6/6] cpufreq: powernv: Add sysfs attributes to show throttle stats

2016-01-28 Thread Gautham R Shenoy

Hi Shilpa,

A minor nit.
On Thu, Jan 28, 2016 at 12:55:41PM +0530, Shilpasri G Bhat wrote:

[..snip..]
> +
> +What:
> /sys/devices/system/cpu/cpufreq/chip*/throttle_reasons/
> +Date:Jan 2016
> +Contact: Linux kernel mailing list 
> + Linux for PowerPC mailing list 
> +Description: CPU Frequency throttle reason stat for the chip
> +
> + This directory contains throttle reason files. Each file gives
> + the total number of times the max frequency is throttled, except
> + for 'unthrottle_count', which gives the total number of times
> + the max frequency is unthrottled after being throttled. Below
> + are the reason attributes.
> +
> + cpu_over_temperature: Throttled due to cpu over temperature
> +
> + occ_reset: Throttled due to reset of OCC
> +
> + over_current: Throttled due to over current

Overcurrent is a single word. No need of the extra space. You could
fix that and add my Reviewed-by.

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v8 6/6] cpufreq: powernv: Add sysfs attributes to show throttle stats

2016-02-03 Thread Gautham R Shenoy

Hi Viresh,

> 
> What I can suggest is:
> - Move this directory inside cpuX/cpufreq/ directory, in a similar way
>   as to how we create 'stats' directory today.
> - You can then get policy->cpu, to get chip->id out of it.
> - The only disadvantage here is that the same chip directory will be
>   replicated in multiple policies, but that makes it more readable.

Thinking about it, having a sysfs group attached to a policy kobject
looks ok if replication of the same chip information across multiple
policies is not objectionable.

Regarding the table-format, it breaks the sysfs's one-value-per-file
rule. So I would still prefer each throttle reason being a separate
file which gives the number of times the chip frequency was throttled
due to that reason. We can live without the per-frequency
throttle stats listed in the throttle_status.

So, would the following be sysfs group structure be acceptable?

$ls -1 /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/
unthrottle
powercap
overtemp
supply_fault
overcurrent
occ_reset
turbo_stat
sub_turbo_stat

--
Thanks and Regards
gautham.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] powerpc/pseries: Only wait for dying CPU after call to rtas_stop_self()

2019-02-28 Thread Gautham R Shenoy

Hello Thiago,

On Fri, Feb 22, 2019 at 07:57:52PM -0300, Thiago Jung Bauermann wrote:
> When testing DLPAR CPU add/remove on a system under stress,
> pseries_cpu_die() doesn't wait long enough for a CPU to die:
> 
> [  446.983944] cpu 148 (hwid 148) Ready to die...
> [  446.984062] cpu 149 (hwid 149) Ready to die...
> [  446.993518] cpu 150 (hwid 150) Ready to die...
> [  446.993543] Querying DEAD? cpu 150 (150) shows 2
> [  446.994098] cpu 151 (hwid 151) Ready to die...
> [  447.133726] cpu 136 (hwid 136) Ready to die...
> [  447.403532] cpu 137 (hwid 137) Ready to die...
> [  447.403772] cpu 138 (hwid 138) Ready to die...
> [  447.403839] cpu 139 (hwid 139) Ready to die...
> [  447.403887] cpu 140 (hwid 140) Ready to die...
> [  447.403937] cpu 141 (hwid 141) Ready to die...
> [  447.403979] cpu 142 (hwid 142) Ready to die...
> [  447.404038] cpu 143 (hwid 143) Ready to die...
> [  447.513546] cpu 128 (hwid 128) Ready to die...
> [  447.693533] cpu 129 (hwid 129) Ready to die...
> [  447.693999] cpu 130 (hwid 130) Ready to die...
> [  447.703530] cpu 131 (hwid 131) Ready to die...
> [  447.704087] Querying DEAD? cpu 132 (132) shows 2
> [  447.704102] cpu 132 (hwid 132) Ready to die...
> [  447.713534] cpu 133 (hwid 133) Ready to die...
> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> 
> This is a race between one CPU stopping and another one calling
> pseries_cpu_die() to wait for it to stop. That function does a short busy
> loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify
> that it is stopped, but I think there's a lot for the stopping CPU to do
> which may take longer than this loop allows.
> 
> As can be seen in the dmesg right before or after the "Querying DEAD?"
> messages, if pseries_cpu_die() waited a little longer it would have seen
> the CPU in the stopped state.

> 
> I see two cases that can be causing this race:
> 
> 1. It's possible that CPU 134 was inactive at the time it was unplugged. In
>that case, dlpar_offline_cpu() calls H_PROD on that CPU and immediately
>calls pseries_cpu_die(). Meanwhile, the prodded CPU activates and start
>the process of stopping itself. It's possible that the busy loop is not
>long enough to allow for the CPU to wake up and complete the stopping
>process.

The problem is a bit more severe since, after printing "Querying
DEAD?" for CPU X, this CPU can prod another offline CPU Y on the same
core which, on waking up, will call rtas_stop_self. Thus we can have two
concurrent calls to rtas-stop-self, which is prohibited by the PAPR.


> 
> 2. If CPU 134 was online at the time it was unplugged, it would have gone
>through the new CPU hotplug state machine in kernel/cpu.c that was
>introduced in v4.6 to get itself stopped. It's possible that the busy
>loop in pseries_cpu_die() was long enough for the older hotplug code but
>not for the new hotplug state machine.

I haven't been able to observe the "Querying DEAD?" messages for the
online CPU which was being offlined and dlpar'ed out.

> 
> I don't know if this race condition has any ill effects, but we can make
> the race a lot more even if we only start querying if the CPU is stopped
> when the stopping CPU is close to call rtas_stop_self().
> 
> Since pseries_mach_cpu_die() sets the CPU current state to offline almost
> immediately before calling rtas_stop_self(), we use that as a signal that
> it is either already stopped or very close to that point, and we can start
> the busy loop.
> 
> As suggested by Michael Ellerman, this patch also changes the busy loop to
> wait for a fixed amount of wall time.
> 
> Signed-off-by: Thiago Jung Bauermann 
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> I tried to estimate good amounts for the timeout and loop delays, but
> I'm not sure how reasonable my numbers are. The busy loops will wait for
> 100 µs between each try, and spin_event_timeout() will timeout after
> 100 ms. I'll be happy to change these values if you have better
> suggestions.

Based on the measurements that I did on a POWER9 system, in successful
cases of smp_query_cpu_stopped(cpu) returning affirmative, the maximum
time spent inside the loop was was 10ms.


> Gautham was able to test this patch and it solved the race condition.
> 
> v1 was a cruder patch which just increased the number of loops:
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html
> 
> v1 also mentioned a kernel crash but Gautham narrowed it down to a bug
> in RTAS, which is in the process of being fixed.
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 97feb6e79f1a..424146cc752e 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -214,13 +214,21 @@ static void pseries_cpu_die(unsigned int cpu)
>   msleep(1);
>

Re: [PATCH] powernv: powercap: Add hard minimum powercap

2019-02-28 Thread Gautham R Shenoy

Hi Shilpa,

On Thu, Feb 28, 2019 at 11:25:25AM +0530, Shilpasri G Bhat wrote:
> Hi,
> 
> On 02/28/2019 10:14 AM, Daniel Axtens wrote:
> > Shilpasri G Bhat  writes:
> > 
> >> In POWER9, OCC(On-Chip-Controller) provides for hard and soft system
> >> powercapping range. The hard powercap range is guaranteed while soft
> >> powercap may or may not be asserted due to various power-thermal
> >> reasons based on system configuration and workloads. This patch adds
> >> a sysfs file to export the hard minimum powercap limit to allow the
> >> user to set the appropriate powercap value that can be managed by the
> >> system.
> > 
> > Maybe it's common terminology and I'm just not aware of it, but what do
> > you mean by "asserted"? It doesn't appear elsewhere in the documentation
> > you're patching, and it's not a use of assert that I'm familiar with...
> > 
> > Regards,
> > Daniel
> > 
> 
> I meant to say powercap will not be assured in the soft powercap range, i.e,
> system may or may not be throttled of CPU frequency to remain within the 
> powercap.
> 
> I can reword the document and commit message.

I agree with Daniel. How about replacing "asserted" with "enforced by
the OCC"?


> 
> Thanks and Regards,
> Shilpa

--
Thanks and Regards
gautham.

> 
> >>
> >> Signed-off-by: Shilpasri G Bhat 
> >> ---
> >>  .../ABI/testing/sysfs-firmware-opal-powercap   | 10 
> >>  arch/powerpc/platforms/powernv/opal-powercap.c | 66 
> >> +-
> >>  2 files changed, 37 insertions(+), 39 deletions(-)
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-powercap 
> >> b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> index c9b66ec..65db4c1 100644
> >> --- a/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> +++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> @@ -29,3 +29,13 @@ Description:System powercap directory and 
> >> attributes applicable for
> >>  creates a request for setting a new-powercap. The
> >>  powercap requested must be between powercap-min
> >>  and powercap-max.
> >> +
> >> +What: 
> >> /sys/firmware/opal/powercap/system-powercap/powercap-hard-min
> >> +Date: Feb 2019
> >> +Contact:  Linux for PowerPC mailing list 
> >> +Description:  Hard minimum powercap
> >> +
> >> +  This file provides the hard minimum powercap limit in
> >> +  Watts. The powercap value above hard minimum is always
> >> +  guaranteed to be asserted and the powercap value below
> >> +  the hard minimum limit may or may not be guaranteed.
> >> diff --git a/arch/powerpc/platforms/powernv/opal-powercap.c 
> >> b/arch/powerpc/platforms/powernv/opal-powercap.c
> >> index d90ee4f..38408e7 100644
> >> --- a/arch/powerpc/platforms/powernv/opal-powercap.c
> >> +++ b/arch/powerpc/platforms/powernv/opal-powercap.c
> >> @@ -139,10 +139,24 @@ static void powercap_add_attr(int handle, const char 
> >> *name,
> >>attr->handle = handle;
> >>sysfs_attr_init(&attr->attr.attr);
> >>attr->attr.attr.name = name;
> >> -  attr->attr.attr.mode = 0444;
> >> +
> >> +  if (!strncmp(name, "powercap-current", strlen(name))) {
> >> +  attr->attr.attr.mode = 0664;
> >> +  attr->attr.store = powercap_store;
> >> +  } else {
> >> +  attr->attr.attr.mode = 0444;
> >> +  }
> >> +
> >>attr->attr.show = powercap_show;
> >>  }
> >>  
> >> +static const char * const powercap_strs[] = {
> >> +  "powercap-max",
> >> +  "powercap-min",
> >> +  "powercap-current",
> >> +  "powercap-hard-min",
> >> +};
> >> +
> >>  void __init opal_powercap_init(void)
> >>  {
> >>struct device_node *powercap, *node;
> >> @@ -167,60 +181,34 @@ void __init opal_powercap_init(void)
> >>  
> >>i = 0;
> >>for_each_child_of_node(powercap, node) {
> >> -  u32 cur, min, max;
> >> -  int j = 0;
> >> -  bool has_cur = false, has_min = false, has_max = false;
> >> +  u32 id;
> >> +  int j, count = 0;
> >>  
> >> -  if (!of_property_read_u32(node, "powercap-min", &min)) {
> >> -  j++;
> >> -  has_min = true;
> >> -  }
> >> -
> >> -  if (!of_property_read_u32(node, "powercap-max", &max)) {
> >> -  j++;
> >> -  has_max = true;
> >> -  }
> >> +  for (j = 0; j < ARRAY_SIZE(powercap_strs); j++)
> >> +  if (!of_property_read_u32(node, powercap_strs[j], &id))
> >> +  count++;
> >>  
> >> -  if (!of_property_read_u32(node, "powercap-current", &cur)) {
> >> -  j++;
> >> -  has_cur = true;
> >> -  }
> >> -
> >> -  pcaps[i].pattrs = kcalloc(j, sizeof(struct powercap_attr),
> >> +  pcaps[i].pattrs = kcalloc(count, sizeof(struct powercap_attr),
> >>  GFP_KERNEL);
> >>if (!pcaps[i].pattrs)
> >>

Re: [PATCH v2] powerpc/powernv/idle: Restore IAMR after idle

2019-02-28 Thread Gautham R Shenoy

Hello Russell,

On Fri, Feb 08, 2019 at 10:11:03PM +1100, Russell Currey wrote:
> Without restoring the IAMR after idle, execution prevention on POWER9
> with Radix MMU is overwritten and the kernel can freely execute userspace 
> without
> faulting.
> 
> This is necessary when returning from any stop state that modifies user
> state, as well as hypervisor state.
> 
> To test how this fails without this patch, load the lkdtm driver and
> do the following:
> 
>echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
> 
> which won't fault, then boot the kernel with powersave=off, where it
> will fault.  Applying this patch will fix this.
> 
> Fixes: 3b10d0095a1e ("powerpc/mm/radix: Prevent kernel execution of user
> space")
> Cc: 
> Signed-off-by: Russell Currey 
> ---
> Since v1:
>   - no longer use paca to save IAMR, instead use _DAR (thanks mpe)

Looks good to me. Once we move to Nick Piggin's C-based save/restore
code, we will be saving all these SPR values on the stack anyway.

Reviewed-by: Gautham R. Shenoy 

--
Thanks and Regards
gautham.

[PATCH] pseries/energy: Use OF accessor functions to read ibm, drc-indexes

2019-03-08 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent,
we currently use of_read_property() to obtain the pointer to the array
corresponding to the property "ibm,drc-indexes". The elements of this
array are of type __be32, but are accessed without any conversion to
the OS-endianness, which is buggy on a Little Endian OS.

Fix this by using of_property_read_u32_index() accessor function to
safely read the elements of the array.

Fixes: commit e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU 
indexes")
Cc:  #v4.16+
Reported-by: Pavithra R. Prakash 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/pseries/pseries_energy.c | 27 -
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
index 6ed2212..1c4d1ba 100644
--- a/arch/powerpc/platforms/pseries/pseries_energy.c
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -77,18 +77,27 @@ static u32 cpu_to_drc_index(int cpu)
 
ret = drc.drc_index_start + (thread_index * drc.sequential_inc);
} else {
-   const __be32 *indexes;
-
-   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
-   if (indexes == NULL)
-   goto err_of_node_put;
+   u32 nr_drc_indexes, thread_drc_index;
 
/*
-* The first element indexes[0] is the number of drc_indexes
-* returned in the list.  Hence thread_index+1 will get the
-* drc_index corresponding to core number thread_index.
+* The first element of ibm,drc-indexes array is the
+* number of drc_indexes returned in the list.  Hence
+* thread_index+1 will get the drc_index corresponding
+* to core number thread_index.
 */
-   ret = indexes[thread_index + 1];
+   rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
+   0, &nr_drc_indexes);
+   if (rc)
+   goto err_of_node_put;
+
+   WARN_ON(thread_index > nr_drc_indexes);
+   rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
+   thread_index + 1,
+   &thread_drc_index);
+   if (rc)
+   goto err_of_node_put;
+
+   ret = thread_drc_index;
}
 
rc = 0;
-- 
1.9.4

Re: [PATCH v3] powerpc/pseries: Only wait for dying CPU after call to rtas_stop_self()

2019-03-12 Thread Gautham R Shenoy

Hello Thiago,

On Mon, Mar 11, 2019 at 04:35:17PM -0300, Thiago Jung Bauermann wrote:
> When testing DLPAR CPU add/remove on a system under stress,
> pseries_cpu_die() doesn't wait long enough for a CPU to die:
> 
> [  446.983944] cpu 148 (hwid 148) Ready to die...
> [  446.984062] cpu 149 (hwid 149) Ready to die...
> [  446.993518] cpu 150 (hwid 150) Ready to die...
> [  446.993543] Querying DEAD? cpu 150 (150) shows 2
> [  446.994098] cpu 151 (hwid 151) Ready to die...
> [  447.133726] cpu 136 (hwid 136) Ready to die...
> [  447.403532] cpu 137 (hwid 137) Ready to die...
> [  447.403772] cpu 138 (hwid 138) Ready to die...
> [  447.403839] cpu 139 (hwid 139) Ready to die...
> [  447.403887] cpu 140 (hwid 140) Ready to die...
> [  447.403937] cpu 141 (hwid 141) Ready to die...
> [  447.403979] cpu 142 (hwid 142) Ready to die...
> [  447.404038] cpu 143 (hwid 143) Ready to die...
> [  447.513546] cpu 128 (hwid 128) Ready to die...
> [  447.693533] cpu 129 (hwid 129) Ready to die...
> [  447.693999] cpu 130 (hwid 130) Ready to die...
> [  447.703530] cpu 131 (hwid 131) Ready to die...
> [  447.704087] Querying DEAD? cpu 132 (132) shows 2
> [  447.704102] cpu 132 (hwid 132) Ready to die...
> [  447.713534] cpu 133 (hwid 133) Ready to die...
> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> 
> This is a race between one CPU stopping and another one calling
> pseries_cpu_die() to wait for it to stop. That function does a short busy
> loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify
> that it is stopped, but I think there's a lot for the stopping CPU to do
> which may take longer than this loop allows.
> 
> As can be seen in the dmesg right before or after the "Querying DEAD?"
> messages, if pseries_cpu_die() waited a little longer it would have seen
> the CPU in the stopped state.
> 
> What I think is going on is that CPU 134 was inactive at the time it was
> unplugged. In that case, dlpar_offline_cpu() calls H_PROD on that CPU and
> immediately calls pseries_cpu_die(). Meanwhile, the prodded CPU activates
> and start the process of stopping itself. The busy loop is not long enough
> to allow for the CPU to wake up and complete the stopping process.
> 
> This can be a problem because if the busy loop finishes too early, then the
> kernel may offline another CPU before the previous one finished dying,
> which would lead to two concurrent calls to rtas-stop-self, which is
> prohibited by the PAPR.
> 
> We can make the race a lot more even if we only start querying if the CPU
> is stopped when the stopping CPU is close to call rtas_stop_self(). Since
> pseries_mach_cpu_die() sets the CPU current state to offline almost
> immediately before calling rtas_stop_self(), we use that as a signal that
> it is either already stopped or very close to that point, and we can start
> the busy loop.
> 
> As suggested by Michael Ellerman, this patch also changes the busy loop to
> wait for a fixed amount of wall time. Based on the measurements that
> Gautham did on a POWER9 system, in successful cases of
> smp_query_cpu_stopped(cpu) returning affirmative, the maximum time spent
> inside the loop was was 10 ms. This patch loops for 20 ms just be sure.
> 
> Signed-off-by: Thiago Jung Bauermann 

Thanks for this version. I have tested the patch and we no longer see
the "Querying DEAD? cpu X (Y) shows 2" message.


Tested-and-Reviewed-by: Gautham R. Shenoy 


--
Thanks and Regards
gautham.

Re: [PATCH 1/2] cpuidle : auto-promotion for cpuidle states

2019-04-04 Thread Gautham R Shenoy

Hello Daniel,

On Thu, Apr 4, 2019 at 3:52 PM Daniel Lezcano  wrote:
>
>
> Hi Abhishek,
>
> thanks for taking the time to test the different scenario and give us
> the numbers.
>
> On 01/04/2019 07:11, Abhishek wrote:
> >

[.. snip..]

>
>  In case of POWER, this is problematic, when the predicted state in the
>  aforementioned scenario is a lite stop state, as such lite states will
>  inhibit SMT folding, thereby depriving the other threads in the core
>  from
>  using the core resources.
>
> I can understand an idle state can prevent other threads to use the core
> resources. But why a deeper idle state does not prevent this also?

On POWER9, we have the following classes of platform idle states
(called stop states)

lite: These do not lose any context including the user context. In
this state
   GPRs are also preserved (stop0_lite)

shallow : These lose user context,but not the hypervisor context. So
GPRs are lost but
   not SPRs. (stop0, stop1, stop2)

deep: These lose hypervisor context. (stop4, stop5)

In the case of lite stop state, only instruction dispatch on the CPU thread
is paused. The thread does not  give up its registers set in this state for the
use of its busy sibling threads in the core.  Hence, SMT folding does not
happen in this state.

With respect to shallow and deep states, not only is the instruction dispatch
paused, it also gives up its registers set for the other siblings to use
These stop states are beneficial for SMT folding.

Hence, if a CPU thread remains in a lite state for too long,
its siblings in the core would not be able to utilize the full resources
of the core for that duration, thereby inhibiting single
thread performance. This is not the case with non-lite states.

-- 
Thanks and Regards
gautham.

Re: [PATCH] powerpc/watchdog: Use hrtimers for per-CPU heartbeat

2019-04-04 Thread Gautham R Shenoy

Hello Nicholas,

On Tue, Apr 2, 2019 at 4:57 PM Nicholas Piggin  wrote:
>
> Using a jiffies timer creates a dependency on the tick_do_timer_cpu
> incrementing jiffies. If that CPU has locked up and jiffies is not
> incrementing, the watchdog heartbeat timer for all CPUs stops and
> creates false positives and confusing warnings on local CPUs, and
> also causes the SMP detector to stop, so the root cause is never
> detected.
>
> Fix this by using hrtimer based timers for the watchdog heartbeat,
> like the generic kernel hardlockup detector.
>
> Reported-by: Ravikumar Bangoria 
> Signed-off-by: Nicholas Piggin 

[..snip..]

> @@ -325,19 +325,21 @@ EXPORT_SYMBOL(arch_touch_nmi_watchdog);
>
>  static void start_watchdog_timer_on(unsigned int cpu)
>  {
> -   struct timer_list *t = per_cpu_ptr(&wd_timer, cpu);
> +   struct hrtimer *hrtimer = this_cpu_ptr(&wd_hrtimer);

This function can be called during the initialization via

watchdog_nmi_start -->
for_each_online_cpu(cpu)
   start_wd_on_cpu(cpu) -->
   start_watchdog_timer_on(cpu)

Thus, it is not guarateed that we are always calling
start_watchdog_timer_on() from the CPU where
we want to start the watchdog timer.

Thus, should we be calling this function from start_wd_on_cpu() via an
smp_call_function_single() ?


>
> per_cpu(wd_timer_tb, cpu) = get_tb();
>
> -   timer_setup(t, wd_timer_fn, TIMER_PINNED);
> -   wd_timer_reset(cpu, t);
> +   hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +   hrtimer->function = watchdog_timer_fn;
> +   hrtimer_start(hrtimer, ms_to_ktime(wd_timer_period_ms),
> + HRTIMER_MODE_REL_PINNED);
>  }
>
>  static void stop_watchdog_timer_on(unsigned int cpu)
>  {
> -   struct timer_list *t = per_cpu_ptr(&wd_timer, cpu);
> +   struct hrtimer *hrtimer = this_cpu_ptr(&wd_hrtimer);
>
> -   del_timer_sync(t);
> +   hrtimer_cancel(hrtimer);
>  }
>
>  static int start_wd_on_cpu(unsigned int cpu)
> --
> 2.20.1
>


-- 
Thanks and Regards
gautham.

Re: [PATCH v2] powerpc/watchdog: Use hrtimers for per-CPU heartbeat

2019-04-11 Thread Gautham R Shenoy

Hello Nicholas,


On Tue, Apr 09, 2019 at 02:40:05PM +1000, Nicholas Piggin wrote:
> Using a jiffies timer creates a dependency on the tick_do_timer_cpu
> incrementing jiffies. If that CPU has locked up and jiffies is not
> incrementing, the watchdog heartbeat timer for all CPUs stops and
> creates false positives and confusing warnings on local CPUs, and
> also causes the SMP detector to stop, so the root cause is never
> detected.
> 
> Fix this by using hrtimer based timers for the watchdog heartbeat,
> like the generic kernel hardlockup detector.
> 
> Cc: Gautham R. Shenoy 
> Reported-by: Ravikumar Bangoria 
> Signed-off-by: Nicholas Piggin 

Looks good to me.

Reviewed-by: Gautham R. Shenoy 


--
Thanks and Regards
gautham.

[PATCH v3 4/6] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

The total PURR and SPURR ticks are already exposed via the per-cpu
sysfs files "purr" and "spurr". This patch adds support for exposing
the idle PURR and SPURR ticks via new per-cpu sysfs files named
"idle_purr" and "idle_spurr".

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/sysfs.c | 54 ++---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 479c706..c9ddb83 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "cacheinfo.h"
@@ -760,6 +761,42 @@ static void create_svm_file(void)
 }
 #endif /* CONFIG_PPC_SVM */
 
+static void read_idle_purr(void *val)
+{
+   u64 *ret = val;
+
+   *ret = read_this_idle_purr();
+}
+
+static ssize_t idle_purr_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   u64 val;
+
+   smp_call_function_single(cpu->dev.id, read_idle_purr, &val, 1);
+   return sprintf(buf, "%llx\n", val);
+}
+static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL);
+
+static void read_idle_spurr(void *val)
+{
+   u64 *ret = val;
+
+   *ret = read_this_idle_spurr();
+}
+
+static ssize_t idle_spurr_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   u64 val;
+
+   smp_call_function_single(cpu->dev.id, read_idle_spurr, &val, 1);
+   return sprintf(buf, "%llx\n", val);
+}
+static DEVICE_ATTR(idle_spurr, 0400, idle_spurr_show, NULL);
+
 static int register_cpu_online(unsigned int cpu)
 {
struct cpu *c = &per_cpu(cpu_devices, cpu);
@@ -823,10 +860,15 @@ static int register_cpu_online(unsigned int cpu)
if (!firmware_has_feature(FW_FEATURE_LPAR))
add_write_permission_dev_attr(&dev_attr_purr);
device_create_file(s, &dev_attr_purr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_create_file(s, &dev_attr_idle_purr);
}
 
-   if (cpu_has_feature(CPU_FTR_SPURR))
+   if (cpu_has_feature(CPU_FTR_SPURR)) {
device_create_file(s, &dev_attr_spurr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_create_file(s, &dev_attr_idle_spurr);
+   }
 
if (cpu_has_feature(CPU_FTR_DSCR))
device_create_file(s, &dev_attr_dscr);
@@ -910,11 +952,17 @@ static int unregister_cpu_online(unsigned int cpu)
device_remove_file(s, &dev_attr_mmcra);
 #endif /* CONFIG_PMU_SYSFS */
 
-   if (cpu_has_feature(CPU_FTR_PURR))
+   if (cpu_has_feature(CPU_FTR_PURR)) {
device_remove_file(s, &dev_attr_purr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_remove_file(s, &dev_attr_idle_purr);
+   }
 
-   if (cpu_has_feature(CPU_FTR_SPURR))
+   if (cpu_has_feature(CPU_FTR_SPURR)) {
device_remove_file(s, &dev_attr_spurr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_remove_file(s, &dev_attr_idle_spurr);
+   }
 
if (cpu_has_feature(CPU_FTR_DSCR))
device_remove_file(s, &dev_attr_dscr);
-- 
1.9.4

[PATCH v3 2/6] powerpc/idle: Add accessor function to always read latest idle PURR

2020-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently when CPU goes idle, we take a snapshot of PURR via
pseries_idle_prolog() which is used at the CPU idle exit to compute
the idle PURR cycles via the function pseries_idle_epilog().  Thus,
the value of idle PURR cycle thus read before pseries_idle_prolog() and
after pseries_idle_epilog() is always correct.

However, if we were to read the idle PURR cycles from an interrupt
context between pseries_idle_prolog() and pseries_idle_epilog() (this will
be done in a future patch), then, the value of the idle PURR thus read
will not include the cycles spent in the most recent idle period.

This patch addresses the issue by providing accessor function to read
the idle PURR such such that it includes the cycles spent in the most
recent idle period, if we read it between pseries_idle_prolog() and
pseries_idle_epilog(). In order to achieve it, the patch saves the
snapshot of PURR in pseries_idle_prolog() in a per-cpu variable,
instead of on the stack, so that it can be accessed from an interrupt
context.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 46 +++---
 arch/powerpc/platforms/pseries/setup.c |  7 +++---
 drivers/cpuidle/cpuidle-pseries.c  | 15 +--
 3 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
index e838ea5..7552823 100644
--- a/arch/powerpc/include/asm/idle.h
+++ b/arch/powerpc/include/asm/idle.h
@@ -3,10 +3,27 @@
 #define _ASM_POWERPC_IDLE_H
 #include 
 
-static inline void pseries_idle_prolog(unsigned long *in_purr)
+DECLARE_PER_CPU(u64, idle_entry_purr_snap);
+
+static inline void snapshot_purr_idle_entry(void)
+{
+   *this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
+}
+
+static inline void update_idle_purr_accounting(void)
+{
+   u64 wait_cycles;
+   u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap);
+
+   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
+   wait_cycles += mfspr(SPRN_PURR) - in_purr;
+   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+}
+
+static inline void pseries_idle_prolog(void)
 {
ppc64_runlatch_off();
-   *in_purr = mfspr(SPRN_PURR);
+   snapshot_purr_idle_entry();
/*
 * Indicate to the HV that we are idle. Now would be
 * a good time to find other work to dispatch.
@@ -14,15 +31,26 @@ static inline void pseries_idle_prolog(unsigned long 
*in_purr)
get_lppaca()->idle = 1;
 }
 
-static inline void pseries_idle_epilog(unsigned long in_purr)
+static inline void pseries_idle_epilog(void)
 {
-   u64 wait_cycles;
-
-   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
-   wait_cycles += mfspr(SPRN_PURR) - in_purr;
-   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+   update_idle_purr_accounting();
get_lppaca()->idle = 0;
-
ppc64_runlatch_on();
 }
+
+static inline u64 read_this_idle_purr(void)
+{
+   /*
+* If we are reading from an idle context, update the
+* idle-purr cycles corresponding to the last idle period.
+* Since the idle context is not yet over, take a fresh
+* snapshot of the idle-purr.
+*/
+   if (unlikely(get_lppaca()->idle == 1)) {
+   update_idle_purr_accounting();
+   snapshot_purr_idle_entry();
+   }
+
+   return be64_to_cpu(get_lppaca()->wait_state_cycles);
+}
 #endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 2f53e6b..4905c96 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -318,10 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+DEFINE_PER_CPU(u64, idle_entry_purr_snap);
 static void pseries_lpar_idle(void)
 {
-   unsigned long in_purr;
-
/*
 * Default handler to go into low thread priority and possibly
 * low power mode by ceding processor to hypervisor
@@ -331,7 +330,7 @@ static void pseries_lpar_idle(void)
return;
 
/* Indicate to hypervisor that we are idle. */
-   pseries_idle_prolog(&in_purr);
+   pseries_idle_prolog();
 
/*
 * Yield the processor to the hypervisor.  We return if
@@ -342,7 +341,7 @@ static void pseries_lpar_idle(void)
 */
cede_processor();
 
-   pseries_idle_epilog(in_purr);
+   pseries_idle_epilog();
 }
 
 /*
diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 46d5e05..6513ef2 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -36,12 +36,11 @@ static int snooze_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
 {
-   unsig

[PATCH v3 3/6] powerpc/pseries: Account for SPURR ticks on idle CPUs

2020-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

Via pseries_idle_prolog(), pseries_idle_epilog(), we track the idle
PURR ticks in the VPA variable "wait_state_cycles". This patch extends
the support to account for the idle SPURR ticks. It also provides an
accessor function to accurately reads idle SPURR ticks.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 33 +
 arch/powerpc/platforms/pseries/setup.c |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
index 7552823..a375589 100644
--- a/arch/powerpc/include/asm/idle.h
+++ b/arch/powerpc/include/asm/idle.h
@@ -3,13 +3,20 @@
 #define _ASM_POWERPC_IDLE_H
 #include 
 
+DECLARE_PER_CPU(u64, idle_spurr_cycles);
 DECLARE_PER_CPU(u64, idle_entry_purr_snap);
+DECLARE_PER_CPU(u64, idle_entry_spurr_snap);
 
 static inline void snapshot_purr_idle_entry(void)
 {
*this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
 }
 
+static inline void snapshot_spurr_idle_entry(void)
+{
+   *this_cpu_ptr(&idle_entry_spurr_snap) = mfspr(SPRN_SPURR);
+}
+
 static inline void update_idle_purr_accounting(void)
 {
u64 wait_cycles;
@@ -20,10 +27,19 @@ static inline void update_idle_purr_accounting(void)
get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
 }
 
+static inline void update_idle_spurr_accounting(void)
+{
+   u64 *idle_spurr_cycles_ptr = this_cpu_ptr(&idle_spurr_cycles);
+   u64 in_spurr = *this_cpu_ptr(&idle_entry_spurr_snap);
+
+   *idle_spurr_cycles_ptr += mfspr(SPRN_SPURR) - in_spurr;
+}
+
 static inline void pseries_idle_prolog(void)
 {
ppc64_runlatch_off();
snapshot_purr_idle_entry();
+   snapshot_spurr_idle_entry();
/*
 * Indicate to the HV that we are idle. Now would be
 * a good time to find other work to dispatch.
@@ -34,6 +50,7 @@ static inline void pseries_idle_prolog(void)
 static inline void pseries_idle_epilog(void)
 {
update_idle_purr_accounting();
+   update_idle_spurr_accounting();
get_lppaca()->idle = 0;
ppc64_runlatch_on();
 }
@@ -53,4 +70,20 @@ static inline u64 read_this_idle_purr(void)
 
return be64_to_cpu(get_lppaca()->wait_state_cycles);
 }
+
+static inline u64 read_this_idle_spurr(void)
+{
+   /*
+* If we are reading from an idle context, update the
+* idle-spurr cycles corresponding to the last idle period.
+* Since the idle context is not yet over, take a fresh
+* snapshot of the idle-spurr.
+*/
+   if (get_lppaca()->idle == 1) {
+   update_idle_spurr_accounting();
+   snapshot_spurr_idle_entry();
+   }
+
+   return *this_cpu_ptr(&idle_spurr_cycles);
+}
 #endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 4905c96..1b55e80 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -318,7 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+DEFINE_PER_CPU(u64, idle_spurr_cycles);
 DEFINE_PER_CPU(u64, idle_entry_purr_snap);
+DEFINE_PER_CPU(u64, idle_entry_spurr_snap);
 static void pseries_lpar_idle(void)
 {
/*
-- 
1.9.4

[PATCH v3 5/6] Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr

2020-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Add documentation for the following sysfs interfaces:
/sys/devices/system/cpu/cpuX/purr
/sys/devices/system/cpu/cpuX/spurr
/sys/devices/system/cpu/cpuX/idle_purr
/sys/devices/system/cpu/cpuX/idle_spurr

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 39 ++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 2e0e3b4..bc07677 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -580,3 +580,42 @@ Description:   Secure Virtual Machine
If 1, it means the system is using the Protected Execution
Facility in POWER9 and newer processors. i.e., it is a Secure
Virtual Machine.
+
+What:  /sys/devices/system/cpu/cpuX/purr
+Date:  Apr 2005
+Contact:   Linux for PowerPC mailing list 
+Description:   PURR ticks for this CPU since the system boot.
+
+   The Processor Utilization Resources Register (PURR) is
+   a 64-bit counter which provides an estimate of the
+   resources used by the CPU thread. The contents of this
+   register increases monotonically. This sysfs interface
+   exposes the number of PURR ticks for cpuX.
+
+What:  /sys/devices/system/cpu/cpuX/spurr
+Date:  Dec 2006
+Contact:   Linux for PowerPC mailing list 
+Description:   SPURR ticks for this CPU since the system boot.
+
+   The Scaled Processor Utilization Resources Register
+   (SPURR) is a 64-bit counter that provides a frequency
+   invariant estimate of the resources used by the CPU
+   thread. The contents of this register increases
+   monotonically. This sysfs interface exposes the number
+   of SPURR ticks for cpuX.
+
+What:  /sys/devices/system/cpu/cpuX/idle_purr
+Date:  Mar 2020
+Contact:   Linux for PowerPC mailing list 
+Description:   PURR ticks for cpuX when it was idle.
+
+   This sysfs interface exposes the number of PURR ticks
+   for cpuX when it was idle.
+
+What:  /sys/devices/system/cpu/cpuX/idle_spurr
+Date:  Mar 2020
+Contact:   Linux for PowerPC mailing list 
+Description:   SPURR ticks for cpuX when it was idle.
+
+   This sysfs interface exposes the number of SPURR ticks
+   for cpuX when it was idle.
-- 
1.9.4

[PATCH v3 1/6] powerpc: Move idle_loop_prolog()/epilog() functions to header file

2020-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently prior to entering an idle state on a Linux Guest, the
pseries cpuidle driver implement an idle_loop_prolog() and
idle_loop_epilog() functions which ensure that idle_purr is correctly
computed, and the hypervisor is informed that the CPU cycles have been
donated.

These prolog and epilog functions are also required in the default
idle call, i.e pseries_lpar_idle(). Hence move these accessor
functions to a common header file and call them from
pseries_lpar_idle(). Since the existing header files such as
asm/processor.h have enough clutter, create a new header file
asm/idle.h. Finally rename idle_loop_prolog() and idle_loop_epilog()
to pseries_idle_prolog() and pseries_idle_epilog() as they are only
relavent for on pseries guests.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 28 ++
 arch/powerpc/platforms/pseries/setup.c |  7 +--
 drivers/cpuidle/cpuidle-pseries.c  | 36 +++---
 3 files changed, 40 insertions(+), 31 deletions(-)
 create mode 100644 arch/powerpc/include/asm/idle.h

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
new file mode 100644
index 000..e838ea5
--- /dev/null
+++ b/arch/powerpc/include/asm/idle.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_IDLE_H
+#define _ASM_POWERPC_IDLE_H
+#include 
+
+static inline void pseries_idle_prolog(unsigned long *in_purr)
+{
+   ppc64_runlatch_off();
+   *in_purr = mfspr(SPRN_PURR);
+   /*
+* Indicate to the HV that we are idle. Now would be
+* a good time to find other work to dispatch.
+*/
+   get_lppaca()->idle = 1;
+}
+
+static inline void pseries_idle_epilog(unsigned long in_purr)
+{
+   u64 wait_cycles;
+
+   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
+   wait_cycles += mfspr(SPRN_PURR) - in_purr;
+   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+   get_lppaca()->idle = 0;
+
+   ppc64_runlatch_on();
+}
+#endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 0c8421d..2f53e6b 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -319,6 +320,8 @@ static int alloc_dispatch_log_kmem_cache(void)
 
 static void pseries_lpar_idle(void)
 {
+   unsigned long in_purr;
+
/*
 * Default handler to go into low thread priority and possibly
 * low power mode by ceding processor to hypervisor
@@ -328,7 +331,7 @@ static void pseries_lpar_idle(void)
return;
 
/* Indicate to hypervisor that we are idle. */
-   get_lppaca()->idle = 1;
+   pseries_idle_prolog(&in_purr);
 
/*
 * Yield the processor to the hypervisor.  We return if
@@ -339,7 +342,7 @@ static void pseries_lpar_idle(void)
 */
cede_processor();
 
-   get_lppaca()->idle = 0;
+   pseries_idle_epilog(in_purr);
 }
 
 /*
diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 74c2479..46d5e05 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct cpuidle_driver pseries_idle_driver = {
@@ -31,29 +32,6 @@ struct cpuidle_driver pseries_idle_driver = {
 static u64 snooze_timeout __read_mostly;
 static bool snooze_timeout_en __read_mostly;
 
-static inline void idle_loop_prolog(unsigned long *in_purr)
-{
-   ppc64_runlatch_off();
-   *in_purr = mfspr(SPRN_PURR);
-   /*
-* Indicate to the HV that we are idle. Now would be
-* a good time to find other work to dispatch.
-*/
-   get_lppaca()->idle = 1;
-}
-
-static inline void idle_loop_epilog(unsigned long in_purr)
-{
-   u64 wait_cycles;
-
-   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
-   wait_cycles += mfspr(SPRN_PURR) - in_purr;
-   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
-   get_lppaca()->idle = 0;
-
-   ppc64_runlatch_on();
-}
-
 static int snooze_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
@@ -63,7 +41,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 
set_thread_flag(TIF_POLLING_NRFLAG);
 
-   idle_loop_prolog(&in_purr);
+   pseries_idle_prolog(&in_purr);
local_irq_enable();
snooze_exit_time = get_tb() + snooze_timeout;
 
@@ -87,7 +65,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 
local_irq_disable();
 
-   idle_loop_epilog(in_purr);
+   pseries_idle_epilog(in_purr);
 
return index;
 }
@@ -115,7 +93,7 @@ static int dedicated_cede_loop(struct c

[PATCH v3 0/6] Track and expose idle PURR and SPURR ticks

2020-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

This is the third version of the patches to track and expose idle PURR
and SPURR ticks. These patches are required by tools such as lparstat
to compute system utilization for capacity planning purposes.

The previous versions can be found here:
v2: https://lkml.org/lkml/2020/2/21/21
v1: https://lore.kernel.org/patchwork/cover/1159341/

They key changes from v2 are:

- The prolog and epilog functions have been named
  pseries_idle_prolog() and pseries_idle_epilog() respectively to
  indicate their pseries specific nature.

- Fixed the Documentation for
  /sys/devices/system/cpu/cpuX/idle_spurr as pointed out by
  Nathan Lynch.

- Introduces a patch (Patch 6/6) to send an IPI in order to read
  and cache the values of purr, spurr, idle_purr and idle_spurr of
  the target CPU when any one of them is read via sysfs. These
  cached values will be presented if any of these sysfs are read
  within the next 10ms. If these sysfs files are read after 10ms
  from the earlier IPI, a fresh IPI is issued to read and cache
  the values again. This minimizes the number of IPIs required to
  be sent when these values are read back-to-back via the sysfs
  interface.

Test-results: While reading the four sysfs files back-to-back for a
given CPU every second for 100 seconds.

Without patch 6/6 (Without caching): 
 16 [XICS 2 Edge IPI] = 422 times
 DBL [Doorbell interrupts] = 13 times
 Total : 435 IPIs.

With patch 6/6 (With caching):
  16 [XICS 2 Edge IPI] = 111 times
  DBL [Doorbell interrupts] = 17 times
  Total : 128 IPIs.
Motivation:
===
On PSeries LPARs, the data centers planners desire a more accurate
view of system utilization per resource such as CPU to plan the system
capacity requirements better. Such accuracy can be obtained by reading
PURR/SPURR registers for CPU resource utilization.

Tools such as lparstat which are used to compute the utilization need
to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
counters are already exposed through sysfs.  We already account for
PURR ticks when we go to idle so that we can update the VPA area. This
patchset extends support to account for SPURR ticks when idle, and
expose both via per-cpu sysfs files.

These patches are required for enhancement to the lparstat utility
that compute the CPU utilization based on PURR and SPURR which can be
found here :
https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4


With the patches, when lparstat is run on a LPAR running CPU-Hogs,
=
sudo ./src/lparstat -E 1 3
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
1  99.99   0.00  3.35GHz[111%] 110.99   0.00
2 100.00   0.00  3.35GHz[111%] 111.00   0.00
3 100.00   0.00  3.35GHz[111%] 111.00   0.00
=

When lparstat is run on an LPAR that is idle,
=
$ sudo ./src/lparstat -E 1 3
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
1   0.71  99.30  2.18GHz[ 72%]   0.53  71.48
2   0.56  99.44  2.11GHz[ 70%]   0.43  69.57
3   0.54  99.46  2.11GHz[ 70%]   0.43  69.57
=====
 

Gautham R. Shenoy (6):
  powerpc: Move idle_loop_prolog()/epilog() functions to header file
  powerpc/idle: Add accessor function to always read latest idle PURR
  powerpc/pseries: Account for SPURR ticks on idle CPUs
  powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
  Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr
  pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr

 Documentation/ABI/testing/sysfs-devices-system-cpu |  39 ++
 arch/powerpc/include/asm/idle.h|  89 ++
 arch/powerpc/kernel/sysfs.c| 133 +++--
 arch/powerpc/platforms/pseries/setup.c |   8 +-
 drivers/cpuidle/cpuidle-pseries.c  |  39 ++
 5 files changed, 267 insertions(+), 41 deletions(-)
 create mode 100644 arch/powerpc/include/asm/idle.h

-- 
1.9.4

[PATCH v3 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr

2020-03-11 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently purr, spurr, idle_purr, idle_spurr are exposed for every CPU
via the sysfs interface
/sys/devices/system/cpu/cpuX/[idle_][s]purr. Each sysfs read currently
generates an IPI to obtain the desired value from the target CPU X.
Since these aforementioned sysfs files are typically read one after
another, we end up generating 4 IPIs per CPU in a short duration.

In order to minimize the IPI noise, this patch caches the values of
all the four entities whenever one of them is read. If subsequently
any of these are read within the next 10ms, the cached value is
returned. With this, we will generate at most one IPI every 10ms for
every CPU.

Test-results: While reading the four sysfs files back-to-back for a
given CPU every second for 100 seconds.

Without the patch:
 16 [XICS 2 Edge IPI] = 422 times
 DBL [Doorbell interrupts] = 13 times
 Total : 435 IPIs.

With the patch:
  16 [XICS 2 Edge IPI] = 111 times
  DBL [Doorbell interrupts] = 17 times
  Total : 128 IPIs.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/sysfs.c | 109 
 1 file changed, 90 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index c9ddb83..db8fc90 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -586,8 +586,6 @@ void ppc_enable_pmcs(void)
  * SPRs which are not related to PMU.
  */
 #ifdef CONFIG_PPC64
-SYSFS_SPRSETUP(purr, SPRN_PURR);
-SYSFS_SPRSETUP(spurr, SPRN_SPURR);
 SYSFS_SPRSETUP(pir, SPRN_PIR);
 SYSFS_SPRSETUP(tscr, SPRN_TSCR);
 
@@ -596,8 +594,6 @@ void ppc_enable_pmcs(void)
   enable write when needed with a separate function.
   Lets be conservative and default to pseries.
 */
-static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
-static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
 static DEVICE_ATTR(pir, 0400, show_pir, NULL);
 static DEVICE_ATTR(tscr, 0600, show_tscr, store_tscr);
 #endif /* CONFIG_PPC64 */
@@ -761,39 +757,114 @@ static void create_svm_file(void)
 }
 #endif /* CONFIG_PPC_SVM */
 
-static void read_idle_purr(void *val)
+/*
+ * The duration (in ms) from the last IPI to the target CPU until
+ * which a cached value of purr, spurr, idle_purr, idle_spurr can be
+ * reported to the user on a corresponding sysfs file read. Beyond
+ * this duration, fresh values need to be obtained by sending IPIs to
+ * the target CPU when the sysfs files are read.
+ */
+static unsigned long util_stats_staleness_tolerance_ms = 10;
+struct util_acct_stats {
+   u64 latest_purr;
+   u64 latest_spurr;
+   u64 latest_idle_purr;
+   u64 latest_idle_spurr;
+   unsigned long last_update_jiffies;
+};
+
+DEFINE_PER_CPU(struct util_acct_stats, util_acct_stats);
+
+static void update_util_acct_stats(void *ptr)
 {
-   u64 *ret = val;
+   struct util_acct_stats *stats = ptr;
 
-   *ret = read_this_idle_purr();
+   stats->latest_purr = mfspr(SPRN_PURR);
+   stats->latest_spurr = mfspr(SPRN_SPURR);
+   stats->latest_idle_purr = read_this_idle_purr();
+   stats->latest_idle_spurr = read_this_idle_spurr();
+   stats->last_update_jiffies = jiffies;
 }
 
-static ssize_t idle_purr_show(struct device *dev,
- struct device_attribute *attr, char *buf)
+struct util_acct_stats *get_util_stats_ptr(int cpu)
+{
+   struct util_acct_stats *stats = per_cpu_ptr(&util_acct_stats, cpu);
+   unsigned long delta_jiffies;
+
+   delta_jiffies = jiffies - stats->last_update_jiffies;
+
+   /*
+* If we have a recent enough data, reuse that instead of
+* sending an IPI.
+*/
+   if (jiffies_to_msecs(delta_jiffies) < util_stats_staleness_tolerance_ms)
+   return stats;
+
+   smp_call_function_single(cpu, update_util_acct_stats, stats, 1);
+   return stats;
+}
+
+static ssize_t show_purr(struct device *dev,
+struct device_attribute *attr, char *buf)
 {
struct cpu *cpu = container_of(dev, struct cpu, dev);
-   u64 val;
+   struct util_acct_stats *stats;
 
-   smp_call_function_single(cpu->dev.id, read_idle_purr, &val, 1);
-   return sprintf(buf, "%llx\n", val);
+   stats = get_util_stats_ptr(cpu->dev.id);
+   return sprintf(buf, "%llx\n", stats->latest_purr);
 }
-static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL);
 
-static void read_idle_spurr(void *val)
+static void write_purr(void *val)
 {
-   u64 *ret = val;
+   mtspr(SPRN_PURR, *(unsigned long *)val);
+}
 
-   *ret = read_this_idle_spurr();
+static ssize_t __used store_purr(struct device *dev,
+struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct cpu *cpu = container_of(dev, struct cpu,

Re: [PATCH] cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn

2020-03-20 Thread Gautham R Shenoy

On Mon, Mar 16, 2020 at 07:27:43PM +0530, Pratik Rajesh Sampat wrote:
> The patch avoids allocating cpufreq_policy on stack hence fixing frame
> size overflow in 'powernv_cpufreq_work_fn'
>

Thanks for fixing this.

> Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur 
> on unthrottling")
> Signed-off-by: Pratik Rajesh Sampat 

Reviewed-by: Gautham R. Shenoy 

> ---
>  drivers/cpufreq/powernv-cpufreq.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 56f4bc0d209e..20ee0661555a 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -902,6 +902,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>  void powernv_cpufreq_work_fn(struct work_struct *work)
>  {
>   struct chip *chip = container_of(work, struct chip, throttle);
> + struct cpufreq_policy *policy;
>   unsigned int cpu;
>   cpumask_t mask;
> 
> @@ -916,12 +917,14 @@ void powernv_cpufreq_work_fn(struct work_struct *work)
>   chip->restore = false;
>   for_each_cpu(cpu, &mask) {
>   int index;
> - struct cpufreq_policy policy;
> 
> - cpufreq_get_policy(&policy, cpu);
> - index = cpufreq_table_find_index_c(&policy, policy.cur);
> - powernv_cpufreq_target_index(&policy, index);
> - cpumask_andnot(&mask, &mask, policy.cpus);
> + policy = cpufreq_cpu_get(cpu);
> + if (!policy)
> + continue;
> + index = cpufreq_table_find_index_c(policy, policy->cur);
> + powernv_cpufreq_target_index(policy, index);
> + cpumask_andnot(&mask, &mask, policy->cpus);
> + cpufreq_cpu_put(policy);
>   }
>  out:
>   put_online_cpus();
> -- 
> 2.24.1
>

[PATCH v4 0/6] [PATCH v4 0/6] Track and expose idle PURR and SPURR ticks

2020-03-27 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Hi,

This is the fourth version of the patches to track and expose idle PURR
and SPURR ticks. These patches are required by tools such as lparstat
to compute system utilization for capacity planning purposes.

The previous versions can be found here:
v3: https://lkml.org/lkml/2020/3/11/331
v2: https://lkml.org/lkml/2020/2/21/21
v1: https://lore.kernel.org/patchwork/cover/1159341/

They key changes from v3 are:

   - Fixed the build errors on !CONFIG_PPC64 and !CONFIG_PPC_PSERIES
 configurations notified by the kbuild bot.


Motivation:
===
On PSeries LPARs, the data centers planners desire a more accurate
view of system utilization per resource such as CPU to plan the system
capacity requirements better. Such accuracy can be obtained by reading
PURR/SPURR registers for CPU resource utilization.

Tools such as lparstat which are used to compute the utilization need
to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
counters are already exposed through sysfs.  We already account for
PURR ticks when we go to idle so that we can update the VPA area. This
patchset extends support to account for SPURR ticks when idle, and
expose both via per-cpu sysfs files.

This patch series also introduces a patch (Patch 6/6) to send an IPI
in order to read and cache the values of purr, spurr, idle_purr and
idle_spurr of the target CPU when any one of them is read via
sysfs. These cached values will be presented if any of these sysfs are
read within the next 10ms. If these sysfs files are read after 10ms
from the earlier IPI, a fresh IPI is issued to read and cache the
values again. This minimizes the number of IPIs required to be sent
when these values are read back-to-back via the sysfs interface.

Without patch 6/6 (Without caching): 
 16 [XICS 2 Edge IPI] = 422 times
 DBL [Doorbell interrupts] = 13 times
 Total : 435 IPIs.

With patch 6/6 (With caching):
  16 [XICS 2 Edge IPI] = 111 times
  DBL [Doorbell interrupts] = 17 times
  Total : 128 IPIs.

These patches are required for enhancement to the lparstat utility
that compute the CPU utilization based on PURR and SPURR which can be
found here :
https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4


With the patches, when lparstat is run on a LPAR running CPU-Hogs,
=
sudo ./src/lparstat -E 1 3
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
 99.99   0.00  3.35GHz[111%] 110.99   0.00
100.00   0.00  3.35GHz[111%] 111.00   0.00
100.00   0.00  3.35GHz[111%] 111.00   0.00

With patches, when lparstat is run on and idle LPAR
=
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
0.20  99.81  2.17GHz[ 72%]   0.19  71.82
0.42  99.58  2.11GHz[ 70%]   0.31  69.69
0.41  99.59  2.11GHz[ 70%]   0.31  69.69

Gautham R. Shenoy (6):
  powerpc: Move idle_loop_prolog()/epilog() functions to header file
  powerpc/idle: Add accessor function to always read latest idle PURR
  powerpc/pseries: Account for SPURR ticks on idle CPUs
  powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
  Documentation: Document sysfs interfaces purr, spurr, idle_purr,
idle_spurr
  pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr

 Documentation/ABI/testing/sysfs-devices-system-cpu |  39 +
 arch/powerpc/include/asm/idle.h|  93 
 arch/powerpc/kernel/sysfs.c| 167 -
 arch/powerpc/platforms/pseries/setup.c |   8 +-
 drivers/cpuidle/cpuidle-pseries.c  |  39 +
 5 files changed, 305 insertions(+), 41 deletions(-)
 create mode 100644 arch/powerpc/include/asm/idle.h

-- 
1.9.4

[PATCH v4 2/6] powerpc/idle: Add accessor function to always read latest idle PURR

2020-03-27 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently when CPU goes idle, we take a snapshot of PURR via
pseries_idle_prolog() which is used at the CPU idle exit to compute
the idle PURR cycles via the function pseries_idle_epilog().  Thus,
the value of idle PURR cycle thus read before pseries_idle_prolog() and
after pseries_idle_epilog() is always correct.

However, if we were to read the idle PURR cycles from an interrupt
context between pseries_idle_prolog() and pseries_idle_epilog() (this will
be done in a future patch), then, the value of the idle PURR thus read
will not include the cycles spent in the most recent idle period.

This patch addresses the issue by providing accessor function to read
the idle PURR such such that it includes the cycles spent in the most
recent idle period, if we read it between pseries_idle_prolog() and
pseries_idle_epilog(). In order to achieve it, the patch saves the
snapshot of PURR in pseries_idle_prolog() in a per-cpu variable,
instead of on the stack, so that it can be accessed from an interrupt
context.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 47 +++---
 arch/powerpc/platforms/pseries/setup.c |  7 +++--
 drivers/cpuidle/cpuidle-pseries.c  | 15 +--
 3 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
index 32064a4c..d4bfb6a 100644
--- a/arch/powerpc/include/asm/idle.h
+++ b/arch/powerpc/include/asm/idle.h
@@ -5,10 +5,27 @@
 #include 
 
 #ifdef CONFIG_PPC_PSERIES
-static inline void pseries_idle_prolog(unsigned long *in_purr)
+DECLARE_PER_CPU(u64, idle_entry_purr_snap);
+
+static inline void snapshot_purr_idle_entry(void)
+{
+   *this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
+}
+
+static inline void update_idle_purr_accounting(void)
+{
+   u64 wait_cycles;
+   u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap);
+
+   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
+   wait_cycles += mfspr(SPRN_PURR) - in_purr;
+   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+}
+
+static inline void pseries_idle_prolog(void)
 {
ppc64_runlatch_off();
-   *in_purr = mfspr(SPRN_PURR);
+   snapshot_purr_idle_entry();
/*
 * Indicate to the HV that we are idle. Now would be
 * a good time to find other work to dispatch.
@@ -16,16 +33,28 @@ static inline void pseries_idle_prolog(unsigned long 
*in_purr)
get_lppaca()->idle = 1;
 }
 
-static inline void pseries_idle_epilog(unsigned long in_purr)
+static inline void pseries_idle_epilog(void)
 {
-   u64 wait_cycles;
-
-   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
-   wait_cycles += mfspr(SPRN_PURR) - in_purr;
-   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+   update_idle_purr_accounting();
get_lppaca()->idle = 0;
-
ppc64_runlatch_on();
 }
+
+static inline u64 read_this_idle_purr(void)
+{
+   /*
+* If we are reading from an idle context, update the
+* idle-purr cycles corresponding to the last idle period.
+* Since the idle context is not yet over, take a fresh
+* snapshot of the idle-purr.
+*/
+   if (unlikely(get_lppaca()->idle == 1)) {
+   update_idle_purr_accounting();
+   snapshot_purr_idle_entry();
+   }
+
+   return be64_to_cpu(get_lppaca()->wait_state_cycles);
+}
+
 #endif /* CONFIG_PPC_PSERIES */
 #endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 2f53e6b..4905c96 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -318,10 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+DEFINE_PER_CPU(u64, idle_entry_purr_snap);
 static void pseries_lpar_idle(void)
 {
-   unsigned long in_purr;
-
/*
 * Default handler to go into low thread priority and possibly
 * low power mode by ceding processor to hypervisor
@@ -331,7 +330,7 @@ static void pseries_lpar_idle(void)
return;
 
/* Indicate to hypervisor that we are idle. */
-   pseries_idle_prolog(&in_purr);
+   pseries_idle_prolog();
 
/*
 * Yield the processor to the hypervisor.  We return if
@@ -342,7 +341,7 @@ static void pseries_lpar_idle(void)
 */
cede_processor();
 
-   pseries_idle_epilog(in_purr);
+   pseries_idle_epilog();
 }
 
 /*
diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 46d5e05..6513ef2 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -36,12 +36,11 @@ static int snooze_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,

[PATCH v4 3/6] powerpc/pseries: Account for SPURR ticks on idle CPUs

2020-03-27 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

Via pseries_idle_prolog(), pseries_idle_epilog(), we track the idle
PURR ticks in the VPA variable "wait_state_cycles". This patch extends
the support to account for the idle SPURR ticks. It also provides an
accessor function to accurately reads idle SPURR ticks.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 33 +
 arch/powerpc/platforms/pseries/setup.c |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
index d4bfb6a..accd1f5 100644
--- a/arch/powerpc/include/asm/idle.h
+++ b/arch/powerpc/include/asm/idle.h
@@ -5,13 +5,20 @@
 #include 
 
 #ifdef CONFIG_PPC_PSERIES
+DECLARE_PER_CPU(u64, idle_spurr_cycles);
 DECLARE_PER_CPU(u64, idle_entry_purr_snap);
+DECLARE_PER_CPU(u64, idle_entry_spurr_snap);
 
 static inline void snapshot_purr_idle_entry(void)
 {
*this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
 }
 
+static inline void snapshot_spurr_idle_entry(void)
+{
+   *this_cpu_ptr(&idle_entry_spurr_snap) = mfspr(SPRN_SPURR);
+}
+
 static inline void update_idle_purr_accounting(void)
 {
u64 wait_cycles;
@@ -22,10 +29,19 @@ static inline void update_idle_purr_accounting(void)
get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
 }
 
+static inline void update_idle_spurr_accounting(void)
+{
+   u64 *idle_spurr_cycles_ptr = this_cpu_ptr(&idle_spurr_cycles);
+   u64 in_spurr = *this_cpu_ptr(&idle_entry_spurr_snap);
+
+   *idle_spurr_cycles_ptr += mfspr(SPRN_SPURR) - in_spurr;
+}
+
 static inline void pseries_idle_prolog(void)
 {
ppc64_runlatch_off();
snapshot_purr_idle_entry();
+   snapshot_spurr_idle_entry();
/*
 * Indicate to the HV that we are idle. Now would be
 * a good time to find other work to dispatch.
@@ -36,6 +52,7 @@ static inline void pseries_idle_prolog(void)
 static inline void pseries_idle_epilog(void)
 {
update_idle_purr_accounting();
+   update_idle_spurr_accounting();
get_lppaca()->idle = 0;
ppc64_runlatch_on();
 }
@@ -56,5 +73,21 @@ static inline u64 read_this_idle_purr(void)
return be64_to_cpu(get_lppaca()->wait_state_cycles);
 }
 
+static inline u64 read_this_idle_spurr(void)
+{
+   /*
+* If we are reading from an idle context, update the
+* idle-spurr cycles corresponding to the last idle period.
+* Since the idle context is not yet over, take a fresh
+* snapshot of the idle-spurr.
+*/
+   if (get_lppaca()->idle == 1) {
+   update_idle_spurr_accounting();
+   snapshot_spurr_idle_entry();
+   }
+
+   return *this_cpu_ptr(&idle_spurr_cycles);
+}
+
 #endif /* CONFIG_PPC_PSERIES */
 #endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 4905c96..1b55e80 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -318,7 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+DEFINE_PER_CPU(u64, idle_spurr_cycles);
 DEFINE_PER_CPU(u64, idle_entry_purr_snap);
+DEFINE_PER_CPU(u64, idle_entry_spurr_snap);
 static void pseries_lpar_idle(void)
 {
/*
-- 
1.9.4

[PATCH v4 1/6] powerpc: Move idle_loop_prolog()/epilog() functions to header file

2020-03-27 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently prior to entering an idle state on a Linux Guest, the
pseries cpuidle driver implement an idle_loop_prolog() and
idle_loop_epilog() functions which ensure that idle_purr is correctly
computed, and the hypervisor is informed that the CPU cycles have been
donated.

These prolog and epilog functions are also required in the default
idle call, i.e pseries_lpar_idle(). Hence move these accessor
functions to a common header file and call them from
pseries_lpar_idle(). Since the existing header files such as
asm/processor.h have enough clutter, create a new header file
asm/idle.h. Finally rename idle_loop_prolog() and idle_loop_epilog()
to pseries_idle_prolog() and pseries_idle_epilog() as they are only
relavent for on pseries guests.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 31 +
 arch/powerpc/platforms/pseries/setup.c |  7 +--
 drivers/cpuidle/cpuidle-pseries.c  | 36 +++---
 3 files changed, 43 insertions(+), 31 deletions(-)
 create mode 100644 arch/powerpc/include/asm/idle.h

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
new file mode 100644
index 000..32064a4c
--- /dev/null
+++ b/arch/powerpc/include/asm/idle.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_IDLE_H
+#define _ASM_POWERPC_IDLE_H
+#include 
+#include 
+
+#ifdef CONFIG_PPC_PSERIES
+static inline void pseries_idle_prolog(unsigned long *in_purr)
+{
+   ppc64_runlatch_off();
+   *in_purr = mfspr(SPRN_PURR);
+   /*
+* Indicate to the HV that we are idle. Now would be
+* a good time to find other work to dispatch.
+*/
+   get_lppaca()->idle = 1;
+}
+
+static inline void pseries_idle_epilog(unsigned long in_purr)
+{
+   u64 wait_cycles;
+
+   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
+   wait_cycles += mfspr(SPRN_PURR) - in_purr;
+   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+   get_lppaca()->idle = 0;
+
+   ppc64_runlatch_on();
+}
+#endif /* CONFIG_PPC_PSERIES */
+#endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 0c8421d..2f53e6b 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -319,6 +320,8 @@ static int alloc_dispatch_log_kmem_cache(void)
 
 static void pseries_lpar_idle(void)
 {
+   unsigned long in_purr;
+
/*
 * Default handler to go into low thread priority and possibly
 * low power mode by ceding processor to hypervisor
@@ -328,7 +331,7 @@ static void pseries_lpar_idle(void)
return;
 
/* Indicate to hypervisor that we are idle. */
-   get_lppaca()->idle = 1;
+   pseries_idle_prolog(&in_purr);
 
/*
 * Yield the processor to the hypervisor.  We return if
@@ -339,7 +342,7 @@ static void pseries_lpar_idle(void)
 */
cede_processor();
 
-   get_lppaca()->idle = 0;
+   pseries_idle_epilog(in_purr);
 }
 
 /*
diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 74c2479..46d5e05 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct cpuidle_driver pseries_idle_driver = {
@@ -31,29 +32,6 @@ struct cpuidle_driver pseries_idle_driver = {
 static u64 snooze_timeout __read_mostly;
 static bool snooze_timeout_en __read_mostly;
 
-static inline void idle_loop_prolog(unsigned long *in_purr)
-{
-   ppc64_runlatch_off();
-   *in_purr = mfspr(SPRN_PURR);
-   /*
-* Indicate to the HV that we are idle. Now would be
-* a good time to find other work to dispatch.
-*/
-   get_lppaca()->idle = 1;
-}
-
-static inline void idle_loop_epilog(unsigned long in_purr)
-{
-   u64 wait_cycles;
-
-   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
-   wait_cycles += mfspr(SPRN_PURR) - in_purr;
-   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
-   get_lppaca()->idle = 0;
-
-   ppc64_runlatch_on();
-}
-
 static int snooze_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
@@ -63,7 +41,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 
set_thread_flag(TIF_POLLING_NRFLAG);
 
-   idle_loop_prolog(&in_purr);
+   pseries_idle_prolog(&in_purr);
local_irq_enable();
snooze_exit_time = get_tb() + snooze_timeout;
 
@@ -87,7 +65,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 
local_irq_disable();
 
-   idle_loop_epilog(in_purr);
+   pseries_idle_epilog(in_purr);

[PATCH v4 5/6] Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr

2020-03-27 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Add documentation for the following sysfs interfaces:
/sys/devices/system/cpu/cpuX/purr
/sys/devices/system/cpu/cpuX/spurr
/sys/devices/system/cpu/cpuX/idle_purr
/sys/devices/system/cpu/cpuX/idle_spurr

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 39 ++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 2e0e3b4..bc07677 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -580,3 +580,42 @@ Description:   Secure Virtual Machine
If 1, it means the system is using the Protected Execution
Facility in POWER9 and newer processors. i.e., it is a Secure
Virtual Machine.
+
+What:  /sys/devices/system/cpu/cpuX/purr
+Date:  Apr 2005
+Contact:   Linux for PowerPC mailing list 
+Description:   PURR ticks for this CPU since the system boot.
+
+   The Processor Utilization Resources Register (PURR) is
+   a 64-bit counter which provides an estimate of the
+   resources used by the CPU thread. The contents of this
+   register increases monotonically. This sysfs interface
+   exposes the number of PURR ticks for cpuX.
+
+What:  /sys/devices/system/cpu/cpuX/spurr
+Date:  Dec 2006
+Contact:   Linux for PowerPC mailing list 
+Description:   SPURR ticks for this CPU since the system boot.
+
+   The Scaled Processor Utilization Resources Register
+   (SPURR) is a 64-bit counter that provides a frequency
+   invariant estimate of the resources used by the CPU
+   thread. The contents of this register increases
+   monotonically. This sysfs interface exposes the number
+   of SPURR ticks for cpuX.
+
+What:  /sys/devices/system/cpu/cpuX/idle_purr
+Date:  Mar 2020
+Contact:   Linux for PowerPC mailing list 
+Description:   PURR ticks for cpuX when it was idle.
+
+   This sysfs interface exposes the number of PURR ticks
+   for cpuX when it was idle.
+
+What:  /sys/devices/system/cpu/cpuX/idle_spurr
+Date:  Mar 2020
+Contact:   Linux for PowerPC mailing list 
+Description:   SPURR ticks for cpuX when it was idle.
+
+   This sysfs interface exposes the number of SPURR ticks
+   for cpuX when it was idle.
-- 
1.9.4

[PATCH v4 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr

2020-03-27 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently purr, spurr, idle_purr, idle_spurr are exposed for every CPU
via the sysfs interface
/sys/devices/system/cpu/cpuX/[idle_][s]purr. Each sysfs read currently
generates an IPI to obtain the desired value from the target CPU X.
Since these aforementioned sysfs are typically read one after another,
we end up generating 4 IPIs per CPU in a short duration.

In order to minimize the IPI noise, this patch caches the values of
all the four entities whenever one of them is read. If subsequently
any of these are read within the next 10ms, the cached value is
returned. With this, we will generate at most one IPI every 10ms for
every CPU.

Test-results: While reading the four sysfs files back-to-back for a
given CPU every second for 100 seconds.

Without the patch:
 16 [XICS 2 Edge IPI] = 422 times
 DBL [Doorbell interrupts] = 13 times
 Total : 435 IPIs.

With the patch:
  16 [XICS 2 Edge IPI] = 111 times
  DBL [Doorbell interrupts] = 17 times
  Total : 128 IPIs.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/sysfs.c | 117 
 1 file changed, 97 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 571b325..bd92023 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -586,8 +586,6 @@ void ppc_enable_pmcs(void)
  * SPRs which are not related to PMU.
  */
 #ifdef CONFIG_PPC64
-SYSFS_SPRSETUP(purr, SPRN_PURR);
-SYSFS_SPRSETUP(spurr, SPRN_SPURR);
 SYSFS_SPRSETUP(pir, SPRN_PIR);
 SYSFS_SPRSETUP(tscr, SPRN_TSCR);
 
@@ -596,8 +594,6 @@ void ppc_enable_pmcs(void)
   enable write when needed with a separate function.
   Lets be conservative and default to pseries.
 */
-static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
-static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
 static DEVICE_ATTR(pir, 0400, show_pir, NULL);
 static DEVICE_ATTR(tscr, 0600, show_tscr, store_tscr);
 #endif /* CONFIG_PPC64 */
@@ -761,22 +757,110 @@ static void create_svm_file(void)
 }
 #endif /* CONFIG_PPC_SVM */
 
+#ifdef CONFIG_PPC64
+/*
+ * The duration (in ms) from the last IPI to the target CPU until
+ * which a cached value of purr, spurr, idle_purr, idle_spurr can be
+ * reported to the user on a corresponding sysfs file read. Beyond
+ * this duration, fresh values need to be obtained by sending IPIs to
+ * the target CPU when the sysfs files are read.
+ */
+static unsigned long util_stats_staleness_tolerance_ms = 10;
+struct util_acct_stats {
+   u64 latest_purr;
+   u64 latest_spurr;
+#ifdef CONFIG_PPC_PSERIES
+   u64 latest_idle_purr;
+   u64 latest_idle_spurr;
+#endif
+   unsigned long last_update_jiffies;
+};
+
+DEFINE_PER_CPU(struct util_acct_stats, util_acct_stats);
+
+static void update_util_acct_stats(void *ptr)
+{
+   struct util_acct_stats *stats = ptr;
+
+   stats->latest_purr = mfspr(SPRN_PURR);
+   stats->latest_spurr = mfspr(SPRN_SPURR);
 #ifdef CONFIG_PPC_PSERIES
-static void read_idle_purr(void *val)
+   stats->latest_idle_purr = read_this_idle_purr();
+   stats->latest_idle_spurr = read_this_idle_spurr();
+#endif
+   stats->last_update_jiffies = jiffies;
+}
+
+struct util_acct_stats *get_util_stats_ptr(int cpu)
+{
+   struct util_acct_stats *stats = per_cpu_ptr(&util_acct_stats, cpu);
+   unsigned long delta_jiffies;
+
+   delta_jiffies = jiffies - stats->last_update_jiffies;
+
+   /*
+* If we have a recent enough data, reuse that instead of
+* sending an IPI.
+*/
+   if (jiffies_to_msecs(delta_jiffies) < util_stats_staleness_tolerance_ms)
+   return stats;
+
+   smp_call_function_single(cpu, update_util_acct_stats, stats, 1);
+   return stats;
+}
+
+static ssize_t show_purr(struct device *dev,
+struct device_attribute *attr, char *buf)
 {
-   u64 *ret = val;
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   struct util_acct_stats *stats;
 
-   *ret = read_this_idle_purr();
+   stats = get_util_stats_ptr(cpu->dev.id);
+   return sprintf(buf, "%llx\n", stats->latest_purr);
 }
 
+static void write_purr(void *val)
+{
+   mtspr(SPRN_PURR, *(unsigned long *)val);
+}
+
+static ssize_t __used store_purr(struct device *dev,
+struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   unsigned long val;
+   int ret = kstrtoul(buf, 16, &val);
+
+   if (ret != 0)
+   return -EINVAL;
+
+   smp_call_function_single(cpu->dev.id, write_purr, &val, 1);
+   return count;
+}
+static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
+
+static ssize_t show_spurr(struct device *dev,
+

[PATCH v4 4/6] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-03-27 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

The total PURR and SPURR ticks are already exposed via the per-cpu
sysfs files "purr" and "spurr". This patch adds support for exposing
the idle PURR and SPURR ticks via new per-cpu sysfs files named
"idle_purr" and "idle_spurr".

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/sysfs.c | 82 +++--
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 479c706..571b325 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "cacheinfo.h"
@@ -760,6 +761,74 @@ static void create_svm_file(void)
 }
 #endif /* CONFIG_PPC_SVM */
 
+#ifdef CONFIG_PPC_PSERIES
+static void read_idle_purr(void *val)
+{
+   u64 *ret = val;
+
+   *ret = read_this_idle_purr();
+}
+
+static ssize_t idle_purr_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   u64 val;
+
+   smp_call_function_single(cpu->dev.id, read_idle_purr, &val, 1);
+   return sprintf(buf, "%llx\n", val);
+}
+static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL);
+
+static void create_idle_purr_file(struct device *s)
+{
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_create_file(s, &dev_attr_idle_purr);
+}
+
+static void remove_idle_purr_file(struct device *s)
+{
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_remove_file(s, &dev_attr_idle_purr);
+}
+
+static void read_idle_spurr(void *val)
+{
+   u64 *ret = val;
+
+   *ret = read_this_idle_spurr();
+}
+
+static ssize_t idle_spurr_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   u64 val;
+
+   smp_call_function_single(cpu->dev.id, read_idle_spurr, &val, 1);
+   return sprintf(buf, "%llx\n", val);
+}
+static DEVICE_ATTR(idle_spurr, 0400, idle_spurr_show, NULL);
+
+static void create_idle_spurr_file(struct device *s)
+{
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_create_file(s, &dev_attr_idle_spurr);
+}
+
+static void remove_idle_spurr_file(struct device *s)
+{
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_remove_file(s, &dev_attr_idle_spurr);
+}
+
+#else /* CONFIG_PPC_PSERIES */
+#define create_idle_purr_file(s)
+#define remove_idle_purr_file(s)
+#define create_idle_spurr_file(s)
+#define remove_idle_spurr_file(s)
+#endif /* CONFIG_PPC_PSERIES */
+
 static int register_cpu_online(unsigned int cpu)
 {
struct cpu *c = &per_cpu(cpu_devices, cpu);
@@ -823,10 +892,13 @@ static int register_cpu_online(unsigned int cpu)
if (!firmware_has_feature(FW_FEATURE_LPAR))
add_write_permission_dev_attr(&dev_attr_purr);
device_create_file(s, &dev_attr_purr);
+   create_idle_purr_file(s);
}
 
-   if (cpu_has_feature(CPU_FTR_SPURR))
+   if (cpu_has_feature(CPU_FTR_SPURR)) {
device_create_file(s, &dev_attr_spurr);
+   create_idle_spurr_file(s);
+   }
 
if (cpu_has_feature(CPU_FTR_DSCR))
device_create_file(s, &dev_attr_dscr);
@@ -910,11 +982,15 @@ static int unregister_cpu_online(unsigned int cpu)
device_remove_file(s, &dev_attr_mmcra);
 #endif /* CONFIG_PMU_SYSFS */
 
-   if (cpu_has_feature(CPU_FTR_PURR))
+   if (cpu_has_feature(CPU_FTR_PURR)) {
device_remove_file(s, &dev_attr_purr);
+   remove_idle_purr_file(s);
+   }
 
-   if (cpu_has_feature(CPU_FTR_SPURR))
+   if (cpu_has_feature(CPU_FTR_SPURR)) {
device_remove_file(s, &dev_attr_spurr);
+   remove_idle_spurr_file(s);
+   }
 
if (cpu_has_feature(CPU_FTR_DSCR))
device_remove_file(s, &dev_attr_dscr);
-- 
1.9.4

[RFC/PATCH 1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.

2020-03-31 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

If a guest executes a stop instruction when the hypervisor has set the
PSSCR[ESL|EC] bits, the processor will throw an Hypervisor Facility
Unavailable exception. Currently when we receive this exception, we
only check if the exeception is generated due to a doorbell
instruction, in which case we emulate it. For all other cases,
including the case when the guest executes a stop-instruction, the
hypervisor sends a PROGILL to the guest program, which results in a
guest crash.

This patch adds code to handle the case when the hypervisor receives a
H_FAC_UNAVAIL exception due to guest executing the stop
instruction. The hypervisor increments the pc to the next instruction
and resumes the guest as expected by the semantics of the
PSSCR[ESL|EC] = 0 stop instruction.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/reg.h | 1 +
 arch/powerpc/kvm/book3s_hv.c   | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index da5cab0..2568c18 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -399,6 +399,7 @@
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_SCV_LG12  /* Enable System Call Vectored */
 #define FSCR_MSGP_LG   10  /* Enable MSGP */
+#define FSCR_STOP_LG9   /* Enable stop states */
 #define FSCR_TAR_LG8   /* Enable Target Address Register */
 #define FSCR_EBB_LG7   /* Enable Event Based Branching */
 #define FSCR_TM_LG 5   /* Enable Transactional Memory */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 33be4d9..cdb7224 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1419,7 +1419,11 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
if (((vcpu->arch.hfscr >> 56) == FSCR_MSGP_LG) &&
cpu_has_feature(CPU_FTR_ARCH_300))
r = kvmppc_emulate_doorbell_instr(vcpu);
-   if (r == EMULATE_FAIL) {
+   else if (((vcpu->arch.hfscr >> 56) == FSCR_STOP_LG) &&
+   cpu_has_feature(CPU_FTR_ARCH_300)) {
+   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
+   r = RESUME_GUEST;
+   } else if (r == EMULATE_FAIL) {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
r = RESUME_GUEST;
}
-- 
1.9.4

[RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry

2020-03-31 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

ISA v3.0 allows the guest to execute a stop instruction. For this, the
PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
scheduling in the guest vCPU.

Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
set. This patch changes the behaviour to enter the guest with
PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
unconditionally clear these bits. Ideally this should be done
conditionally on platforms where the guest stop instruction has no
Bugs (starting POWER9 DD2.3).

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kvm/book3s_hv.c|  2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cdb7224..36d059a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
mtspr(SPRN_IC, vcpu->arch.ic);
mtspr(SPRN_PID, vcpu->arch.pid);
 
-   mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+   mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
  (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 
mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index dbc2fec..c2daec3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
mtspr   SPRN_PID, r7
mtspr   SPRN_WORT, r8
 BEGIN_FTR_SECTION
+   /* POWER9-only registers */
+   ld  r5, VCPU_TID(r4)
+   ld  r6, VCPU_PSSCR(r4)
+   lbz r8, HSTATE_FAKE_SUSPEND(r13)
+   lis r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
+   andcr6, r6, r7
+   rldimi  r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
+   ld  r7, VCPU_HFSCR(r4)
+   mtspr   SPRN_TIDR, r5
+   mtspr   SPRN_PSSCR, r6
+   mtspr   SPRN_HFSCR, r7
+FTR_SECTION_ELSE
/* POWER8-only registers */
ld  r5, VCPU_TCSCR(r4)
ld  r6, VCPU_ACOP(r4)
@@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
mtspr   SPRN_CSIGR, r7
mtspr   SPRN_TACR, r8
nop
-FTR_SECTION_ELSE
-   /* POWER9-only registers */
-   ld  r5, VCPU_TID(r4)
-   ld  r6, VCPU_PSSCR(r4)
-   lbz r8, HSTATE_FAKE_SUSPEND(r13)
-   orisr6, r6, PSSCR_EC@h  /* This makes stop trap to HV */
-   rldimi  r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
-   ld  r7, VCPU_HFSCR(r4)
-   mtspr   SPRN_TIDR, r5
-   mtspr   SPRN_PSSCR, r6
-   mtspr   SPRN_HFSCR, r7
-ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 8:
 
ld  r5, VCPU_SPRG0(r4)
-- 
1.9.4

[RFC/PATCH 0/3] Add support for stop instruction inside KVM guest

2020-03-31 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 


 *** RFC Only. Not intended for inclusion 
 
Motivation
~~~

The POWER ISA v3.0 allows stop instruction to be executed from a Guest
Kernel (HV=0,PR=0) context. If the hypervisor has cleared
PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause
the vCPU thread to "pause", thereby donating its cycles to the other
threads in the core until the paused thread is woken up by an
interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then
execution of the "stop" instruction will raise a Hypervisor Facility
Unavailable exception.

The stop idle state in the guest (henceforth referred to as stop0lite)
when enabled

* has a very small wakeup latency (1-3us) comparable to that of
  snooze and considerably better compared the Shared CEDE state
  (25-30us).  Results are provided below for wakeup latency measured
  by waking up an idle CPU in a given state using a timer as well as
  using an IPI.

  ==
  Wakeup Latency measured using a timer (in ns) [Lower is better]
  ==
  Idle state |  Nr samples |  Min| Max| Median | Avg   | Stddev|
  ==
  snooze |   60|  787| 1059   |  938   | 937.4 | 42.27 |
  ==
  stop0lite  |   60|  770| 1182   |  948   | 946.4 | 67.41 |
  ==
  Shared CEDE|   60| 9550| 36694  | 29219  |28564.1|3545.9 |
  ==

  ==
  Wakeup Latency measured using a timer (in ns) [Lower is better]
  ==
  Idle state |  Nr samples |  Min| Max| Median | Avg   | Stddev|
  ==
  snooze |   60|  787| 1059   |  938   | 937.4 | 42.27 |
  ==
  stop0lite  |   60|  770| 1182   |  948   | 946.4 | 67.41 |
  ==
  Shared CEDE|   60| 9550| 36694  | 29219  |28564.1|3545.9 |
  ==

* provides an improved single threaded performance compared to snooze
  since the idle state completely relinquishes the core cycles. The
  single threaded performance is observed to be better even when
  compared to "Shared CEDE", since in the latter case something else
  can scheduled on the ceded CPU, while "stop0lite" doesn't give up
  the CPU.

  On a KVM guest with smp 8,sockets=1,cores=2,threads=4 with vCPUs of
  a vCore bound to a physical core, we run a single-threaded ebizzy
  pinned to one of the guest vCPUs while the sibling vCPUs in the core
  are idling. We enable only one guest idle state at a time to measure
  the single-threaded performance benefit that the idle state provides
  by giving up the core resources to the non-idle thread. we obtain
  ~13% improvement in the throughput compared to that with "snooze"
  and ~8% improvement in the throughput compared to "Shared CEDE".
   
   ===
   | ebizzy records/s : [Higher the better]  |
   ===
   |Idle state |  Nr|  Min| Max| Median | Avg | Stddev   |
   |   |samples | ||| |  |
   ===  
   
   |snooze |   10   |  1378988| 1379358| 1379032|1379067.3|113.47|
   ===
   |stop0lite  |   10   |  1561836| 1562058| 1561906|1561927.5| 81.87|
   ===
   |Shared CEDE|   10   |  1446584| 1447383| 1447037|1447009.0|244.16|
   ===

Is stop0lite a replacement for snooze ?
~~~
Not yet. snooze is a polling state, and can respond much faster to a
need_resched() compared to stop0lite which needs an IPI to wakeup from
idle state. This can be seen in the results below:

With the context_switch2 pipe test, we can see that with stop0lite,
the number of context switches are 32.47% lesser than with
snooze. This is due to the fact that snooze is a polling state which
polls for TIF_NEED_RESCHED. Thus it does not require an interrupt t

[RFC/PATCH 3/3] cpuidle/pseries: Add stop0lite state

2020-03-31 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

The POWER ISA v3.0 allows stop instruction to be executed from a
HV=0,PR=0 context. If the PSSCR[ESL|EC] bits are cleared, then the
stop instruction thus executed will cause the thread to pause, thereby
donating its cycles to the other threads in the core until the paused
thread is woken up by an interrupt.

In this patch we define a cpuidle state for pseries guests named
stop0lite. This has a latency and residency intermediate to that of
snooze and CEDE. While snooze has non-existent latency, it consumes
the CPU cycles without contributing to anything useful. CEDE on the
other hand requires a full VM exit, which can result in some other
vCPU being scheduled on this physical CPU thereby delaying the
scheduling of the CEDEd vCPU back. In such cases, when the expected
idle duration is small (1-20us), the vCPU can go to this stop0lite
state which provides a nice intermediate state between snooze and
CEDE.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 74c2479..9c8c18d 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
@@ -170,6 +171,26 @@ static int shared_cede_loop(struct cpuidle_device *dev,
.enter = &dedicated_cede_loop },
 };
 
+
+
+static int stop_loop(struct cpuidle_device *dev,
+struct cpuidle_driver *drv,
+int index)
+{
+   unsigned long srr1 = 0;
+
+   if (!prep_irq_for_idle_irqsoff())
+   return index;
+
+   __ppc64_runlatch_off();
+   asm volatile("stop");
+   __ppc64_runlatch_on();
+   fini_irq_for_idle_irqsoff();
+   irq_set_pending_from_srr1(srr1);
+
+   return index;
+}
+
 /*
  * States for shared partition case.
  */
@@ -180,6 +201,12 @@ static int shared_cede_loop(struct cpuidle_device *dev,
.exit_latency = 0,
.target_residency = 0,
.enter = &snooze_loop },
+   { /* stop0_lite */
+   .name = "stop0lite",
+   .desc = "Pauses the CPU",
+   .exit_latency = 2,
+   .target_residency=20,
+   .enter = &stop_loop },
{ /* Shared Cede */
.name = "Shared Cede",
.desc = "Shared Cede",
-- 
1.9.4

Re: [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest

2020-03-31 Thread Gautham R Shenoy

On Tue, Mar 31, 2020 at 05:40:55PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> 
>  *** RFC Only. Not intended for inclusion 
> 
> Motivation
> ~~~
> 
> The POWER ISA v3.0 allows stop instruction to be executed from a Guest
> Kernel (HV=0,PR=0) context. If the hypervisor has cleared
> PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause
> the vCPU thread to "pause", thereby donating its cycles to the other
> threads in the core until the paused thread is woken up by an
> interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then
> execution of the "stop" instruction will raise a Hypervisor Facility
> Unavailable exception.
> 
> The stop idle state in the guest (henceforth referred to as stop0lite)
> when enabled
> 
> * has a very small wakeup latency (1-3us) comparable to that of
>   snooze and considerably better compared the Shared CEDE state
>   (25-30us).  Results are provided below for wakeup latency measured
>   by waking up an idle CPU in a given state using a timer as well as
>   using an IPI.
> 
>   ==
>   Wakeup Latency measured using a timer (in ns) [Lower is better]
>   ==
>   Idle state |  Nr samples |  Min| Max| Median | Avg   | Stddev|
>   ==
>   snooze |   60|  787| 1059   |  938   | 937.4 | 42.27 |
>   ==
>   stop0lite  |   60|  770| 1182   |  948   | 946.4 | 67.41 |
>   ==
>   Shared CEDE|   60| 9550| 36694  | 29219  |28564.1|3545.9 |
>   ==
>

Posted two copies of Wakeup latency measured by timer. Here is the
wakeup latency measured using an IPI.


==
Wakeup latency measured using an IPI (in ns) [Lower is better]
==
Idle state |  Nr|  Min| Max| Median | Avg | Stddev   |
   |samples | ||| |  |
--
snooze |   60   | 2089|4228|2259|  2342.31|316.56|
--
stop0lite  |   60   | 1947|3674|2653|  2610.57|266.73|
--
Shared CEDE|   60   |20147|   36305|   21827| 26762.65|   6875.01|
==

--
Thanks and Regards
gautham.

Re: [PATCH v4 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr

2020-04-01 Thread Gautham R Shenoy

Hello Naveen,


On Wed, Apr 01, 2020 at 03:28:48PM +0530, Naveen N. Rao wrote:
> Gautham R. Shenoy wrote:
> >From: "Gautham R. Shenoy" 
> >
 [..snip..]

> >-static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
> >-static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
> > static DEVICE_ATTR(pir, 0400, show_pir, NULL);
> > static DEVICE_ATTR(tscr, 0600, show_tscr, store_tscr);
> > #endif /* CONFIG_PPC64 */
> >@@ -761,22 +757,110 @@ static void create_svm_file(void)
> > }
> > #endif /* CONFIG_PPC_SVM */
> >
> >+#ifdef CONFIG_PPC64
> >+/*
> >+ * The duration (in ms) from the last IPI to the target CPU until
> >+ * which a cached value of purr, spurr, idle_purr, idle_spurr can be
> >+ * reported to the user on a corresponding sysfs file read. Beyond
> >+ * this duration, fresh values need to be obtained by sending IPIs to
> >+ * the target CPU when the sysfs files are read.
> >+ */
> >+static unsigned long util_stats_staleness_tolerance_ms = 10;
> 
> This is a nice optimization for our use in lparstat, though I have a concern
> below.
> 
> >+struct util_acct_stats {
> >+u64 latest_purr;
> >+u64 latest_spurr;
> >+#ifdef CONFIG_PPC_PSERIES
> >+u64 latest_idle_purr;
> >+u64 latest_idle_spurr;
> >+#endif
> 
> You can probably drop the 'latest_' prefix.


Sure.

> 
> >+unsigned long last_update_jiffies;
> >+};
> >+
> >+DEFINE_PER_CPU(struct util_acct_stats, util_acct_stats);
> 
> Per snowpatch, this should be static, and so should get_util_stats_ptr()
> below:
> https://openpower.xyz/job/snowpatch/job/snowpatch-linux-sparse/16601//artifact/linux/report.txt

Ok, will fix this in v5.

> 
> >+
> >+static void update_util_acct_stats(void *ptr)
> >+{
> >+struct util_acct_stats *stats = ptr;
> >+
> >+stats->latest_purr = mfspr(SPRN_PURR);
> >+stats->latest_spurr = mfspr(SPRN_SPURR);
> > #ifdef CONFIG_PPC_PSERIES
> >-static void read_idle_purr(void *val)
> >+stats->latest_idle_purr = read_this_idle_purr();
> >+stats->latest_idle_spurr = read_this_idle_spurr();
> >+#endif
> >+stats->last_update_jiffies = jiffies;
> >+}
> >+
> >+struct util_acct_stats *get_util_stats_ptr(int cpu)
> >+{
> >+struct util_acct_stats *stats = per_cpu_ptr(&util_acct_stats, cpu);
> >+unsigned long delta_jiffies;
> >+
> >+delta_jiffies = jiffies - stats->last_update_jiffies;
> >+
> >+/*
> >+ * If we have a recent enough data, reuse that instead of
> >+ * sending an IPI.
> >+ */
> >+if (jiffies_to_msecs(delta_jiffies) < util_stats_staleness_tolerance_ms)
> >+return stats;
> >+
> >+smp_call_function_single(cpu, update_util_acct_stats, stats, 1);
> >+return stats;
> >+}
> >+
> >+static ssize_t show_purr(struct device *dev,
> >+ struct device_attribute *attr, char *buf)
> > {
> >-u64 *ret = val;
> >+struct cpu *cpu = container_of(dev, struct cpu, dev);
> >+struct util_acct_stats *stats;
> >
> >-*ret = read_this_idle_purr();
> >+stats = get_util_stats_ptr(cpu->dev.id);
> >+return sprintf(buf, "%llx\n", stats->latest_purr);
> 
> This alters the behavior of the current sysfs purr file. I am not sure if it
> is reasonable to return the same PURR value across a 10ms window.


It does reduce it to 10ms window. I am not sure if anyone samples PURR
etc faster than that rate.

I measured how much time it takes to read the purr, spurr, idle_purr,
idle_spurr files back-to-back. It takes not more than 150us.  From
lparstat will these values be read back-to-back ? If so, we can reduce
the staleness_tolerance to something like 500us and still avoid extra
IPIs. If not, what is the maximum delay between the first sysfs file
read and the last sysfs file read ?

>
> I wonder if we should introduce a sysctl interface to control thresholding.
> It can default to 0, which disables thresholding so that the existing
> behavior continues. Applications (lparstat) can optionally set it to suit
> their use.

We would be introducing 3 new sysfs interfaces that way instead of
two.

/sys/devices/system/cpu/purr_spurr_staleness
/sys/devices/system/cpu/cpuX/idle_purr
/sys/devices/system/cpu/cpuX/idle_spurr

I don't have a problem with this. Nathan, Michael, thoughts on this?


The alternative is to have a procfs interface, something like
/proc/powerpc/resource_util_stats

which gives a listing similar to /proc/stat, i.e

  CPUX

Even in this case, the values can be obtained in one-shot with a
single IPI and be printed in the row corresponding to the CPU.

> 
> - Naveen
> 

--
Thanks and Regards
gautham.

Re: [PATCH v4 2/6] powerpc/idle: Add accessor function to always read latest idle PURR

2020-04-02 Thread Gautham R Shenoy

On Wed, Apr 01, 2020 at 03:12:53PM +0530, Naveen N. Rao wrote:
> Hi Gautham,
> 
> Gautham R. Shenoy wrote:
> >From: "Gautham R. Shenoy" 
> >
> >Currently when CPU goes idle, we take a snapshot of PURR via
> >pseries_idle_prolog() which is used at the CPU idle exit to compute
> >the idle PURR cycles via the function pseries_idle_epilog().  Thus,
> >the value of idle PURR cycle thus read before pseries_idle_prolog() and
> >after pseries_idle_epilog() is always correct.
> >
> >However, if we were to read the idle PURR cycles from an interrupt
> >context between pseries_idle_prolog() and pseries_idle_epilog() (this will
> >be done in a future patch), then, the value of the idle PURR thus read
> >will not include the cycles spent in the most recent idle period.
> >
> >This patch addresses the issue by providing accessor function to read
> >the idle PURR such such that it includes the cycles spent in the most
> >recent idle period, if we read it between pseries_idle_prolog() and
> >pseries_idle_epilog(). In order to achieve it, the patch saves the
> >snapshot of PURR in pseries_idle_prolog() in a per-cpu variable,
> >instead of on the stack, so that it can be accessed from an interrupt
> >context.
> >
> >Signed-off-by: Gautham R. Shenoy 
> >---
> > arch/powerpc/include/asm/idle.h| 47 
> > +++---
> > arch/powerpc/platforms/pseries/setup.c |  7 +++--
> > drivers/cpuidle/cpuidle-pseries.c  | 15 +--
> > 3 files changed, 47 insertions(+), 22 deletions(-)
> >
> >diff --git a/arch/powerpc/include/asm/idle.h 
> >b/arch/powerpc/include/asm/idle.h
> >index 32064a4c..d4bfb6a 100644
> >--- a/arch/powerpc/include/asm/idle.h
> >+++ b/arch/powerpc/include/asm/idle.h
> >@@ -5,10 +5,27 @@
> > #include 
> >
> > #ifdef CONFIG_PPC_PSERIES
> >-static inline void pseries_idle_prolog(unsigned long *in_purr)
> >+DECLARE_PER_CPU(u64, idle_entry_purr_snap);
> >+
> >+static inline void snapshot_purr_idle_entry(void)
> >+{
> >+*this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
> >+}
> >+
> >+static inline void update_idle_purr_accounting(void)
> >+{
> >+u64 wait_cycles;
> >+u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap);
> >+
> >+wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
> >+wait_cycles += mfspr(SPRN_PURR) - in_purr;
> >+get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
> >+}
> >+
> >+static inline void pseries_idle_prolog(void)
> > {
> > ppc64_runlatch_off();
> >-*in_purr = mfspr(SPRN_PURR);
> >+snapshot_purr_idle_entry();
> > /*
> >  * Indicate to the HV that we are idle. Now would be
> >  * a good time to find other work to dispatch.
> >@@ -16,16 +33,28 @@ static inline void pseries_idle_prolog(unsigned long 
> >*in_purr)
> > get_lppaca()->idle = 1;
> > }
> >
> >-static inline void pseries_idle_epilog(unsigned long in_purr)
> >+static inline void pseries_idle_epilog(void)
> > {
> >-u64 wait_cycles;
> >-
> >-wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
> >-wait_cycles += mfspr(SPRN_PURR) - in_purr;
> >-get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
> >+update_idle_purr_accounting();
> > get_lppaca()->idle = 0;
> >-
> > ppc64_runlatch_on();
> > }
> >+
> >+static inline u64 read_this_idle_purr(void)
> >+{
> >+/*
> >+ * If we are reading from an idle context, update the
> >+ * idle-purr cycles corresponding to the last idle period.
> >+ * Since the idle context is not yet over, take a fresh
> >+ * snapshot of the idle-purr.
> >+ */
> >+if (unlikely(get_lppaca()->idle == 1)) {
> >+update_idle_purr_accounting();
> >+snapshot_purr_idle_entry();
> >+}
> >+
> >+return be64_to_cpu(get_lppaca()->wait_state_cycles);
> >+}
> >+
> 
> I think this and read_this_idle_spurr() from the next patch should be moved
> to Patch 4/6, where they are actually used.

The reason I included this function in this patch was to justify why
we were introducing snapshotting the purr values in a global per-cpu
variable instead of on a stack variable. The reason being that someone
might want to read the PURR value from an interrupt context which had
woken up the CPU from idle. At this point, since epilog() function
wasn't called, the idle PURR count corresponding to

Re: [PATCH v4 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr

2020-04-02 Thread Gautham R Shenoy

Hi Naveen,

On Thu, Apr 02, 2020 at 01:04:34PM +0530, Naveen N. Rao wrote:
[..snip..]

> >
> >It does reduce it to 10ms window. I am not sure if anyone samples PURR
> >etc faster than that rate.
> >
> >I measured how much time it takes to read the purr, spurr, idle_purr,
> >idle_spurr files back-to-back. It takes not more than 150us.  From
> >lparstat will these values be read back-to-back ? If so, we can reduce
> >the staleness_tolerance to something like 500us and still avoid extra
> >IPIs. If not, what is the maximum delay between the first sysfs file
> >read and the last sysfs file read ?
> 
> Oh, for lparstat usage, this is perfectly fine.
> 
> I meant that there could be other users of [s]purr who might care. I don't
> know of one, but since this is an existing sysfs interface, I wanted to
> point out that the behavior might change.

Fair point. Perhaps this should be documented in the Documentation, if
we are going to continue with this patch.

> 
> >
> >>
> >>I wonder if we should introduce a sysctl interface to control thresholding.
> >>It can default to 0, which disables thresholding so that the existing
> >>behavior continues. Applications (lparstat) can optionally set it to suit
> >>their use.
> >
> >We would be introducing 3 new sysfs interfaces that way instead of
> >two.
> >
> >/sys/devices/system/cpu/purr_spurr_staleness
> >/sys/devices/system/cpu/cpuX/idle_purr
> >/sys/devices/system/cpu/cpuX/idle_spurr
> >
> >I don't have a problem with this. Nathan, Michael, thoughts on this?
> >
> >
> >The alternative is to have a procfs interface, something like
> >/proc/powerpc/resource_util_stats
> >
> >which gives a listing similar to /proc/stat, i.e
> >
> >  CPUX
> >
> >Even in this case, the values can be obtained in one-shot with a
> >single IPI and be printed in the row corresponding to the CPU.
> 
> Right -- and that would be optimal requiring a single system call, at the
> cost of using a legacy interface.
> 
> The other option would be to drop this patch and to just go with patches 1-5
> introducing the new sysfs interfaces for idle_[s]purr. It isn't entirely
> clear how often this would be used, or its actual impact. We can perhaps
> consider this optimization if and when this causes problems...

I am ok with that. We can revisit the problem if IPI noise becomes
noticable. However, if Nathan or Michael feel that this problem is
better solved now, than leaving it for the future, we will have to
take a call on what the interface is going to be.

> 
> 
> Thanks,
> Naveen
> 

--
Thanks and Regards
gautham.

Re: [PATCH v4 2/6] powerpc/idle: Add accessor function to always read latest idle PURR

2020-04-05 Thread Gautham R Shenoy

On Fri, Apr 03, 2020 at 04:04:56PM +0530, Naveen N. Rao wrote:
> Gautham R Shenoy wrote:
> >On Wed, Apr 01, 2020 at 03:12:53PM +0530, Naveen N. Rao wrote:
> >>Hi Gautham,
> >>
> >>Gautham R. Shenoy wrote:
> >>>From: "Gautham R. Shenoy" 
> >>>
> >>>+
> >>>+static inline u64 read_this_idle_purr(void)
> >>>+{
> >>>+  /*
> >>>+   * If we are reading from an idle context, update the
> >>>+   * idle-purr cycles corresponding to the last idle period.
> >>>+   * Since the idle context is not yet over, take a fresh
> >>>+   * snapshot of the idle-purr.
> >>>+   */
> >>>+  if (unlikely(get_lppaca()->idle == 1)) {
> >>>+  update_idle_purr_accounting();
> >>>+  snapshot_purr_idle_entry();
> >>>+  }
> >>>+
> >>>+  return be64_to_cpu(get_lppaca()->wait_state_cycles);
> >>>+}
> >>>+
> >>
> >>I think this and read_this_idle_spurr() from the next patch should be moved
> >>to Patch 4/6, where they are actually used.
> >
> >The reason I included this function in this patch was to justify why
> >we were introducing snapshotting the purr values in a global per-cpu
> >variable instead of on a stack variable. The reason being that someone
> >might want to read the PURR value from an interrupt context which had
> >woken up the CPU from idle. At this point, since epilog() function
> >wasn't called, the idle PURR count corresponding to this latest idle
> >period would have been accumulated in lppaca->wait_cycles. Thus, this
> >helper function safely reads the value by
> >   1) First updating the lppaca->wait_cycles with the latest idle_purr
> >   count.
> >   2) Take a fresh snapshot, since the time from now to the epilog()
> >   call is also counted under idle CPU. So the PURR cycle increment
> >   during this short period should also be accumulated in 
> > lppaca->wait_cycles.
> >
> >
> >prolog()
> >|snapshot PURR
> >|
> >|
> >|
> >Idle
> >|
> >| <- Interrupt . Read idle PURR  update idle PURR;
> >| snapshot PURR;
> >| Read idle PURR.   |
> >epilog()
> > update idle PURR
> >
> 
> Yes, I understand. It makes sense.
> 
> >
> >However, if you feel that moving this function to Patch 4 where it is
> >actually used makes it more readable, I can do that.
> 
> My suggestion was from a bisectability standpoint though. This is a fairly
> simple function, but it is generally recommended to ensure that newly added
> code gets exercized in the patch that it is introduced in:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/5.Posting.rst#n119
>

Fair point. Will move those functions to Patch 4.


> 
> Regards,
> Naveen
>

Re: [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry

2020-04-05 Thread Gautham R Shenoy

On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > From: "Gautham R. Shenoy" 
> > 
> > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > scheduling in the guest vCPU.
> > 
> > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > set. This patch changes the behaviour to enter the guest with
> > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > unconditionally clear these bits. Ideally this should be done
> > conditionally on platforms where the guest stop instruction has no
> > Bugs (starting POWER9 DD2.3).
> 
> How will guests know that they can use this facility safely after your
> series? You need both DD2.3 and a patched KVM.


Yes, this is something that isn't addressed in this series (mentioned
in the cover letter), which is a POC demonstrating that the stop0lite
state in guest works.

However, to answer your question, this is the scheme that I had in
mind :

OPAL:
   On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"

Hypervisor Kernel:
1. If "idle-stop-guest" dt-cpu-feature is discovered, then
   we set bool enable_guest_stop = true;

2. During KVM guest entry, clear PSSCR[ESL|EC] iff
   enable_guest_stop == true.

3. In kvm_vm_ioctl_check_extension(), for a new capability
   KVM_CAP_STOP, return true iff enable_guest_top == true.

QEMU:
   Check with the hypervisor if KVM_CAP_STOP is present. If so,
   indicate the presence to the guest via device tree.

Guest Kernel:
   Check for the presence of guest stop state support in
   device-tree. If available, enable the stop0lite in the cpuidle
   driver. 
   

We still have a challenge of migrating a guest which started on a
hypervisor supporting guest stop state to a hypervisor without it.
The target hypervisor should atleast have Patch 1 of this series, so
that we don't crash the guest.

> 
> > 
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> >  arch/powerpc/kvm/book3s_hv.c|  2 +-
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +
> >  2 files changed, 14 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index cdb7224..36d059a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> > *vcpu, u64 time_limit,
> > mtspr(SPRN_IC, vcpu->arch.ic);
> > mtspr(SPRN_PID, vcpu->arch.pid);
> >  
> > -   mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > +   mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> >   (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> >  
> > mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index dbc2fec..c2daec3 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > mtspr   SPRN_PID, r7
> > mtspr   SPRN_WORT, r8
> >  BEGIN_FTR_SECTION
> > +   /* POWER9-only registers */
> > +   ld  r5, VCPU_TID(r4)
> > +   ld  r6, VCPU_PSSCR(r4)
> > +   lbz r8, HSTATE_FAKE_SUSPEND(r13)
> > +   lis r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > +   andcr6, r6, r7
> > +   rldimi  r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > +   ld  r7, VCPU_HFSCR(r4)
> > +   mtspr   SPRN_TIDR, r5
> > +   mtspr   SPRN_PSSCR, r6
> > +   mtspr   SPRN_HFSCR, r7
> > +FTR_SECTION_ELSE
> 
> Why did you move these around? Just because the POWER9 section became
> larger than the other?

Yes.

> 
> That's a real wart in the instruction patching implementation, I think
> we can fix it by padding with nops in the macros.
> 
> Can you just add the additional required nops to the top branch without
> changing them around for this patch, so it's easier to see what's going
> on? The end result will be the same after patching. Actually changing
> these around can have a slight unintended consequence in that code that
> runs before features were patched will execute the IF code. Not a
> problem here, but another reason why the instruction patching 
> restriction is annoying.

Sure, I will repost this patch with additional nops instead of
moving them around.

&

[PATCH v2 0/1] pseries/hotplug: Change the default behaviour of cede_offline

2019-10-22 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

This is the v2 of the fix to change the default behaviour of cede_offline.
The previous version can be found here: https://lkml.org/lkml/2019/9/12/222

The main change from v1 is that the patch2 to create a sysfs file to
report and control the value of cede_offline_enabled has been dropped.

Problem Description:

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since the CPUs are
still under the control fo the Linux Guests. Hence, when we change the
SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
will continue to count the values for offlined CPUs in extended cede
as if they are online.

One of the expectations with PURR is that the for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.

In the following data (Generated using
https://github.com/gautshen/misc/blob/master/purr_tb.py):

SD-PURR = Sum of PURR increments on online CPUs of that core in 1 second
  
SMT=off
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0]51200   69883784
core01 [  8]51200   88782536
core02 [ 16]51200   94296824
core03 [ 24]51200   80951968

SMT=2
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1]  51200   136147792   
core01 [  8,9]  51200   128636784   
core02 [ 16,17] 51200   135426488   
core03 [ 24,25] 51200   153027520   

SMT=4
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3]  51200   258331616   
core01 [  8,9,10,11]51200   274220072   
core02 [ 16,17,18,19]   51200   260013736   
core03 [ 24,25,26,27]   51200   260079672   

SMT=on
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3,4,5,6,7]  51200   512941248   
core01 [  8,9,10,11,12,13,14,15]51200   512936544   
core02 [ 16,17,18,19,20,21,22,23]   51200   512931544   
core03 [ 24,25,26,27,28,29,30,31]   51200   512923800

This patchset addresses this issue by ensuring that by default, the
offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call
by changing the default value of "cede_offline_enabled" to false.

With the patches, we see that the observed value of the sum of the
PURR increments across the the online threads of a core in 1-second
matches the number of tb-ticks in 1-second.

SMT=off
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0]51200512527568  
core01 [  8]51200512556128  
core02 [ 16]51200512590016  
core03 [ 24]51200512589440  

SMT=2
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1]  51200   512635328
core01 [  8,9]  51200   512610416   
core02 [ 16,17] 51200   512639360   
core03 [ 24,25] 51200   512638720   

SMT=4
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3]  51200   512757328   
core01 [  8,9,10,11]51200   512727920   
core02 [ 16,17,18,19]   51200   512754712   
core03 [ 24,25,26,27]   51200   512739040   

SMT=on
==
Core   SD-PURR SD-PURR

[PATCH v2 1/1] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"

2019-10-22 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

Currently on PSeries Linux guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since they are still
under the control fo the Linux guests. Hence, when we change the SMT
modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will
continue to count the values for offlined CPUs in extended cede as if
they are online. This breaks the accounting in tools such as lparstat.

To fix this, ensure that by default the offlined CPUs are returned to
the hypervisor via RTAS "stop-self" call by changing the default value
of "cede_offline_enabled" to false.

Fixes: commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU
into an appropriate offline state")

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/core-api/cpu_hotplug.rst   |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/cpu_hotplug.rst 
b/Documentation/core-api/cpu_hotplug.rst
index 4a50ab7..5319593 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -53,7 +53,7 @@ Command Line Switches
 ``cede_offline={"off","on"}``
   Use this option to disable/enable putting offlined processors to an extended
   ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
-  ``cede_offline`` is set to "on".
+  ``cede_offline`` is set to "off".
 
   This option is limited to the PowerPC architecture.
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index bbda646..f9d0366 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, 
preferred_offline_state) =
 
 static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;
 
-static bool cede_offline_enabled __read_mostly = true;
+/*
+ * Determines whether the offlined CPUs should be put to a long term
+ * processor cede (called extended cede) for power-saving
+ * purposes. The CPUs in extended cede are still with the Linux Guest
+ * and are not returned to the Hypervisor.
+ *
+ * By default, the offlined CPUs are returned to the hypervisor via
+ * RTAS "stop-self". This behaviour can be changed by passing the
+ * kernel commandline parameter "cede_offline=on".
+ */
+static bool cede_offline_enabled __read_mostly;
 
 /*
  * Enable/disable cede_offline when available.
-- 
1.9.4

1 2 3 4 5 6 7 >

1 - 100 of 626 matches

Mail list logo