Re: [PATCH net-next] powerpc: use asm-generic/socket.h as much as possible

2017-05-31 Thread Arnd Bergmann
On Wed, May 31, 2017 at 7:43 AM, Stephen Rothwell  wrote:
> asm-generic/socket.h already has an exception for the differences that
> powerpc needs, so just include it after defining the differences.
>
> Signed-off-by: Stephen Rothwell 

Acked-by: Arnd Bergmann 


Re: [PATCH 6/6] cpuidle-powernv: Allow Deep stop states that don't stop time

2017-05-31 Thread Gautham R Shenoy
On Tue, May 30, 2017 at 09:10:06PM +1000, Nicholas Piggin wrote:
> On Tue, 30 May 2017 16:20:55 +0530
> Gautham R Shenoy  wrote:
> 
> > On Tue, May 30, 2017 at 05:13:57PM +1000, Nicholas Piggin wrote:
> > > On Tue, 16 May 2017 14:19:48 +0530
> > > "Gautham R. Shenoy"  wrote:
> > >   
> > > > From: "Gautham R. Shenoy" 
> > > > 
> > > > The current code in the cpuidle-powernv initialization only allows deep
> > > > stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose timebase
> > > > (indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes back to
> > > > POWER8 time where deep states used to lose the timebase. However, on
> > > > POWER9, we do have stop states that are deep (they lose hypervisor
> > > > state) but retain the timebase.
> > > > 
> > > > Fix the initialization code in the cpuidle-powernv driver to allow
> > > > such deep states.
> > > > 
> > > > Further, there is a bug in the cpuidle-powernv driver with
> > > > CONFIG_TICK_ONESHOT=n where we end up incrementing the nr_idle_states
> > > > even if a platform idle state which loses time base was not added to
> > > > the cpuidle table.
> > > > 
> > > > Fix this by ensuring that the nr_idle_states variable gets incremented
> > > > only when the platform idle state was added to the cpuidle table.  
> > > 
> > > Should this be a separate patch? Stable?  
> > 
> > Ok. Will send it out separately.
> 
> Looks like mpe has merged this in next now. I just wonder if this
> particular bit would be relevant for POWER8 and therefore be a
> stable candidate? All the POWER9 idle fixes may not be suitable for
> stable.

I agree. The other POWER9 fixes aren't suitable for stable. I will
clean this patch alone based on your suggestion and mark it for
stable.

> 
> 
> > > > Signed-off-by: Gautham R. Shenoy 
> > > > ---
> > > >  drivers/cpuidle/cpuidle-powernv.c | 16 ++--
> > > >  1 file changed, 10 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> > > > b/drivers/cpuidle/cpuidle-powernv.c
> > > > index 12409a5..45eaf06 100644
> > > > --- a/drivers/cpuidle/cpuidle-powernv.c
> > > > +++ b/drivers/cpuidle/cpuidle-powernv.c
> > > > @@ -354,6 +354,7 @@ static int powernv_add_idle_states(void)
> > > >  
> > > > for (i = 0; i < dt_idle_states; i++) {
> > > > unsigned int exit_latency, target_residency;
> > > > +   bool stops_timebase = false;
> > > > /*
> > > >  * If an idle state has exit latency beyond
> > > >  * POWERNV_THRESHOLD_LATENCY_NS then don't use it
> > > > @@ -381,6 +382,9 @@ static int powernv_add_idle_states(void)
> > > > }
> > > > }
> > > >  
> > > > +   if (flags[i] & OPAL_PM_TIMEBASE_STOP)
> > > > +   stops_timebase = true;
> > > > +
> > > > /*
> > > >  * For nap and fastsleep, use default target_residency
> > > >  * values if f/w does not expose it.
> > > > @@ -392,8 +396,7 @@ static int powernv_add_idle_states(void)
> > > > add_powernv_state(nr_idle_states, "Nap",
> > > >   CPUIDLE_FLAG_NONE, nap_loop,
> > > >   target_residency, 
> > > > exit_latency, 0, 0);
> > > > -   } else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
> > > > -   !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> > > > +   } else if (has_stop_states && !stops_timebase) {
> > > > add_powernv_state(nr_idle_states, names[i],
> > > >   CPUIDLE_FLAG_NONE, stop_loop,
> > > >   target_residency, 
> > > > exit_latency,
> > > > @@ -405,8 +408,8 @@ static int powernv_add_idle_states(void)
> > > >  * within this config dependency check.
> > > >  */
> > > >  #ifdef CONFIG_TICK_ONESHOT
> > > > -   if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> > > > -   flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
> > > > +   else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> > > > +flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {  
> > > 
> > > Hmm, seems okay but readability isn't the best with the ifdef and
> > > mixing power8 and 9 cases IMO.
> > > 
> > > Particularly with the nice regular POWER9 states, we're not doing much
> > > logic in this loop besides checking for the timebase stop flag, right?
> > > Would it be clearer if it was changed to something like this (untested
> > > quick hack)?  
> > 
> > Yes, this is very much doable. Some comments below.
> > 
> > > 
> > > ---
> > >  drivers/cpuidle/cpuidle-powernv.c | 76 
> > > +++
> > >  1 file changed, 37 insertions(+), 39 deletions(-)
> > > 
> > > diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> > > b/drivers/cpuidle/cpuidle-powern
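The accounting fix discussed above (only bump nr_idle_states when a state actually lands in the cpuidle table) can be sketched in isolation. This is an illustration, not the driver code: the PM_* values and count_usable_states() are made-up stand-ins for the OPAL_PM_* flags and the powernv_add_idle_states() loop, and tick_oneshot stands in for CONFIG_TICK_ONESHOT.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the OPAL_PM_* flag bits (values are illustrative). */
#define PM_STOP_INST_FAST  0x1u
#define PM_STOP_INST_DEEP  0x2u
#define PM_TIMEBASE_STOP   0x4u

/*
 * Count states the way the fixed loop does: a timebase-stopping state is
 * skipped entirely when oneshot tick support is absent, so the counter is
 * only incremented for states actually added to the table.
 */
int count_usable_states(const unsigned int *flags, int n, bool tick_oneshot)
{
	int nr_idle_states = 0;

	for (int i = 0; i < n; i++) {
		bool stops_timebase = flags[i] & PM_TIMEBASE_STOP;

		if (stops_timebase && !tick_oneshot)
			continue;	/* not added, so not counted */
		nr_idle_states++;	/* state added to the cpuidle table */
	}
	return nr_idle_states;
}
```

The pre-fix bug was incrementing the counter even on the `continue` path, leaving a hole in the table.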

Re: [PATCH net-next] powerpc: use asm-generic/socket.h as much as possible

2017-05-31 Thread Michael Ellerman
Stephen Rothwell  writes:

> asm-generic/socket.h already has an exception for the differences that
> powerpc needs, so just include it after defining the differences.
>
> Signed-off-by: Stephen Rothwell 
> ---
>  arch/powerpc/include/uapi/asm/socket.h | 92 
> +-
>  1 file changed, 1 insertion(+), 91 deletions(-)
>
> Build tested using powerpc allyesconfig, pseries_le_defconfig, 32 bit
> and 64 bit allnoconfig and ppc44x_defconfig builds.

Did you boot it and test that userspace was happy doing sockety things?

cheers


Re: [PATCH] rtc/tpo: Handle disabled TPO in opal_get_tpo_time()

2017-05-31 Thread Alexandre Belloni
On 19/05/2017 at 15:35:09 +0530, Vaibhav Jain wrote:
> On the PowerNV platform, when Timed-Power-On (TPO) is disabled, a read
> of the stored TPO yields a value with all date components set to '0'
> inside opal_get_tpo_time(). The function opal_to_tm() then converts it to an
> offset from year 1900 yielding alarm-time == "1900-00-01
> 00:00:00". This causes problems with __rtc_read_alarm(), which expects
> an offset from "1970-01-01 00:00:00"; the returned alarm-time yields a
> negative value for time64_t, which ultimately results in this
> error reported in kernel logs with a seemingly garbage value:
> 
> "rtc rtc0: invalid alarm value: -2-1--1041528741
> 200557:71582844:32"
> 
> We fix this by explicitly handling the case of all alarm date-time
> components being '0' inside opal_get_tpo_time() and returning -ENOENT
> in such a case. This signals the generic rtc layer that no alarm is set and it
> bails out from the alarm initialization flow without reporting the
> above error.
> 
> Signed-off-by: Vaibhav Jain 
> Reported-by: Steve Best 
> ---
>  drivers/rtc/rtc-opal.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
Applied, thanks.

-- 
Alexandre Belloni, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
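The check described in the patch above can be sketched as a plain predicate. This is an illustration, not the rtc-opal code: rtc_time_sketch and ENOENT_SKETCH are stand-ins for struct rtc_time and the kernel's -ENOENT.

```c
#include <assert.h>

/* Minimal stand-in for the struct rtc_time fields involved. */
struct rtc_time_sketch {
	int tm_sec, tm_min, tm_hour, tm_mday, tm_mon, tm_year;
};

#define ENOENT_SKETCH 2	/* stand-in for the kernel's ENOENT */

/*
 * If every date/time component read back from firmware is zero, treat the
 * alarm as disabled and return -ENOENT so the generic RTC core bails out
 * instead of processing a pre-1970 timestamp.
 */
int tpo_time_valid(const struct rtc_time_sketch *tm)
{
	if (!tm->tm_year && !tm->tm_mon && !tm->tm_mday &&
	    !tm->tm_hour && !tm->tm_min && !tm->tm_sec)
		return -ENOENT_SKETCH;
	return 0;
}
```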


Re: [PATCH] drivers/rtc/interface.c: Validate alarm-time before handling rollover

2017-05-31 Thread Alexandre Belloni
On 19/05/2017 at 22:18:55 +0530, Vaibhav Jain wrote:
> In function __rtc_read_alarm() it's possible for an alarm time-stamp to
> be invalid even after replacing missing components with current
> time-stamp. The condition 'alarm->time.tm_year < 70' will trigger this
> case and will cause the call to 'rtc_tm_to_time64(&alarm->time)' to
> return a negative value for variable t_alm.
> 
> While handling alarm rollover, this negative t_alm (assumed to be a
> seconds offset from '1970-01-01 00:00:00') is converted back to rtc_time via
> rtc_time64_to_tm() which results in this error log with seemingly
> garbage values:
> 
> "rtc rtc0: invalid alarm value: -2-1--1041528741
> 200557:71582844:32"
> 
> This error was generated when the rtc driver (rtc-opal in this case)
> returned an alarm time-stamp of '00-00-00 00:00:00' to indicate that
> the alarm is disabled. Though I have submitted a separate fix for the
> rtc-opal driver, this issue may potentially impact other
> existing/future rtc drivers.
> 
> To fix this issue, the patch validates the alarm time-stamp just after
> filling in the missing datetime components, and if rtc_valid_tm() still
> reports it to be invalid, bails out of the function without handling
> the rollover.
> 
> Reported-by: Steve Best 
> Signed-off-by: Vaibhav Jain 
> ---
>  drivers/rtc/interface.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
Applied, thanks.

-- 
Alexandre Belloni, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
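The arithmetic behind the garbage timestamp is easy to see: converting a tm_year below 70 (years since 1900) to a 1970-based offset goes negative, and converting that negative time64_t back produces nonsense. A rough sketch, ignoring leap days since only the sign matters here:

```c
#include <assert.h>
#include <stdint.h>

/* tm_year counts years since 1900; the epoch is 1970, i.e. tm_year == 70.
 * Leap days are ignored -- only the sign of the result matters here. */
int64_t years_to_epoch_secs(int tm_year)
{
	return ((int64_t)tm_year - 70) * 365LL * 24 * 3600;
}
```

An all-zero alarm (tm_year == 0, i.e. 1900) therefore yields a negative offset, which is exactly what trips up the rollover handling.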


Re: [PATCH net-next] powerpc: use asm-generic/socket.h as much as possible

2017-05-31 Thread Stephen Rothwell
Hi Michael,

On Wed, 31 May 2017 20:15:55 +1000 Michael Ellerman  wrote:
>
> Stephen Rothwell  writes:
> 
> > asm-generic/socket.h already has an exception for the differences that
> > powerpc needs, so just include it after defining the differences.
> >
> > Signed-off-by: Stephen Rothwell 
> > ---
> >  arch/powerpc/include/uapi/asm/socket.h | 92 
> > +-
> >  1 file changed, 1 insertion(+), 91 deletions(-)
> >
> > Build tested using powerpc allyesconfig, pseries_le_defconfig, 32 bit
> > and 64 bit allnoconfig and ppc44x_defconfig builds.  
> 
> Did you boot it and test that userspace was happy doing sockety things?

No, sorry.

The patch was done by inspection, but it is pretty obvious ... here is
the diff between arch/powerpc/include/uapi/asm/socket.h and
include/uapi/asm-generic/socket.h before the patch:

--- arch/powerpc/include/uapi/asm/socket.h  2017-05-31 20:56:54.940473709 
+1000
+++ include/uapi/asm-generic/socket.h   2017-05-31 10:04:16.716445463 +1000
@@ -1,12 +1,5 @@
-#ifndef _ASM_POWERPC_SOCKET_H
-#define _ASM_POWERPC_SOCKET_H
-
-/*
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
+#ifndef __ASM_GENERIC_SOCKET_H
+#define __ASM_GENERIC_SOCKET_H
 
 #include 
 
@@ -30,12 +23,14 @@
 #define SO_LINGER  13
 #define SO_BSDCOMPAT   14
 #define SO_REUSEPORT   15
-#define SO_RCVLOWAT16
-#define SO_SNDLOWAT17
-#define SO_RCVTIMEO18
-#define SO_SNDTIMEO19
-#define SO_PASSCRED20
-#define SO_PEERCRED21
+#ifndef SO_PASSCRED /* powerpc only differs in these */
+#define SO_PASSCRED16
+#define SO_PEERCRED17
+#define SO_RCVLOWAT18
+#define SO_SNDLOWAT19
+#define SO_RCVTIMEO20
+#define SO_SNDTIMEO21
+#endif
 
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 22
@@ -71,7 +66,7 @@
 #define SO_RXQ_OVFL 40
 
 #define SO_WIFI_STATUS 41
-#define SCM_WIFI_STATUSSO_WIFI_STATUS
+#define SCM_WIFI_STATUSSO_WIFI_STATUS
 #define SO_PEEK_OFF42
 
 /* Instruct lower device to use last 4-bytes of skb data as FCS */
@@ -107,4 +102,4 @@
 
 #define SCM_TIMESTAMPING_PKTINFO   58
 
-#endif /* _ASM_POWERPC_SOCKET_H */
+#endif /* __ASM_GENERIC_SOCKET_H */

-- 
Cheers,
Stephen Rothwell
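The diff works because of the override pattern asm-generic/socket.h already carries: the arch header defines its divergent constants first, then includes the generic header, whose #ifndef block skips the conflicting defaults. A compressed sketch using the values from the diff above (the real headers define many more SO_* constants):

```c
#include <assert.h>

/* What powerpc's asm/socket.h keeps (values from the diff above). */
#define SO_RCVLOWAT	16
#define SO_SNDLOWAT	17
#define SO_RCVTIMEO	18
#define SO_SNDTIMEO	19
#define SO_PASSCRED	20
#define SO_PEERCRED	21

/* What asm-generic/socket.h then does: this block is skipped because
 * SO_PASSCRED is already defined, so the generic defaults never apply. */
#ifndef SO_PASSCRED /* powerpc only differs in these */
#define SO_PASSCRED	16
#define SO_PEERCRED	17
#define SO_RCVLOWAT	18
#define SO_SNDLOWAT	19
#define SO_RCVTIMEO	20
#define SO_SNDTIMEO	21
#endif
```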


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

2017-05-31 Thread Michael Bringmann


On 05/29/2017 12:32 AM, Michael Ellerman wrote:
> Reza Arbab  writes:
> 
>> On Fri, May 26, 2017 at 01:46:58PM +1000, Michael Ellerman wrote:
>>> Reza Arbab  writes:
>>>
 On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
> The commit message for 3af229f2071f says:
>
>In practice, we never see a system with 256 NUMA nodes, and in fact, we
>do not support node hotplug on power in the first place, so the nodes
>^^^
>that are online when we come up are the nodes that will be present for
>the lifetime of this kernel.
>
> Is that no longer true?

 I don't know what the reasoning behind that statement was at the time,
 but as far as I can tell, the only thing missing for node hotplug now is
 Balbir's patchset [1]. He fixes the resource issue which motivated
 3af229f2071f and reverts it.

 With that set, I can instantiate a new numa node just by doing
 add_memory(nid, ...) where nid doesn't currently exist.
>>>
>>> But does that actually happen on any real system?
>>
>> I don't know if anything currently tries to do this. My interest in 
>> having this working is so that in the future, our coherent gpu memory 
>> could be added as a distinct node by the device driver.
> 
> Sure. If/when that happens, we would hopefully still have some way to
> limit the size of the possible map.
> 
> That would ideally be a firmware property that tells us the maximum
> number of GPUs that might be hot-added, or we punt and cap it at some
> "sane" maximum number.
> 
> But until that happens it's silly to say we can have up to 256 nodes
> when in practice most of our systems have 8 or less.
> 
> So I'm still waiting for an explanation from Michael B on how he's
> seeing this bug in practice.

I already answered this in an earlier message.  I will give an example.

* Let there be a configuration with nodes (0, 4-5, 8) that boots with 1 VP
  and 10G of memory in a shared processor configuration.
* At boot time, 4 nodes are put into the possible map by the PowerPC boot
  code.
* Subsequently, the NUMA code executes and puts the 10G memory into nodes
  4 & 5.  No memory goes into Node 0.  So we now have 2 nodes in the
  node_online_map.
* The VP and its threads get assigned to Node 4.
* Then when 'initmem_init()' in 'powerpc/numa.c' executes the instruction,
 node_and(node_possible_map, node_possible_map, node_online_map);
  the content of the node_possible_map is reduced to nodes 4-5.
* Later on we hot-add 90G of memory to the system.  It tries to put the
  memory into nodes 0, 4-5, 8 based on the memory association map.  We
  should see memory put into all 4 nodes.  However, since we have reduced
  the 'node_possible_map' to only nodes 4 & 5, we can now only put memory
  into 2 of the configured nodes.

# We want to be able to put memory into all 4 nodes via hot-add operations,
  not only the nodes that 'survive' boot time initialization.  We could
  make a number of changes to ensure that all of the nodes in the initial
  configuration provided by the pHyp can be used, but this one appears to
  be the simplest, only using resources requested by the pHyp at boot --
  even if those resources are not used immediately.
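With nodemasks reduced to plain bitmasks, the shrinking step can be sketched as below; surviving_nodes() is a made-up helper mirroring the node_and() call quoted above, not kernel code:

```c
#include <assert.h>

/* Mirror of node_and(node_possible_map, node_possible_map, node_online_map),
 * with each bit standing for one NUMA node. */
unsigned int surviving_nodes(unsigned int possible, unsigned int online)
{
	return possible & online;
}
```

With nodes {0, 4, 5, 8} possible at boot and only {4, 5} online, nodes 0 and 8 drop out of the possible map and can never receive hot-added memory afterwards, which is the failure described above.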

> 
> cheers
> 

Regards,
Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Christoph Lameter
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?



Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Christoph Lameter
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default 
> > order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. Ok debugging increased the object size to an order 4. This should be
order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?


Re: [v3 0/9] parallelized "struct page" zeroing

2017-05-31 Thread Michal Hocko
On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
> >Could you be more specific? E.g. how are other stores done in
> >__init_single_page safe then? I am sorry to be dense here but how does
> >the full 64B store differ from other stores done in the same function.
> 
> Hi Michal,
> 
> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
> without membar, but they are slower compared to STBI which require a membar
> before memory can be accessed.

OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
would do memset. You said it would be slower but would that be
measurable? I am sorry to be so persistent here but I would be really
happier if this didn't depend on the deferred initialization. If this is
absolutely a no-go then I can live with that of course.

-- 
Michal Hocko
SUSE Labs


Re: [v3 0/9] parallelized "struct page" zeroing

2017-05-31 Thread David Miller
From: Michal Hocko 
Date: Wed, 31 May 2017 18:31:31 +0200

> On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
>> >Could you be more specific? E.g. how are other stores done in
>> >__init_single_page safe then? I am sorry to be dense here but how does
>> >the full 64B store differ from other stores done in the same function.
>> 
>> Hi Michal,
>> 
>> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
>> without membar, but they are slower compared to STBI which require a membar
>> before memory can be accessed.
> 
> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

It is measurable.  That's the impetus for this work in the first place.

When we do the memory barrier, the whole store buffer flushes because
the memory barrier is done with a dependency on the next load or store
operation, one of which the caller is going to do immediately.


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Hugh Dickins
[ Merging two mails into one response ]

On Wed, 31 May 2017, Christoph Lameter wrote:
> On Tue, 30 May 2017, Hugh Dickins wrote:
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default 
> > order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
> 
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
> 
> I am curious as to what is going on there. Do you have the output from
> these failed allocations?

I thought the relevant output was in my mail.  I did skip the Mem-Info
dump, since that just seemed noise in this case: we know memory can get
fragmented.  What more output are you looking for?

> 
> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
> 
> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
> be able to enable it at runtime.

Yes, I thought so.

> 
> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> > makes no real difference to the outcome: swapping loads still abort early.
> 
> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
> 
> Ahh. Ok debugging increased the object size to an order 4. This should be
> order 3 without debugging.

But it was still order 4 when booted with slub_debug=O, which surprised me.
And that surprises you too?  If so, then we ought to dig into it further.

> 
> Why are the slab allocators used to create slab caches for large object
> sizes?

There may be more optimal ways to allocate, but I expect that when
the ppc guys are writing the code to handle both 4k and 64k page sizes,
kmem caches offer the best span of possibility without complication.

> 
> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
> 
> I thought you had these huge 64k page sizes?

ppc64 does support 64k page sizes, and they've been the default for years;
but since 4k pages are still supported, I choose to use those (I doubt
I could ever get the same load going with 64k pages).

Hugh
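The order arithmetic in this thread is straightforward to check; alloc_order() below is a made-up helper mirroring what the kernel's get_order() computes for 4k pages, not kernel code:

```c
#include <assert.h>

#define PAGE_SHIFT_4K 12	/* 4k pages, as in Hugh's config */

/* Smallest order such that (1 << order) pages cover `size` bytes. */
int alloc_order(unsigned long size)
{
	unsigned long pages = (size + (1UL << PAGE_SHIFT_4K) - 1) >> PAGE_SHIFT_4K;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}
```

A 32K pgtable object is an order-3 allocation; padding it to a 64K buffer for debugging pushes it to order 4, matching the "min order: 4" in the SLUB message quoted above.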


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Mathieu Malaterre
On Wed, May 31, 2017 at 8:44 PM, Hugh Dickins  wrote:
> [ Merging two mails into one response ]
>
> On Wed, 31 May 2017, Christoph Lameter wrote:
>> On Tue, 30 May 2017, Hugh Dickins wrote:
>> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default 
>> > order: 4, min order: 4
>> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>>
>> > I did try booting with slub_debug=O as the message suggested, but that
>> > made no difference: it still hoped for but failed on order:4 allocations.
>>
>> I am curious as to what is going on there. Do you have the output from
>> these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?
>
>>
>> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
>> > it seemed to be a hard requirement for something, but I didn't find what.
>>
>> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
>> be able to enable it at runtime.
>
> Yes, I thought so.
>
>>
>> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
>> > the expected order:3, which then results in OOM-killing rather than direct
>> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
>> > makes no real difference to the outcome: swapping loads still abort early.
>>
>> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>>
>> Ahh. Ok debugging increased the object size to an order 4. This should be
>> order 3 without debugging.
>
> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.
>
>>
>> Why are the slab allocators used to create slab caches for large object
>> sizes?
>
> There may be more optimal ways to allocate, but I expect that when
> the ppc guys are writing the code to handle both 4k and 64k page sizes,
> kmem caches offer the best span of possibility without complication.
>
>>
>> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
>> > with 4k pages would do better not to expect to support a 128TB userspace.
>>
>> I thought you had these huge 64k page sizes?
>
> ppc64 does support 64k page sizes, and they've been the default for years;
> but since 4k pages are still supported, I choose to use those (I doubt
> I could ever get the same load going with 64k pages).

4k is pretty much required on ppc64 when it comes to nouveau:

https://bugs.freedesktop.org/show_bug.cgi?id=94757

2cts


Re: [v3 0/9] parallelized "struct page" zeroing

2017-05-31 Thread Pasha Tatashin

> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.


Hi Michal,

This is actually a very good idea. I just did some measurements, and it 
looks like performance is very good.


Here is data from SPARC-M7 with 3312G memory with single thread performance:

Current:
memset() in memblock allocator takes: 8.83s
__init_single_page() take: 8.63s

Option 1:
memset() in __init_single_page() takes: 61.09s (as we discussed because 
of membar overhead, memset should really be optimized to do STBI only 
when size is 1 page or bigger).


Option 2:

8 stores (stx) in __init_single_page(): 8.525s!

So, even for single thread performance we can double the initialization 
speed of "struct page" on SPARC by removing memset() from memblock, and 
using 8 stx in __init_single_page(). It appears we never miss L1 in 
__init_single_page() after the initial 8 stx.


I will update patches with memset() on other platforms, and stx on SPARC.

My experimental code looks like this:

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
					 unsigned long zone, int nid)
{
	__asm__ __volatile__(
	"stx\t%%g0, [%0 + 0x00]\n"
	"stx\t%%g0, [%0 + 0x08]\n"
	"stx\t%%g0, [%0 + 0x10]\n"
	"stx\t%%g0, [%0 + 0x18]\n"
	"stx\t%%g0, [%0 + 0x20]\n"
	"stx\t%%g0, [%0 + 0x28]\n"
	"stx\t%%g0, [%0 + 0x30]\n"
	"stx\t%%g0, [%0 + 0x38]\n"
	:
	: "r" (page));
	set_page_links(page, zone, nid, pfn);
	init_page_count(page);
	page_mapcount_reset(page);
	page_cpupid_reset_last(page);

	INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
	if (!is_highmem_idx(zone))
		set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}

Thank you,
Pasha
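For reference, the non-SPARC side Michal asked about reduces to a plain memset. This is a hedged sketch, with struct page stood in by a 64-byte buffer -- the real struct page size depends on the kernel config, and the real function would take a struct page pointer:

```c
#include <assert.h>
#include <string.h>

#define STRUCT_PAGE_SIZE 64	/* assumption: 64-byte struct page */

/* Generic fallback: zero the whole descriptor with ordinary stores;
 * no block-init stores, so no membar is needed afterwards. */
void zero_struct_page(void *page)
{
	memset(page, 0, STRUCT_PAGE_SIZE);
}
```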


Re: [PATCH] powerpc/powernv: Enable PCI peer-to-peer

2017-05-31 Thread kbuild test robot
Hi Frederic,

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.12-rc3 next-20170531]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Frederic-Barrat/powerpc-powernv-Enable-PCI-peer-to-peer/20170531-035613
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allmodconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

>> arch/powerpc/platforms/powernv/pci-ioda.c:1411:13: error: static declaration 
>> of 'pnv_pci_ioda2_set_bypass' follows non-static declaration
static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
^~~~
   In file included from arch/powerpc/platforms/powernv/pci-ioda.c:48:0:
   arch/powerpc/platforms/powernv/pci.h:233:13: note: previous declaration of 
'pnv_pci_ioda2_set_bypass' was here
extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
^~~~

vim +/pnv_pci_ioda2_set_bypass +1411 arch/powerpc/platforms/powernv/pci-ioda.c

ee8222fe Wei Yang 2015-10-22  1405  
pnv_pci_vf_release_m64(pdev, num_vfs);
781a868f Wei Yang 2015-03-25  1406  return -EBUSY;
781a868f Wei Yang 2015-03-25  1407  }
781a868f Wei Yang 2015-03-25  1408  
c035e37b Alexey Kardashevskiy 2015-06-05  1409  static long 
pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
c035e37b Alexey Kardashevskiy 2015-06-05  1410  int num);
c035e37b Alexey Kardashevskiy 2015-06-05 @1411  static void 
pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
c035e37b Alexey Kardashevskiy 2015-06-05  1412  
781a868f Wei Yang 2015-03-25  1413  static void 
pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
781a868f Wei Yang 2015-03-25  1414  {

:: The code at line 1411 was first introduced by commit
:: c035e37b58c75ca216bfd1d5de3c1080ac0022b9 powerpc/powernv/ioda2: Use new 
helpers to do proper cleanup on PE release

:: TO: Alexey Kardashevskiy 
:: CC: Michael Ellerman 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Aneesh Kumar K.V
Hugh Dickins  writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 
> 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, 
> mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c090b5c0] [c04f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c090b650] [c00eb194] .warn_alloc+0xf0/0x178
> [c090b710] [c00ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c090b8b0] [c013921c] .new_slab+0x234/0x608
> [c090b980] [c013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c090bad0] [c04f5a84] 
> .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c090bb70] [c013b864] .kmem_cache_alloc+0x140/0x288
> [c090bc30] [c004d934] .mm_init.isra.65+0x128/0x1c0
> [c090bcc0] [c0157810] .do_execveat_common.isra.39+0x294/0x690
> [c090bdb0] [c0157e70] .SyS_execve+0x28/0x38
> [c090be30] [c000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...
>

Can you try this patch?

commit fc55c0dc8b23446f937c1315aa61e74673de5ee6
Author: Aneesh Kumar K.V 
Date:   Thu Jun 1 08:06:40 2017 +0530

powerpc/mm/4k: Limit 4k page size to 64TB

Supporting 512TB requires us to do an order 3 allocation for the level 1
page table (pgd). Limit 4k to 64TB for now.

Signed-off-by: Aneesh Kumar K.V 

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b671ca..0c4e470571ca 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE   (sizeof(pte_t) << H_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index a2123f291ab0..5de3271026f1 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -110,13 +110,15 @@ void release_thread(struct task_struct *);
 #define TASK_SIZE_128TB (0x0000800000000000UL)
 #define TASK_SIZE_512TB (0x0002000000000000UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */
-#define TASK_SIZE_USER64   TASK_SIZE_512TB
+#define TASK_SIZE_USER64   TASK_SIZE_512TB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
 #else
-#define TASK_SIZE_USER64   TASK_SIZE_64TB
+#define TASK_SIZE_USER64   TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
 #endif
 
 /*
@@ -132,7 +134,7 @@ void release_thread(struct task_struct *);
  * space during mmap's.
  */
 #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
-#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4))
+#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(DEFAULT_MAP_WINDOW_USER64 / 4))
 
 #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
@@ -143,8 +145,8 @@ void release_thread(struct task_struct *);
  * with 128TB and conditionally enable upto 512TB
  */
 #ifdef CONFIG_PPC_BOOK3S_64
-#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ? \
-TASK_SIZE_USER32 : TASK_SIZE_128TB)
+#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ?\
+  

[PATCH v3 1/3] powerpc/powernv: Add config option for removal of memory

2017-05-31 Thread Rashmica Gupta
This patch adds the config option to enable the removal
of memory from the kernel mappings at runtime. This needs
to be enabled for the hardware trace macro to work.

Signed-off-by: Rashmica Gupta 
---
v2 -> v3: Better description

 arch/powerpc/platforms/powernv/Kconfig  | 8 
 arch/powerpc/platforms/powernv/Makefile | 1 +
 2 files changed, 9 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 6a6f4ef..92493d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -30,3 +30,11 @@ config OPAL_PRD
help
  This enables the opal-prd driver, a facility to run processor
  recovery diagnostics on OpenPower machines
+
+config PPC64_HARDWARE_TRACING
+   bool "Enable removal of RAM from kernel mappings for tracing"
+   depends on MEMORY_HOTREMOVE
+   default n
+   help
+ Enabling this option allows memory (RAM) to be removed from
+ the kernel mappings so it can be used for hardware tracing.
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..8fb026d 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_PPC_SCOM)+= opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
 obj-$(CONFIG_TRACEPOINTS)  += opal-tracepoints.o
 obj-$(CONFIG_OPAL_PRD) += opal-prd.o
+obj-$(CONFIG_PPC64_HARDWARE_TRACING)   += memtrace.o
-- 
2.9.3



[PATCH v3 2/3] powerpc/powernv: Enable removal of memory for in memory tracing

2017-05-31 Thread Rashmica Gupta
The hardware trace macro feature requires access to a chunk of real
memory. This patch provides a debugfs interface for that. When an
integer containing the size of memory to be unplugged is written into
/sys/kernel/debug/powerpc/memtrace/enable, the code will attempt to
remove that much memory from the end of each NUMA node.

This patch also adds additional debugfs files for each node that
allow the tracer to interact with the removed memory, as well as
a trace file that allows userspace to read the generated trace.

Note that this patch does not invoke the hardware trace macro, it
only allows memory to be removed during runtime for the trace macro
to utilise.

Signed-off-by: Rashmica Gupta 
---
v2 -> v3 : - Some changes required to compile with 4.12-rc3.
- Iterating from end of node rather than the start.
- As io_remap_pfn_range is defined as remap_pfn_range, just use 
remap_pfn_range.
- Removed the creation of the node debugfs file as it had no use.

 arch/powerpc/platforms/powernv/memtrace.c | 289 ++
 1 file changed, 289 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/memtrace.c

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
new file mode 100644
index 000..21fa2e4
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -0,0 +1,289 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) IBM Corporation, 2014
+ *
+ * Author: Anton Blanchard 
+ */
+
+#define pr_fmt(fmt) "powernv-memtrace: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* This enables us to keep track of the memory removed from each node. */
+struct memtrace_entry {
+   void *mem;
+   u64 start;
+   u64 size;
+   u32 nid;
+   struct dentry *dir;
+   char name[16];
+};
+
+static struct memtrace_entry *memtrace_array;
+static unsigned int memtrace_array_nr;
+
+static ssize_t memtrace_read(struct file *filp, char __user *ubuf,
+size_t count, loff_t *ppos)
+{
+   struct memtrace_entry *ent = filp->private_data;
+
+   return simple_read_from_buffer(ubuf, count, ppos, ent->mem, ent->size);
+}
+
+static bool valid_memtrace_range(struct memtrace_entry *dev,
+unsigned long start, unsigned long size)
+{
+   if ((start >= dev->start) &&
+   ((start + size) <= (dev->start + dev->size)))
+   return true;
+
+   return false;
+}
+
+static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   unsigned long size = vma->vm_end - vma->vm_start;
+   struct memtrace_entry *dev = filp->private_data;
+
+   if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
+   return -EINVAL;
+
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+   if (remap_pfn_range(vma, vma->vm_start,
+  vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
+  size, vma->vm_page_prot))
+   return -EAGAIN;
+
+   return 0;
+}
+
+static const struct file_operations memtrace_fops = {
+   .llseek = default_llseek,
+   .read   = memtrace_read,
+   .mmap   = memtrace_mmap,
+   .open   = simple_open,
+};
+
+static void flush_memory_region(u64 base, u64 size)
+{
+   unsigned long line_size = ppc64_caches.l1d.line_size;
+   u64 end = base + size;
+   u64 addr;
+
+   base = round_down(base, line_size);
+   end = round_up(end, line_size);
+
+   for (addr = base; addr < end; addr += line_size)
+   asm volatile("dcbf 0,%0" : : "r" (addr) : "memory");
+}
+
+static int check_memblock_online(struct memory_block *mem, void *arg)
+{
+   if (mem->state != MEM_ONLINE)
+   return -1;
+
+   return 0;
+}
+
+static int change_memblock_state(struct memory_block *mem, void *arg)
+{
+   unsigned long state = (unsigned long)arg;
+
+   mem->state = state;
+   return 0;
+}
+
+static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+{
+   u64 end_pfn = start_pfn + nr_pages - 1;
+
+   if (walk_memory_range(start_pfn, end_pfn, NULL,
+   check_memblock_online))
+   return false;
+
+   walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
+ change_memblock_state);
+
+   if (offline_pages(start_pfn, nr_pages)

[PATCH v3 3/3] Add documentation for the powerpc memtrace debugfs files

2017-05-31 Thread Rashmica Gupta
CONFIG_PPC64_HARDWARE_TRACING must be set to use this feature. This can only
be used on powernv platforms.

Signed-off-by: Rashmica Gupta 
---
 Documentation/ABI/testing/ppc-memtrace | 45 ++
 1 file changed, 45 insertions(+)
 create mode 100644 Documentation/ABI/testing/ppc-memtrace

diff --git a/Documentation/ABI/testing/ppc-memtrace 
b/Documentation/ABI/testing/ppc-memtrace
new file mode 100644
index 000..f7eff02
--- /dev/null
+++ b/Documentation/ABI/testing/ppc-memtrace
@@ -0,0 +1,45 @@
+What:  /sys/kernel/debug/powerpc/memtrace
+Date:  May 2017
+KernelVersion: 4.13?
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   This folder contains the relevant debugfs files for the
+   hardware trace macro to use. CONFIG_PPC64_HARDWARE_TRACING
+   must be set.
+
+What:  /sys/kernel/debug/powerpc/memtrace/enable
+Date:  May 2017
+KernelVersion: 4.13?
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   Write an integer containing the size of the memory you want
+   removed from each NUMA node to this file - it must be
+   aligned to the memblock size. That much RAM will then be
+   removed from the kernel mappings of each node, and the
+   debugfs files described below will be created. This can
+   only be done successfully once per boot; once memory has
+   been removed from each node, the files below appear.
+
+What:  /sys/kernel/debug/powerpc/memtrace/
+Date:  May 2017
+KernelVersion: 4.13?
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   This directory contains information about the removed memory
+   from the specific NUMA node.
+
+What:  /sys/kernel/debug/powerpc/memtrace//size
+Date:  May 2017
+KernelVersion: 4.13?
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   This contains the size of the memory removed from the node.
+
+What:  /sys/kernel/debug/powerpc/memtrace//start
+Date:  May 2017
+KernelVersion: 4.13?
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   This contains the start address of the removed memory.
+
+What:  /sys/kernel/debug/powerpc/memtrace//trace
+Date:  May 2017
+KernelVersion: 4.13?
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   This is where the hardware trace macro will output the trace
+   it generates.
-- 
2.9.3



Re: linux-next: Tree for May 31

2017-05-31 Thread Michael Ellerman
Stephen Rothwell  writes:

> Hi all,
>
> Changes since 20170530:
>
> The mfd tree gained a build failure so I used the version from
> next-20170530.
>
> The drivers-x86 tree gained the same build failure as the mfd tree so
> I used the version from next-20170530.
>
> The rtc tree gained a build failure so I used the version from
> next-20170530.
>
> The akpm tree lost a patch that turned up elsewhere.
>
> Non-merge commits (relative to Linus' tree): 3325
>  3598 files changed, 135000 insertions(+), 72065 deletions(-)

More or less all my powerpc boxes failed to boot this.

All the stack traces point to new_slab():

  PID hash table entries: 4096 (order: -1, 32768 bytes)
  Memory: 127012480K/134217728K available (12032K kernel code, 1920K rwdata, 
2916K rodata, 1088K init, 14065K bss, 487808K reserved, 6717440K cma-reserved)
  Unable to handle kernel paging request for data at address 0x04f0
  Faulting instruction address: 0xc033fd48
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 
  NUMA 
  PowerNV
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 
4.12.0-rc3-gccN-next-20170531-gf2882f4 #1
  task: c0fb1200 task.stack: c1104000
  NIP: c033fd48 LR: c033fb1c CTR: c02d6ae0
  REGS: c1107970 TRAP: 0380   Not tainted  
(4.12.0-rc3-gccN-next-20170531-gf2882f4)
  MSR: 92001033 
CR: 22042244  XER: 
  CFAR: c033fbfc SOFTE: 0 
  GPR00: c033fb1c c1107bf0 c1108b00 c0076180 
  GPR04: c1139600  0007f988 0080 
  GPR08: c11cf5d8 04f0  c0076280 
  GPR12: 28042822 cfd4   
  GPR16:  c0dc9198 c0dc91c8 006f 
  GPR20: 0001 2000 014000c0  
  GPR24: 0201 c007f901  80010400 
  GPR28: 0001 0006 f1fe4000 c0f15958 
  NIP [c033fd48] new_slab+0x318/0x710
  LR [c033fb1c] new_slab+0xec/0x710
  Call Trace:
  [c1107bf0] [c033fb1c] new_slab+0xec/0x710 (unreliable)
  [c1107cc0] [c0348cc0] __kmem_cache_create+0x270/0x800
  [c1107df0] [c0ece8b4] create_boot_cache+0xa0/0xe4
  [c1107e70] [c0ed30d0] kmem_cache_init+0x68/0x16c
  [c1107f00] [c0ea0b08] start_kernel+0x2a0/0x554
  [c1107f90] [c000ad70] start_here_common+0x1c/0x4ac
  Instruction dump:
  57bd039c 79291f24 7fbd0074 7c68482a 7bbdd182 3bbd0005 6000 3d230001 
  e95e0038 e9299a7a 3929009e 79291f24 <7f6a482a> e93b0080 7fa34800 409e036c 
  ---[ end trace  ]---
  
  Kernel panic - not syncing: Attempted to kill the idle task!
  Rebooting in 10 seconds..


cheers


Re: linux-next: Tree for May 31

2017-05-31 Thread Stephen Rothwell
Hi Michael,

On Thu, 01 Jun 2017 16:07:51 +1000 Michael Ellerman  wrote:
>
> Stephen Rothwell  writes:
> 
> > Changes since 20170530:
> >
> > Non-merge commits (relative to Linus' tree): 3325
> >  3598 files changed, 135000 insertions(+), 72065 deletions(-)  
> 
> More or less all my powerpc boxes failed to boot this.

Good timing :-)  How about the linux-next I just released?  It has had
a few of the mm changes removed since yesterday.

> All the stack traces point to new_slab():
> 
>   PID hash table entries: 4096 (order: -1, 32768 bytes)
>   Memory: 127012480K/134217728K available (12032K kernel code, 1920K rwdata, 
> 2916K rodata, 1088K init, 14065K bss, 487808K reserved, 6717440K cma-reserved)
>   Unable to handle kernel paging request for data at address 0x04f0
>   Faulting instruction address: 0xc033fd48
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   SMP NR_CPUS=2048 
>   NUMA 
>   PowerNV
>   Modules linked in:
>   CPU: 0 PID: 0 Comm: swapper Not tainted 
> 4.12.0-rc3-gccN-next-20170531-gf2882f4 #1
>   task: c0fb1200 task.stack: c1104000
>   NIP: c033fd48 LR: c033fb1c CTR: c000002d6ae0
>   REGS: c1107970 TRAP: 0380   Not tainted  
> (4.12.0-rc3-gccN-next-20170531-gf2882f4)
>   MSR: 92001033 
> CR: 22042244  XER: 
>   CFAR: c033fbfc SOFTE: 0 
>   GPR00: c033fb1c c1107bf0 c1108b00 c0076180 
>   GPR04: c1139600  0007f988 0080 
>   GPR08: c11cf5d8 04f0  c0076280 
>   GPR12: 28042822 cfd4   
>   GPR16:  c0dc9198 c0dc91c8 006f 
>   GPR20: 0001 2000 014000c0  
>   GPR24: 0201 c007f901  80010400 
>   GPR28: 0001 0006 f1fe4000 c0f15958 
>   NIP [c033fd48] new_slab+0x318/0x710
>   LR [c033fb1c] new_slab+0xec/0x710
>   Call Trace:
>   [c1107bf0] [c033fb1c] new_slab+0xec/0x710 (unreliable)
>   [c1107cc0] [c0348cc0] __kmem_cache_create+0x270/0x800
>   [c1107df0] [c0ece8b4] create_boot_cache+0xa0/0xe4
>   [c1107e70] [c0ed30d0] kmem_cache_init+0x68/0x16c
>   [c1107f00] [c0ea0b08] start_kernel+0x2a0/0x554
>   [c1107f90] [c000ad70] start_here_common+0x1c/0x4ac
>   Instruction dump:
>   57bd039c 79291f24 7fbd0074 7c68482a 7bbdd182 3bbd0005 6000 3d230001 
>   e95e0038 e9299a7a 3929009e 79291f24 <7f6a482a> e93b0080 7fa34800 409e036c 
>   ---[ end trace  ]---
>   
>   Kernel panic - not syncing: Attempted to kill the idle task!
>   Rebooting in 10 seconds..

-- 
Cheers,
Stephen Rothwell


Re: [PATCH V3 2/2] KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9

2017-05-31 Thread Suraj Jitindar Singh
On Mon, 2017-05-29 at 20:12 +1000, Paul Mackerras wrote:
> This allows userspace (e.g. QEMU) to enable large decrementer mode
> for
> the guest when running on a POWER9 host, by setting the LPCR_LD bit
> in
> the guest LPCR value.  With this, the guest exit code saves 64 bits
> of
> the guest DEC value on exit.  Other places that use the guest DEC
> value check the LPCR_LD bit in the guest LPCR value, and if it is
> set,
> omit the 32-bit sign extension that would otherwise be done.
> 
> This doesn't change the DEC emulation used by PR KVM because PR KVM
> is not supported on POWER9 yet.
> 
> This is partly based on an earlier patch by Oliver O'Halloran.
> 
> Signed-off-by: Paul Mackerras 

Tested with a hacked up qemu and upstream guest/host (with these
patches).

Tested-by: Suraj Jitindar Singh 

> ---
>  arch/powerpc/include/asm/kvm_host.h |  2 +-
>  arch/powerpc/kvm/book3s_hv.c|  6 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 29
> -
>  arch/powerpc/kvm/emulate.c  |  4 ++--
>  4 files changed, 33 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h
> b/arch/powerpc/include/asm/kvm_host.h
> index 9c51ac4..3f879c8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -579,7 +579,7 @@ struct kvm_vcpu_arch {
>   ulong mcsrr0;
>   ulong mcsrr1;
>   ulong mcsr;
> - u32 dec;
> + ulong dec;
>  #ifdef CONFIG_BOOKE
>   u32 decar;
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_hv.c
> b/arch/powerpc/kvm/book3s_hv.c
> index 42b7a4f..9b2eb66 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1143,6 +1143,12 @@ static void kvmppc_set_lpcr(struct kvm_vcpu
> *vcpu, u64 new_lpcr,
>   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
>   if (cpu_has_feature(CPU_FTR_ARCH_207S))
>   mask |= LPCR_AIL;
> + /*
> +  * On POWER9, allow userspace to enable large decrementer
> for the
> +  * guest, whether or not the host has it enabled.
> +  */
> + if (cpu_has_feature(CPU_FTR_ARCH_300))
> + mask |= LPCR_LD;
>  
>   /* Broken 32-bit version of LPCR must not clear top bits */
>   if (preserve_top32)
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index e390b38..3c901b5 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -920,7 +920,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
>   mftbr7
>   subfr3,r7,r8
>   mtspr   SPRN_DEC,r3
> - stw r3,VCPU_DEC(r4)
> + std r3,VCPU_DEC(r4)
>  
>   ld  r5, VCPU_SPRG0(r4)
>   ld  r6, VCPU_SPRG1(r4)
> @@ -1032,7 +1032,13 @@ kvmppc_cede_reentry:   /* r4 =
> vcpu, r13 = paca */
>   li  r0, BOOK3S_INTERRUPT_EXTERNAL
>   bne cr1, 12f
>   mfspr   r0, SPRN_DEC
> - cmpwi   r0, 0
> +BEGIN_FTR_SECTION
> + /* On POWER9 check whether the guest has large decrementer
> enabled */
> + andis.  r8, r8, LPCR_LD@h
> + bne 15f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> + extsw   r0, r0
> +15:  cmpdi   r0, 0
>   li  r0, BOOK3S_INTERRUPT_DECREMENTER
>   bge 5f
>  
> @@ -1459,12 +1465,18 @@ mc_cont:
>   mtspr   SPRN_SPURR,r4
>  
>   /* Save DEC */
> + ld  r3, HSTATE_KVM_VCORE(r13)
>   mfspr   r5,SPRN_DEC
>   mftbr6
> + /* On P9, if the guest has large decr enabled, don't sign
> extend */
> +BEGIN_FTR_SECTION
> + ld  r4, VCORE_LPCR(r3)
> + andis.  r4, r4, LPCR_LD@h
> + bne 16f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>   extsw   r5,r5
> - add r5,r5,r6
> +16:  add r5,r5,r6
>   /* r5 is a guest timebase value here, convert to host TB */
> - ld  r3,HSTATE_KVM_VCORE(r13)
>   ld  r4,VCORE_TB_OFFSET(r3)
>   subfr5,r4,r5
>   std r5,VCPU_DEC_EXPIRES(r9)
> @@ -2376,8 +2388,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
>   mfspr   r3, SPRN_DEC
>   mfspr   r4, SPRN_HDEC
>   mftbr5
> +BEGIN_FTR_SECTION
> + /* On P9 check whether the guest has large decrementer mode
> enabled */
> + ld  r6, HSTATE_KVM_VCORE(r13)
> + ld  r6, VCORE_LPCR(r6)
> + andis.  r6, r6, LPCR_LD@h
> + bne 68f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>   extsw   r3, r3
> - EXTEND_HDEC(r4)
> +68:  EXTEND_HDEC(r4)
>   cmpdr3, r4
>   ble 67f
>   mtspr   SPRN_DEC, r4
> diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
> index c873ffe..4d8b4d6 100644
> --- a/arch/powerpc/kvm/emulate.c
> +++ b/arch/powerpc/kvm/emulate.c
> @@ -39,7 +39,7 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu)
>   unsigned long dec_nsec;
>   unsigned long long dec_time;
>  
> - pr_debug("mtDEC: %x\n", vcpu->arch.dec);
> + pr_debug("mtDEC: %lx\n", vcpu->arch.dec);
>   hrtimer_try_to_cancel(&vcpu->arch.dec_ti