Re: [PATCH net-next] powerpc: use asm-generic/socket.h as much as possible
On Wed, May 31, 2017 at 7:43 AM, Stephen Rothwell wrote:
> asm-generic/socket.h already has an exception for the differences that
> powerpc needs, so just include it after defining the differences.
>
> Signed-off-by: Stephen Rothwell

Acked-by: Arnd Bergmann
Re: [PATCH 6/6] cpuidle-powernv: Allow Deep stop states that don't stop time
On Tue, May 30, 2017 at 09:10:06PM +1000, Nicholas Piggin wrote:
> On Tue, 30 May 2017 16:20:55 +0530
> Gautham R Shenoy wrote:
>
> > On Tue, May 30, 2017 at 05:13:57PM +1000, Nicholas Piggin wrote:
> > > On Tue, 16 May 2017 14:19:48 +0530
> > > "Gautham R. Shenoy" wrote:
> > >
> > > > From: "Gautham R. Shenoy"
> > > >
> > > > The current code in the cpuidle-powernv initialization only allows deep
> > > > stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose timebase
> > > > (indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes back to
> > > > POWER8 time where deep states used to lose the timebase. However, on
> > > > POWER9, we do have stop states that are deep (they lose hypervisor
> > > > state) but retain the timebase.
> > > >
> > > > Fix the initialization code in the cpuidle-powernv driver to allow
> > > > such deep states.
> > > >
> > > > Further, there is a bug in cpuidle-powernv driver with
> > > > CONFIG_TICK_ONESHOT=n where we end up incrementing the nr_idle_states
> > > > even if a platform idle state which loses time base was not added to
> > > > the cpuidle table.
> > > >
> > > > Fix this by ensuring that the nr_idle_states variable gets incremented
> > > > only when the platform idle state was added to the cpuidle table.
> > >
> > > Should this be a separate patch? Stable?
> >
> > Ok. Will send it out separately.
>
> Looks like mpe has merged this in next now. I just wonder if this
> particular bit would be relevant for POWER8 and therefore be a
> stable candidate? All the POWER9 idle fixes may not be suitable for
> stable.

I agree. The other POWER9 fixes aren't suitable for stable. I will clean
this patch alone based on your suggestion and mark it for stable.

> > > > Signed-off-by: Gautham R. Shenoy
> > > > ---
> > > >  drivers/cpuidle/cpuidle-powernv.c | 16 ++--
> > > >  1 file changed, 10 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> > > > index 12409a5..45eaf06 100644
> > > > --- a/drivers/cpuidle/cpuidle-powernv.c
> > > > +++ b/drivers/cpuidle/cpuidle-powernv.c
> > > > @@ -354,6 +354,7 @@ static int powernv_add_idle_states(void)
> > > >
> > > >  	for (i = 0; i < dt_idle_states; i++) {
> > > >  		unsigned int exit_latency, target_residency;
> > > > +		bool stops_timebase = false;
> > > >  		/*
> > > >  		 * If an idle state has exit latency beyond
> > > >  		 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
> > > > @@ -381,6 +382,9 @@ static int powernv_add_idle_states(void)
> > > >  			}
> > > >  		}
> > > >
> > > > +		if (flags[i] & OPAL_PM_TIMEBASE_STOP)
> > > > +			stops_timebase = true;
> > > > +
> > > >  		/*
> > > >  		 * For nap and fastsleep, use default target_residency
> > > >  		 * values if f/w does not expose it.
> > > >  		 */
> > > > @@ -392,8 +396,7 @@ static int powernv_add_idle_states(void)
> > > >  			add_powernv_state(nr_idle_states, "Nap",
> > > >  					  CPUIDLE_FLAG_NONE, nap_loop,
> > > >  					  target_residency,
> > > >  					  exit_latency, 0, 0);
> > > > -		} else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
> > > > -			   !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> > > > +		} else if (has_stop_states && !stops_timebase) {
> > > >  			add_powernv_state(nr_idle_states, names[i],
> > > >  					  CPUIDLE_FLAG_NONE, stop_loop,
> > > >  					  target_residency,
> > > >  					  exit_latency,
> > > > @@ -405,8 +408,8 @@ static int powernv_add_idle_states(void)
> > > >  		 * within this config dependency check.
> > > >  		 */
> > > >  #ifdef CONFIG_TICK_ONESHOT
> > > > -		if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> > > > -		    flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
> > > > +		else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> > > > +			 flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
> > >
> > > Hmm, seems okay but readability isn't the best with the ifdef and
> > > mixing power8 and 9 cases IMO.
> > >
> > > Particularly with the nice regular POWER9 states, we're not doing much
> > > logic in this loop besides checking for the timebase stop flag, right?
> > > Would it be clearer if it was changed to something like this (untested
> > > quick hack)?
> >
> > Yes, this is very much doable. Some comments below.
> >
> > > ---
> > >  drivers/cpuidle/cpuidle-powernv.c | 76 +++
> > >  1 file changed, 37 insertions(+), 39 deletions(-)
> > >
> > > diff --git a/drivers/cpuidle/cpuidle-powernv.c
> > > b/drivers/cpuidle/cpuidle-powern
Re: [PATCH net-next] powerpc: use asm-generic/socket.h as much as possible
Stephen Rothwell writes:
> asm-generic/socket.h already has an exception for the differences that
> powerpc needs, so just include it after defining the differences.
>
> Signed-off-by: Stephen Rothwell
> ---
>  arch/powerpc/include/uapi/asm/socket.h | 92 +-
>  1 file changed, 1 insertion(+), 91 deletions(-)
>
> Build tested using powerpc allyesconfig, pseries_le_defconfig, 32 bit
> and 64 bit allnoconfig and ppc44x_defconfig builds.

Did you boot it and test that userspace was happy doing sockety things?

cheers
Re: [PATCH] rtc/tpo: Handle disabled TPO in opal_get_tpo_time()
On 19/05/2017 at 15:35:09 +0530, Vaibhav Jain wrote:
> On PowerNV platform when Timed-Power-On(TPO) is disabled, read of
> stored TPO yields value with all date components set to '0' inside
> opal_get_tpo_time(). The function opal_to_tm() then converts it to an
> offset from year 1900 yielding alarm-time == "1900-00-01 00:00:00".
> This causes problems with __rtc_read_alarm() that expects an offset
> from "1970-00-01 00:00:00", and the returned alarm-time results in a
> -ve value for time64_t. Which ultimately results in this error
> reported in kernel logs with a seemingly garbage value:
>
> "rtc rtc0: invalid alarm value: -2-1--1041528741 200557:71582844:32"
>
> We fix this by explicitly handling the case of all alarm date-time
> components being '0' inside opal_get_tpo_time() and returning -ENOENT
> in such a case. This signals generic rtc that no alarm is set and it
> bails out from the alarm initialization flow without reporting the
> above error.
>
> Signed-off-by: Vaibhav Jain
> Reported-by: Steve Best
> ---
>  drivers/rtc/rtc-opal.c | 10 ++
>  1 file changed, 10 insertions(+)
>

Applied, thanks.

-- 
Alexandre Belloni, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
Re: [PATCH] drivers/rtc/interface.c: Validate alarm-time before handling rollover
On 19/05/2017 at 22:18:55 +0530, Vaibhav Jain wrote:
> In function __rtc_read_alarm() its possible for an alarm time-stamp to
> be invalid even after replacing missing components with current
> time-stamp. The condition 'alarm->time.tm_year < 70' will trigger this
> case and will cause the call to 'rtc_tm_to_time64(&alarm->time)'
> return a negative value for variable t_alm.
>
> While handling alarm rollover this negative t_alm (assumed to be a
> seconds offset from '1970-01-01 00:00:00') is converted back to
> rtc_time via rtc_time64_to_tm() which results in this error log with
> seemingly garbage values:
>
> "rtc rtc0: invalid alarm value: -2-1--1041528741 200557:71582844:32"
>
> This error was generated when the rtc driver (rtc-opal in this case)
> returned an alarm time-stamp of '00-00-00 00:00:00' to indicate that
> the alarm is disabled. Though I have submitted a separate fix for the
> rtc-opal driver, this issue may potentially impact other
> existing/future rtc drivers.
>
> To fix this issue the patch validates the alarm time-stamp just after
> filling up the missing datetime components and if rtc_valid_tm() still
> reports it to be invalid then bails out of the function without
> handling the rollover.
>
> Reported-by: Steve Best
> Signed-off-by: Vaibhav Jain
> ---
>  drivers/rtc/interface.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>

Applied, thanks.

-- 
Alexandre Belloni, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
Re: [PATCH net-next] powerpc: use asm-generic/socket.h as much as possible
Hi Michael,

On Wed, 31 May 2017 20:15:55 +1000 Michael Ellerman wrote:
>
> Stephen Rothwell writes:
>
> > asm-generic/socket.h already has an exception for the differences that
> > powerpc needs, so just include it after defining the differences.
> >
> > Signed-off-by: Stephen Rothwell
> > ---
> >  arch/powerpc/include/uapi/asm/socket.h | 92 +-
> >  1 file changed, 1 insertion(+), 91 deletions(-)
> >
> > Build tested using powerpc allyesconfig, pseries_le_defconfig, 32 bit
> > and 64 bit allnoconfig and ppc44x_defconfig builds.
>
> Did you boot it and test that userspace was happy doing sockety things?

No, sorry. The patch was done by inspection, but it is pretty obvious ...
here is the diff between arch/powerpc/include/uapi/asm/socket.h and
include/uapi/asm-generic/socket.h before the patch:

--- arch/powerpc/include/uapi/asm/socket.h	2017-05-31 20:56:54.940473709 +1000
+++ include/uapi/asm-generic/socket.h	2017-05-31 10:04:16.716445463 +1000
@@ -1,12 +1,5 @@
-#ifndef _ASM_POWERPC_SOCKET_H
-#define _ASM_POWERPC_SOCKET_H
-
-/*
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
+#ifndef __ASM_GENERIC_SOCKET_H
+#define __ASM_GENERIC_SOCKET_H
 
 #include <asm/sockios.h>
 
@@ -30,12 +23,14 @@
 #define SO_LINGER	13
 #define SO_BSDCOMPAT	14
 #define SO_REUSEPORT	15
-#define SO_RCVLOWAT	16
-#define SO_SNDLOWAT	17
-#define SO_RCVTIMEO	18
-#define SO_SNDTIMEO	19
-#define SO_PASSCRED	20
-#define SO_PEERCRED	21
+#ifndef SO_PASSCRED	/* powerpc only differs in these */
+#define SO_PASSCRED	16
+#define SO_PEERCRED	17
+#define SO_RCVLOWAT	18
+#define SO_SNDLOWAT	19
+#define SO_RCVTIMEO	20
+#define SO_SNDTIMEO	21
+#endif
 
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION	22
@@ -71,7 +66,7 @@
 #define SO_RXQ_OVFL	40
 
 #define SO_WIFI_STATUS	41
-#define SCM_WIFI_STATUS	SO_WIFI_STATUS
+#define SCM_WIFI_STATUS SO_WIFI_STATUS
 #define SO_PEEK_OFF	42
 
 /* Instruct lower device to use last 4-bytes of skb data as FCS */
@@ -107,4 +102,4 @@
 #define SCM_TIMESTAMPING_PKTINFO	58
 
-#endif /* _ASM_POWERPC_SOCKET_H */
+#endif /* __ASM_GENERIC_SOCKET_H */

-- 
Cheers,
Stephen Rothwell
Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc
On 05/29/2017 12:32 AM, Michael Ellerman wrote:
> Reza Arbab writes:
>
>> On Fri, May 26, 2017 at 01:46:58PM +1000, Michael Ellerman wrote:
>>> Reza Arbab writes:
>>>
>>>> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>>>> The commit message for 3af229f2071f says:
>>>>>
>>>>>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>>>>    do not support node hotplug on power in the first place, so the nodes
>>>>>                ^^^
>>>>>    that are online when we come up are the nodes that will be present for
>>>>>    the lifetime of this kernel.
>>>>>
>>>>> Is that no longer true?
>>>>
>>>> I don't know what the reasoning behind that statement was at the time,
>>>> but as far as I can tell, the only thing missing for node hotplug now
>>>> is Balbir's patchset [1]. He fixes the resource issue which motivated
>>>> 3af229f2071f and reverts it.
>>>>
>>>> With that set, I can instantiate a new numa node just by doing
>>>> add_memory(nid, ...) where nid doesn't currently exist.
>>>
>>> But does that actually happen on any real system?
>>
>> I don't know if anything currently tries to do this. My interest in
>> having this working is so that in the future, our coherent gpu memory
>> could be added as a distinct node by the device driver.
>
> Sure. If/when that happens, we would hopefully still have some way to
> limit the size of the possible map.
>
> That would ideally be a firmware property that tells us the maximum
> number of GPUs that might be hot-added, or we punt and cap it at some
> "sane" maximum number.
>
> But until that happens it's silly to say we can have up to 256 nodes
> when in practice most of our systems have 8 or less.
>
> So I'm still waiting for an explanation from Michael B on how he's
> seeing this bug in practice.

I already answered this in an earlier message. I will give an example.

* Let there be a configuration with nodes (0, 4-5, 8) that boots with
  1 VP and 10G of memory in a shared processor configuration.
* At boot time, 4 nodes are put into the possible map by the PowerPC
  boot code.
* Subsequently, the NUMA code executes and puts the 10G memory into
  nodes 4 & 5. No memory goes into Node 0. So we now have 2 nodes in
  the node_online_map.
* The VP and its threads get assigned to Node 4.
* Then when 'initmem_init()' in 'powerpc/numa.c' executes the instruction,

      node_and(node_possible_map, node_possible_map, node_online_map);

  the content of the node_possible_map is reduced to nodes 4-5.
* Later on we hot-add 90G of memory to the system. It tries to put the
  memory into nodes 0, 4-5, 8 based on the memory association map. We
  should see memory put into all 4 nodes. However, since we have reduced
  the 'node_possible_map' to only nodes 4 & 5, we can now only put
  memory into 2 of the configured nodes.

# We want to be able to put memory into all 4 nodes via hot-add
operations, not only the nodes that 'survive' boot time initialization.

We could make a number of changes to ensure that all of the nodes in
the initial configuration provided by the pHyp can be used, but this
one appears to be the simplest, only using resources requested by the
pHyp at boot -- even if those resources are not used immediately.

> cheers

Regards,
Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
m...@linux.vnet.ibm.com
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code
to be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default
> >   order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. Ok debugging increased the object size to an order 4. This should
be order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?
Re: [v3 0/9] parallelized "struct page" zeroing
On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
> > Could you be more specific? E.g. how are other stores done in
> > __init_single_page safe then? I am sorry to be dense here but how does
> > the full 64B store differ from other stores done in the same function.
>
> Hi Michal,
>
> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
> without membar, but they are slower compared to STBI which require a
> membar before memory can be accessed.

OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
would do memset. You said it would be slower but would that be
measurable? I am sorry to be so persistent here but I would be really
happier if this didn't depend on the deferred initialization. If this is
absolutely a no-go then I can live with that of course.
-- 
Michal Hocko
SUSE Labs
Re: [v3 0/9] parallelized "struct page" zeroing
From: Michal Hocko
Date: Wed, 31 May 2017 18:31:31 +0200

> On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
> > > Could you be more specific? E.g. how are other stores done in
> > > __init_single_page safe then? I am sorry to be dense here but how does
> > > the full 64B store differ from other stores done in the same function.
> >
> > Hi Michal,
> >
> > It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
> > without membar, but they are slower compared to STBI which require a
> > membar before memory can be accessed.
>
> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

It is measurable. That's the impetus for this work in the first place.

When we do the memory barrier, the whole store buffer flushes because
the memory barrier is done with a dependency on the next load or store
operation, one of which the caller is going to do immediately.
Re: 4.12-rc ppc64 4k-page needs costly allocations
[ Merging two mails into one response ]

On Wed, 31 May 2017, Christoph Lameter wrote:
> On Tue, 30 May 2017, Hugh Dickins wrote:
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default
> >   order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
>
> I am curious as to what is going on there. Do you have the output from
> these failed allocations?

I thought the relevant output was in my mail. I did skip the Mem-Info
dump, since that just seemed noise in this case: we know memory can get
fragmented. What more output are you looking for?

> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
>
> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
> be able to enable it at runtime.

Yes, I thought so.

> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> > makes no real difference to the outcome: swapping loads still abort early.
>
> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>
> Ahh. Ok debugging increased the object size to an order 4. This should be
> order 3 without debugging.

But it was still order 4 when booted with slub_debug=O, which surprised me.
And that surprises you too? If so, then we ought to dig into it further.

> Why are the slab allocators used to create slab caches for large object
> sizes?

There may be more optimal ways to allocate, but I expect that when
the ppc guys are writing the code to handle both 4k and 64k page sizes,
kmem caches offer the best span of possibility without complication.

> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
>
> I thought you had these huge 64k page sizes?

ppc64 does support 64k page sizes, and they've been the default for years;
but since 4k pages are still supported, I choose to use those (I doubt
I could ever get the same load going with 64k pages).

Hugh
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Wed, May 31, 2017 at 8:44 PM, Hugh Dickins wrote:
> [ Merging two mails into one response ]
>
> On Wed, 31 May 2017, Christoph Lameter wrote:
>> On Tue, 30 May 2017, Hugh Dickins wrote:
>> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default
>> >   order: 4, min order: 4
>> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>>
>> > I did try booting with slub_debug=O as the message suggested, but that
>> > made no difference: it still hoped for but failed on order:4 allocations.
>>
>> I am curious as to what is going on there. Do you have the output from
>> these failed allocations?
>
> I thought the relevant output was in my mail. I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented. What more output are you looking for?
>
>> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
>> > it seemed to be a hard requirement for something, but I didn't find what.
>>
>> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
>> be able to enable it at runtime.
>
> Yes, I thought so.
>
>> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
>> > the expected order:3, which then results in OOM-killing rather than direct
>> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
>> > makes no real difference to the outcome: swapping loads still abort early.
>>
>> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>>
>> Ahh. Ok debugging increased the object size to an order 4. This should be
>> order 3 without debugging.
>
> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too? If so, then we ought to dig into it further.
>
>> Why are the slab allocators used to create slab caches for large object
>> sizes?
>
> There may be more optimal ways to allocate, but I expect that when
> the ppc guys are writing the code to handle both 4k and 64k page sizes,
> kmem caches offer the best span of possibility without complication.
>
>> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
>> > with 4k pages would do better not to expect to support a 128TB userspace.
>>
>> I thought you had these huge 64k page sizes?
>
> ppc64 does support 64k page sizes, and they've been the default for years;
> but since 4k pages are still supported, I choose to use those (I doubt
> I could ever get the same load going with 64k pages).

4k is pretty much required on ppc64 when it comes to nouveau:
https://bugs.freedesktop.org/show_bug.cgi?id=94757

2cts
Re: [v3 0/9] parallelized "struct page" zeroing
> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

Hi Michal,

This is actually a very good idea. I just did some measurements, and it
looks like performance is very good.

Here is data from SPARC-M7 with 3312G memory with single thread performance:

Current:
  memset() in memblock allocator takes: 8.83s
  __init_single_page() takes: 8.63s

Option 1:
  memset() in __init_single_page() takes: 61.09s (as we discussed,
  because of membar overhead, memset should really be optimized to do
  STBI only when size is 1 page or bigger).

Option 2:
  8 stores (stx) in __init_single_page(): 8.525s!

So, even for single thread performance we can double the initialization
speed of "struct page" on SPARC by removing memset() from memblock, and
using 8 stx in __init_single_page(). It appears we never miss L1 in
__init_single_page() after the initial 8 stx.

I will update patches with memset() on other platforms, and stx on
SPARC.

My experimental code looks like this:

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
					 unsigned long zone, int nid)
{
	__asm__ __volatile__(
	"stx	%%g0, [%0 + 0x00]\n"
	"stx	%%g0, [%0 + 0x08]\n"
	"stx	%%g0, [%0 + 0x10]\n"
	"stx	%%g0, [%0 + 0x18]\n"
	"stx	%%g0, [%0 + 0x20]\n"
	"stx	%%g0, [%0 + 0x28]\n"
	"stx	%%g0, [%0 + 0x30]\n"
	"stx	%%g0, [%0 + 0x38]\n"
	:
	: "r"(page));

	set_page_links(page, zone, nid, pfn);
	init_page_count(page);
	page_mapcount_reset(page);
	page_cpupid_reset_last(page);
	INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
	if (!is_highmem_idx(zone))
		set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}

Thank you,
Pasha
Re: [PATCH] powerpc/powernv: Enable PCI peer-to-peer
Hi Frederic,

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.12-rc3 next-20170531]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Frederic-Barrat/powerpc-powernv-Enable-PCI-peer-to-peer/20170531-035613
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allmodconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

>> arch/powerpc/platforms/powernv/pci-ioda.c:1411:13: error: static declaration of 'pnv_pci_ioda2_set_bypass' follows non-static declaration
    static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
                ^~~~
   In file included from arch/powerpc/platforms/powernv/pci-ioda.c:48:0:
   arch/powerpc/platforms/powernv/pci.h:233:13: note: previous declaration of 'pnv_pci_ioda2_set_bypass' was here
    extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
                ^~~~

vim +/pnv_pci_ioda2_set_bypass +1411 arch/powerpc/platforms/powernv/pci-ioda.c

ee8222fe Wei Yang             2015-10-22  1405  	pnv_pci_vf_release_m64(pdev, num_vfs);
781a868f Wei Yang             2015-03-25  1406  	return -EBUSY;
781a868f Wei Yang             2015-03-25  1407  }
781a868f Wei Yang             2015-03-25  1408  
c035e37b Alexey Kardashevskiy 2015-06-05  1409  static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
c035e37b Alexey Kardashevskiy 2015-06-05  1410  		int num);
c035e37b Alexey Kardashevskiy 2015-06-05 @1411  static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
c035e37b Alexey Kardashevskiy 2015-06-05  1412  
781a868f Wei Yang             2015-03-25  1413  static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
781a868f Wei Yang             2015-03-25  1414  {

:: The code at line 1411 was first introduced by commit
:: c035e37b58c75ca216bfd1d5de3c1080ac0022b9 powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release

:: TO: Alexey Kardashevskiy
:: CC: Michael Ellerman

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all       Intel Corporation

.config.gz
Description: application/gzip
Re: 4.12-rc ppc64 4k-page needs costly allocations
Hugh Dickins writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c090b5c0] [c04f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c090b650] [c00eb194] .warn_alloc+0xf0/0x178
> [c090b710] [c00ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c090b8b0] [c013921c] .new_slab+0x234/0x608
> [c090b980] [c013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c090bad0] [c04f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c090bb70] [c013b864] .kmem_cache_alloc+0x140/0x288
> [c090bc30] [c004d934] .mm_init.isra.65+0x128/0x1c0
> [c090bcc0] [c0157810] .do_execveat_common.isra.39+0x294/0x690
> [c090bdb0] [c0157e70] .SyS_execve+0x28/0x38
> [c090be30] [c000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...

Can you try this patch.

commit fc55c0dc8b23446f937c1315aa61e74673de5ee6
Author: Aneesh Kumar K.V
Date:   Thu Jun 1 08:06:40 2017 +0530

    powerpc/mm/4k: Limit 4k page size to 64TB
    
    Supporting 512TB requires us to do a order 3 allocation for level 1
    page table(pgd). Limit 4k to 64TB for now.
    
    Signed-off-by: Aneesh Kumar K.V

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b671ca..0c4e470571ca 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE	(sizeof(pte_t) << H_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a2123f291ab0..5de3271026f1 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -110,13 +110,15 @@ void release_thread(struct task_struct *);
 #define TASK_SIZE_128TB (0x8000UL)
 #define TASK_SIZE_512TB (0x0002UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */
-#define TASK_SIZE_USER64	TASK_SIZE_512TB
+#define TASK_SIZE_USER64		TASK_SIZE_512TB
+#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_128TB
 #else
-#define TASK_SIZE_USER64	TASK_SIZE_64TB
+#define TASK_SIZE_USER64		TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_64TB
 #endif
 
 /*
@@ -132,7 +134,7 @@ void release_thread(struct task_struct *);
  * space during mmap's.
  */
 #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
-#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4))
+#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(DEFAULT_MAP_WINDOW_USER64 / 4))
 
 #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
 		TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
@@ -143,8 +145,8 @@ void release_thread(struct task_struct *);
  * with 128TB and conditionally enable upto 512TB
  */
 #ifdef CONFIG_PPC_BOOK3S_64
-#define DEFAULT_MAP_WINDOW	((is_32bit_task()) ? \
-				 TASK_SIZE_USER32 : TASK_SIZE_128TB)
+#define DEFAULT_MAP_WINDOW	((is_32bit_task()) ? \
+
[PATCH v3 1/3] powerpc/powernv: Add config option for removal of memory
This patch adds the config option to enable the removal of memory from
the kernel mappings at runtime. This needs to be enabled for the
hardware trace macro to work.

Signed-off-by: Rashmica Gupta
---
v2 -> v3: Better description

 arch/powerpc/platforms/powernv/Kconfig  | 8 ++++++++
 arch/powerpc/platforms/powernv/Makefile | 1 +
 2 files changed, 9 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 6a6f4ef..92493d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -30,3 +30,11 @@ config OPAL_PRD
 	help
 	  This enables the opal-prd driver, a facility to run processor
 	  recovery diagnostics on OpenPower machines
+
+config PPC64_HARDWARE_TRACING
+	bool "Enable removal of RAM from kernel mappings for tracing"
+	help
+	  Enabling this option allows for the removal of memory (RAM)
+	  from the kernel mappings to be used for hardware tracing.
+	depends on MEMORY_HOTREMOVE
+	default n
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..8fb026d 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_PPC_SCOM)	+= opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)	+= opal-memory-errors.o
 obj-$(CONFIG_TRACEPOINTS)	+= opal-tracepoints.o
 obj-$(CONFIG_OPAL_PRD)	+= opal-prd.o
+obj-$(CONFIG_PPC64_HARDWARE_TRACING)	+= memtrace.o
-- 
2.9.3
[PATCH v3 2/3] powerpc/powernv: Enable removal of memory for in memory tracing
The hardware trace macro feature requires access to a chunk of real
memory. This patch provides a debugfs interface to do this. By writing
an integer containing the size of memory to be unplugged into
/sys/kernel/debug/powerpc/memtrace/enable, the code will attempt to
remove that much memory from the end of each NUMA node.

This patch also adds additional debugfs files for each node that allow
the tracer to interact with the removed memory, as well as a trace file
that allows userspace to read the generated trace.

Note that this patch does not invoke the hardware trace macro, it only
allows memory to be removed during runtime for the trace macro to
utilise.

Signed-off-by: Rashmica Gupta
---
v2 -> v3:
- Some changes required to compile with 4.12-rc3.
- Iterating from end of node rather than the start.
- As io_remap_pfn_range is defined as remap_pfn_range, just use
  remap_pfn_range.
- Removed the creation of the node debugfs file as it had no use.

 arch/powerpc/platforms/powernv/memtrace.c | 289 ++++++++++++++++++++++++++++++
 1 file changed, 289 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/memtrace.c

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
new file mode 100644
index 000..21fa2e4
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -0,0 +1,289 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) IBM Corporation, 2014
+ *
+ * Author: Anton Blanchard
+ */
+
+#define pr_fmt(fmt) "powernv-memtrace: " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/* This enables us to keep track of the memory removed from each node. */
+struct memtrace_entry {
+	void *mem;
+	u64 start;
+	u64 size;
+	u32 nid;
+	struct dentry *dir;
+	char name[16];
+};
+
+static struct memtrace_entry *memtrace_array;
+static unsigned int memtrace_array_nr;
+
+static ssize_t memtrace_read(struct file *filp, char __user *ubuf,
+			     size_t count, loff_t *ppos)
+{
+	struct memtrace_entry *ent = filp->private_data;
+
+	return simple_read_from_buffer(ubuf, count, ppos, ent->mem, ent->size);
+}
+
+static bool valid_memtrace_range(struct memtrace_entry *dev,
+				 unsigned long start, unsigned long size)
+{
+	if ((start >= dev->start) &&
+	    ((start + size) <= (dev->start + dev->size)))
+		return true;
+
+	return false;
+}
+
+static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct memtrace_entry *dev = filp->private_data;
+
+	if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
+		return -EINVAL;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	if (remap_pfn_range(vma, vma->vm_start,
+			    vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
+			    size, vma->vm_page_prot))
+		return -EAGAIN;
+
+	return 0;
+}
+
+static const struct file_operations memtrace_fops = {
+	.llseek = default_llseek,
+	.read	= memtrace_read,
+	.mmap	= memtrace_mmap,
+	.open	= simple_open,
+};
+
+static void flush_memory_region(u64 base, u64 size)
+{
+	unsigned long line_size = ppc64_caches.l1d.size;
+	u64 end = base + size;
+	u64 addr;
+
+	base = round_down(base, line_size);
+	end = round_up(end, line_size);
+
+	for (addr = base; addr < end; addr += line_size)
+		asm volatile("dcbf 0,%0" : : "r" (addr) : "memory");
+}
+
+static int check_memblock_online(struct memory_block *mem, void *arg)
+{
+	if (mem->state != MEM_ONLINE)
+		return -1;
+
+	return 0;
+}
+
+static int change_memblock_state(struct memory_block *mem, void *arg)
+{
+	unsigned long state = (unsigned long)arg;
+
+	mem->state = state;
+	return 0;
+}
+
+static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+{
+	u64 end_pfn = start_pfn + nr_pages - 1;
+
+	if (walk_memory_range(start_pfn, end_pfn, NULL,
+			      check_memblock_online))
+		return false;
+
+	walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
+			  change_memblock_state);
+
+	if (offline_pages(start_pfn, nr_pages)
[PATCH v3 3/3] Add documentation for the powerpc memtrace debugfs files
CONFIG_PPC64_HARDWARE_TRACING must be set to use this feature. This can
only be used on powernv platforms.

Signed-off-by: Rashmica Gupta
---
 Documentation/ABI/testing/ppc-memtrace | 45 +++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
 create mode 100644 Documentation/ABI/testing/ppc-memtrace

diff --git a/Documentation/ABI/testing/ppc-memtrace b/Documentation/ABI/testing/ppc-memtrace
new file mode 100644
index 000..f7eff02
--- /dev/null
+++ b/Documentation/ABI/testing/ppc-memtrace
@@ -0,0 +1,45 @@
+What:		/sys/kernel/debug/powerpc/memtrace
+Date:		May 2017
+KernelVersion:	4.13?
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This folder contains the relevant debugfs files for the
+		hardware trace macro to use. CONFIG_PPC64_HARDWARE_TRACING
+		must be set.
+
+What:		/sys/kernel/debug/powerpc/memtrace/enable
+Date:		May 2017
+KernelVersion:	4.13?
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	Write an integer containing the size of the memory you want
+		removed from each NUMA node to this file - it must be
+		aligned to the memblock size. This amount of RAM will be
+		removed from the kernel mappings and the following debugfs
+		files will be created. This can only be successfully done
+		once per boot. Once memory is successfully removed from
+		each node, the following files are created.
+
+What:		/sys/kernel/debug/powerpc/memtrace/
+Date:		May 2017
+KernelVersion:	4.13?
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This directory contains information about the removed memory
+		from the specific NUMA node.
+
+What:		/sys/kernel/debug/powerpc/memtrace//size
+Date:		May 2017
+KernelVersion:	4.13?
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This contains the size of the memory removed from the node.
+
+What:		/sys/kernel/debug/powerpc/memtrace//start
+Date:		May 2017
+KernelVersion:	4.13?
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This contains the start address of the removed memory.
+
+What:		/sys/kernel/debug/powerpc/memtrace//trace
+Date:		May 2017
+KernelVersion:	4.13?
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This is where the hardware trace macro will output the trace
+		it generates.
-- 
2.9.3
Re: linux-next: Tree for May 31
Stephen Rothwell writes:

> Hi all,
>
> Changes since 20170530:
>
> The mfd tree gained a build failure so I used the version from
> next-20170530.
>
> The drivers-x86 tree gained the same build failure as the mfd tree so
> I used the version from next-20170530.
>
> The rtc tree gained a build failure so I used the version from
> next-20170530.
>
> The akpm tree lost a patch that turned up elsewhere.
>
> Non-merge commits (relative to Linus' tree): 3325
>  3598 files changed, 135000 insertions(+), 72065 deletions(-)

More or less all my powerpc boxes failed to boot this. All the stack
traces point to new_slab():

PID hash table entries: 4096 (order: -1, 32768 bytes)
Memory: 127012480K/134217728K available (12032K kernel code, 1920K rwdata,
2916K rodata, 1088K init, 14065K bss, 487808K reserved, 6717440K cma-reserved)
Unable to handle kernel paging request for data at address 0x04f0
Faulting instruction address: 0xc033fd48
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048
NUMA
PowerNV
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc3-gccN-next-20170531-gf2882f4 #1
task: c0fb1200 task.stack: c1104000
NIP: c033fd48 LR: c033fb1c CTR: c02d6ae0
REGS: c1107970 TRAP: 0380 Not tainted (4.12.0-rc3-gccN-next-20170531-gf2882f4)
MSR: 92001033 CR: 22042244 XER:
CFAR: c033fbfc SOFTE: 0
GPR00: c033fb1c c1107bf0 c1108b00 c0076180
GPR04: c1139600 0007f988 0080
GPR08: c11cf5d8 04f0 c0076280
GPR12: 28042822 cfd4
GPR16: c0dc9198 c0dc91c8 006f
GPR20: 0001 2000 014000c0
GPR24: 0201 c007f901 80010400
GPR28: 0001 0006 f1fe4000 c0f15958
NIP [c033fd48] new_slab+0x318/0x710
LR [c033fb1c] new_slab+0xec/0x710
Call Trace:
[c1107bf0] [c033fb1c] new_slab+0xec/0x710 (unreliable)
[c1107cc0] [c0348cc0] __kmem_cache_create+0x270/0x800
[c1107df0] [c0ece8b4] create_boot_cache+0xa0/0xe4
[c1107e70] [c0ed30d0] kmem_cache_init+0x68/0x16c
[c1107f00] [c0ea0b08] start_kernel+0x2a0/0x554
[c1107f90] [c000ad70] start_here_common+0x1c/0x4ac
Instruction dump:
57bd039c 79291f24 7fbd0074 7c68482a 7bbdd182 3bbd0005 6000 3d230001
e95e0038 e9299a7a 3929009e 79291f24 <7f6a482a> e93b0080 7fa34800 409e036c
---[ end trace ]---

Kernel panic - not syncing: Attempted to kill the idle task!
Rebooting in 10 seconds..

cheers
Re: linux-next: Tree for May 31
Hi Michael,

On Thu, 01 Jun 2017 16:07:51 +1000 Michael Ellerman wrote:
>
> Stephen Rothwell writes:
>
> > Changes since 20170530:
> >
> > Non-merge commits (relative to Linus' tree): 3325
> >  3598 files changed, 135000 insertions(+), 72065 deletions(-)
>
> More or less all my powerpc boxes failed to boot this.

Good timing :-) How about the linux-next I just released. It has had a
few of the mm changes removed since yesterday.

> All the stack traces point to new_slab():
>
> PID hash table entries: 4096 (order: -1, 32768 bytes)
> Memory: 127012480K/134217728K available (12032K kernel code, 1920K rwdata,
> 2916K rodata, 1088K init, 14065K bss, 487808K reserved, 6717440K cma-reserved)
> Unable to handle kernel paging request for data at address 0x04f0
> Faulting instruction address: 0xc033fd48
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2048
> NUMA
> PowerNV
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted
> 4.12.0-rc3-gccN-next-20170531-gf2882f4 #1
> task: c0fb1200 task.stack: c1104000
> NIP: c033fd48 LR: c033fb1c CTR: c000002d6ae0
> REGS: c1107970 TRAP: 0380 Not tainted
> (4.12.0-rc3-gccN-next-20170531-gf2882f4)
> MSR: 92001033
> CR: 22042244 XER:
> CFAR: c033fbfc SOFTE: 0
> GPR00: c033fb1c c1107bf0 c1108b00 c0076180
> GPR04: c1139600 0007f988 0080
> GPR08: c11cf5d8 04f0 c0076280
> GPR12: 28042822 cfd4
> GPR16: c0dc9198 c0dc91c8 006f
> GPR20: 0001 2000 014000c0
> GPR24: 0201 c007f901 80010400
> GPR28: 0001 0006 f1fe4000 c0f15958
> NIP [c033fd48] new_slab+0x318/0x710
> LR [c033fb1c] new_slab+0xec/0x710
> Call Trace:
> [c1107bf0] [c033fb1c] new_slab+0xec/0x710 (unreliable)
> [c1107cc0] [c0348cc0] __kmem_cache_create+0x270/0x800
> [c1107df0] [c0ece8b4] create_boot_cache+0xa0/0xe4
> [c1107e70] [c0ed30d0] kmem_cache_init+0x68/0x16c
> [c1107f00] [c0ea0b08] start_kernel+0x2a0/0x554
> [c1107f90] [c000ad70] start_here_common+0x1c/0x4ac
> Instruction dump:
> 57bd039c 79291f24 7fbd0074 7c68482a 7bbdd182 3bbd0005 6000 3d230001
> e95e0038 e9299a7a 3929009e 79291f24 <7f6a482a> e93b0080 7fa34800 409e036c
> ---[ end trace ]---
>
> Kernel panic - not syncing: Attempted to kill the idle task!
> Rebooting in 10 seconds..

-- 
Cheers,
Stephen Rothwell
Re: [PATCH V3 2/2] KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9
On Mon, 2017-05-29 at 20:12 +1000, Paul Mackerras wrote:
> This allows userspace (e.g. QEMU) to enable large decrementer mode for
> the guest when running on a POWER9 host, by setting the LPCR_LD bit in
> the guest LPCR value. With this, the guest exit code saves 64 bits of
> the guest DEC value on exit. Other places that use the guest DEC
> value check the LPCR_LD bit in the guest LPCR value, and if it is set,
> omit the 32-bit sign extension that would otherwise be done.
>
> This doesn't change the DEC emulation used by PR KVM because PR KVM
> is not supported on POWER9 yet.
>
> This is partly based on an earlier patch by Oliver O'Halloran.
>
> Signed-off-by: Paul Mackerras

Tested with a hacked up qemu and upstream guest/host (with these
patches).

Tested-by: Suraj Jitindar Singh

> ---
>  arch/powerpc/include/asm/kvm_host.h     |  2 +-
>  arch/powerpc/kvm/book3s_hv.c            |  6 ++++++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 29 +++++++++++++++++++++------
>  arch/powerpc/kvm/emulate.c              |  4 ++--
>  4 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 9c51ac4..3f879c8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -579,7 +579,7 @@ struct kvm_vcpu_arch {
> 	ulong mcsrr0;
> 	ulong mcsrr1;
> 	ulong mcsr;
> -	u32 dec;
> +	ulong dec;
>  #ifdef CONFIG_BOOKE
> 	u32 decar;
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 42b7a4f..9b2eb66 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1143,6 +1143,12 @@ static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr,
> 	mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
> 	if (cpu_has_feature(CPU_FTR_ARCH_207S))
> 		mask |= LPCR_AIL;
> +	/*
> +	 * On POWER9, allow userspace to enable large decrementer for the
> +	 * guest, whether or not the host has it enabled.
> +	 */
> +	if (cpu_has_feature(CPU_FTR_ARCH_300))
> +		mask |= LPCR_LD;
>
> 	/* Broken 32-bit version of LPCR must not clear top bits */
> 	if (preserve_top32)
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index e390b38..3c901b5 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -920,7 +920,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> 	mftb	r7
> 	subf	r3,r7,r8
> 	mtspr	SPRN_DEC,r3
> -	stw	r3,VCPU_DEC(r4)
> +	std	r3,VCPU_DEC(r4)
>
> 	ld	r5, VCPU_SPRG0(r4)
> 	ld	r6, VCPU_SPRG1(r4)
> @@ -1032,7 +1032,13 @@ kvmppc_cede_reentry:		/* r4 = vcpu, r13 = paca */
> 	li	r0, BOOK3S_INTERRUPT_EXTERNAL
> 	bne	cr1, 12f
> 	mfspr	r0, SPRN_DEC
> -	cmpwi	r0, 0
> +BEGIN_FTR_SECTION
> +	/* On POWER9 check whether the guest has large decrementer enabled */
> +	andis.	r8, r8, LPCR_LD@h
> +	bne	15f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> +	extsw	r0, r0
> +15:	cmpdi	r0, 0
> 	li	r0, BOOK3S_INTERRUPT_DECREMENTER
> 	bge	5f
>
> @@ -1459,12 +1465,18 @@ mc_cont:
> 	mtspr	SPRN_SPURR,r4
>
> 	/* Save DEC */
> +	ld	r3, HSTATE_KVM_VCORE(r13)
> 	mfspr	r5,SPRN_DEC
> 	mftb	r6
> +	/* On P9, if the guest has large decr enabled, don't sign extend */
> +BEGIN_FTR_SECTION
> +	ld	r4, VCORE_LPCR(r3)
> +	andis.	r4, r4, LPCR_LD@h
> +	bne	16f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> 	extsw	r5,r5
> -	add	r5,r5,r6
> +16:	add	r5,r5,r6
> 	/* r5 is a guest timebase value here, convert to host TB */
> -	ld	r3,HSTATE_KVM_VCORE(r13)
> 	ld	r4,VCORE_TB_OFFSET(r3)
> 	subf	r5,r4,r5
> 	std	r5,VCPU_DEC_EXPIRES(r9)
> @@ -2376,8 +2388,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
> 	mfspr	r3, SPRN_DEC
> 	mfspr	r4, SPRN_HDEC
> 	mftb	r5
> +BEGIN_FTR_SECTION
> +	/* On P9 check whether the guest has large decrementer mode enabled */
> +	ld	r6, HSTATE_KVM_VCORE(r13)
> +	ld	r6, VCORE_LPCR(r6)
> +	andis.	r6, r6, LPCR_LD@h
> +	bne	68f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> 	extsw	r3, r3
> -	EXTEND_HDEC(r4)
> +68:	EXTEND_HDEC(r4)
> 	cmpd	r3, r4
> 	ble	67f
> 	mtspr	SPRN_DEC, r4
> diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
> index c873ffe..4d8b4d6 100644
> --- a/arch/powerpc/kvm/emulate.c
> +++ b/arch/powerpc/kvm/emulate.c
> @@ -39,7 +39,7 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu)
> 	unsigned long dec_nsec;
> 	unsigned long long dec_time;
>
> -	pr_debug("mtDEC: %x\n", vcpu->arch.dec);
> +	pr_debug("mtDEC: %lx\n", vcpu->arch.dec);
> 	hrtimer_try_to_cancel(&vcpu->arch.dec_ti