Re: [PATCH] powerpc: ignore the pkey system calls for now
On 18/10/16 09:33, Stephen Rothwell wrote:
> Eliminates warning messages:
>
> :1316:2: warning: #warning syscall pkey_mprotect not implemented [-Wcpp]
> :1319:2: warning: #warning syscall pkey_alloc not implemented [-Wcpp]
> :1322:2: warning: #warning syscall pkey_free not implemented [-Wcpp]
>
> Hopefully we will remember to revert this commit if we ever implement
> them.
>
> Signed-off-by: Stephen Rothwell

That makes sense for now.

Acked-by: Balbir Singh
[PATCH] Setup per-cpu cpu<->node binding early
Michael Ellerman debugged an issue w.r.t. workqueue changes (see
https://lkml.org/lkml/2016/10/17/352) down to the fact that we don't set up
our per-cpu (cpu to node) binding early enough (in setup_per_cpu_areas,
like x86 does). This led to a problem with the workqueue changes, where the
cpus seen by for_each_node() in workqueue_init_early() were different from
their binding seen later in:

	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu)
		...
	}

In setup_arch()->initmem_init() we have access to the binding in
numa_cpu_lookup_table.

This patch implements Michael's suggestion of setting up the per-cpu node
binding inside setup_per_cpu_areas(). I did not remove the original setting
of these values from smp_prepare_cpus(). I've also not set up per-cpu mems
via set_cpu_numa_mem(), since zonelists are not yet built by the time we do
the per-cpu setup.

Reported-by: Michael Ellerman
Signed-off-by: Balbir Singh
---
 arch/powerpc/kernel/setup_64.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index c3e1290..842415a 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -625,6 +625,8 @@ void __init setup_per_cpu_areas(void)
 	for_each_possible_cpu(cpu) {
 		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
 		paca[cpu].data_offset = __per_cpu_offset[cpu];
+
+		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
 	}
 }
 #endif
-- 
2.5.5
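A minimal userspace model of the ordering problem the patch fixes (all names here are illustrative stand-ins, not the real kernel symbols): before the firmware-provided binding is published, every consumer of cpu_to_node() sees node 0, so code that runs early disagrees with code that runs late.

```c
#include <assert.h>

#define NR_CPUS 4

/* Node for each CPU as reported by firmware (illustrative values). */
static int numa_cpu_lookup_table[NR_CPUS] = { 0, 0, 1, 1 };

/* The generic binding everyone queries; zero-initialised, so before it
 * is populated every CPU appears to live on node 0. */
static int cpu_node_map[NR_CPUS];

static void set_cpu_numa_node(int cpu, int node)
{
	cpu_node_map[cpu] = node;
}

static int cpu_to_node(int cpu)
{
	return cpu_node_map[cpu];
}

/* Models setup_per_cpu_areas() with the fix applied: publish the
 * firmware-provided binding before any early consumer looks at it. */
static void setup_per_cpu_areas_model(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
}
```

Anything that calls cpu_to_node() before setup_per_cpu_areas_model() runs gets the stale node 0 answer, which is the disagreement workqueue_init_early() tripped over.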
[RFC PATCH 1/3] cxl: Split _CXL_LOOP_HCALL9 out into a separate macro
In a subsequent patch we want to change the type of retbuf between plpar_hcall() and plpar_hcall9(), but we can't because _CXL_LOOP_HCALL expects the same type for retbuf regardless of the hcall type. So duplicate the logic in a separate macro for plpar_hcall9(). Signed-off-by: Michael Ellerman --- drivers/misc/cxl/hcalls.c | 31 ++- 1 file changed, 30 insertions(+), 1 deletion(-) diff --git a/drivers/misc/cxl/hcalls.c b/drivers/misc/cxl/hcalls.c index d6d11f4056d7..01a8d917631c 100644 --- a/drivers/misc/cxl/hcalls.c +++ b/drivers/misc/cxl/hcalls.c @@ -77,8 +77,37 @@ msleep(delay); \ } \ } + +#define _CXL_LOOP_HCALL9(call, rc, retbuf, fn, ...)\ + { \ + unsigned int delay, total_delay = 0;\ + u64 token = 0; \ + \ + memset(retbuf, 0, sizeof(retbuf)); \ + while (1) { \ + rc = call(fn, retbuf, __VA_ARGS__, token); \ + token = retbuf[0]; \ + if (rc != H_BUSY && !H_IS_LONG_BUSY(rc))\ + break; \ + \ + if (rc == H_BUSY) \ + delay = 10; \ + else\ + delay = get_longbusy_msecs(rc); \ + \ + total_delay += delay; \ + if (total_delay > CXL_HCALL_TIMEOUT) { \ + WARN(1, "Warning: Giving up waiting for CXL hcall " \ +"%#x after %u msec\n", fn, total_delay); \ + rc = H_BUSY;\ + break; \ + } \ + msleep(delay); \ + } \ + } + #define CXL_H_WAIT_UNTIL_DONE(...) _CXL_LOOP_HCALL(plpar_hcall, __VA_ARGS__) -#define CXL_H9_WAIT_UNTIL_DONE(...) _CXL_LOOP_HCALL(plpar_hcall9, __VA_ARGS__) +#define CXL_H9_WAIT_UNTIL_DONE(...) _CXL_LOOP_HCALL9(plpar_hcall9, __VA_ARGS__) #define _PRINT_MSG(rc, format, ...)\ { \ -- 2.7.4
[RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals
We have now had two nasty stack corruption bugs caused by incorrect sizing of the return buffer for plpar_hcall()/plpar_hcall9(). To avoid any more such bugs, define a type which encodes the size of the return buffer, and change the argument of plpar_hcall() to be of that type, meaning the compiler will check for us that we passed the right size buffer. There isn't an easy way to do this incrementally, without introducing a new function name, eg. plpar_hcall_with_struct(), which is ugly as hell. So just do it in one tree-wide change. Signed-off-by: Michael Ellerman --- If anyone can think of a tricky way of doing this without requiring us to update all usages of retbuf[x], then let me know. arch/powerpc/include/asm/hvcall.h | 16 +- arch/powerpc/include/asm/plpar_wrappers.h | 44 +- arch/powerpc/kernel/rtas.c | 6 ++-- arch/powerpc/platforms/pseries/hvconsole.c | 10 +++--- arch/powerpc/platforms/pseries/lparcfg.c | 14 - arch/powerpc/platforms/pseries/rng.c | 6 ++-- arch/powerpc/platforms/pseries/suspend.c | 6 ++-- arch/powerpc/sysdev/xics/icp-hv.c | 6 ++-- drivers/char/hw_random/pseries-rng.c | 6 ++-- drivers/misc/cxl/hcalls.c | 50 +++--- drivers/net/ethernet/ibm/ibmveth.h | 6 ++-- drivers/net/ethernet/ibm/ibmvnic.c | 8 ++--- 12 files changed, 90 insertions(+), 88 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index 708edebcf147..b3a6c6ec6b6f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -318,32 +318,34 @@ */ long plpar_hcall_norets(unsigned long opcode, ...); +struct plpar_hcall_retvals +{ + unsigned long v[4]; +}; + /** * plpar_hcall: - Make a pseries hypervisor call * @opcode: The hypervisor call to make. * @retbuf: Buffer to store up to 4 return arguments in. * - * This call supports up to 6 arguments and 4 return arguments. Use - * PLPAR_HCALL_BUFSIZE to size the return argument buffer. + * This call supports up to 6 arguments and 4 return arguments. 
* * Used for all but the craziest of phyp interfaces (see plpar_hcall9) */ -#define PLPAR_HCALL_BUFSIZE 4 -long plpar_hcall(unsigned long opcode, unsigned long *retbuf, ...); +long plpar_hcall(unsigned long opcode, struct plpar_hcall_retvals *retvals, ...); /** * plpar_hcall_raw: - Make a hypervisor call without calculating hcall stats * @opcode: The hypervisor call to make. * @retbuf: Buffer to store up to 4 return arguments in. * - * This call supports up to 6 arguments and 4 return arguments. Use - * PLPAR_HCALL_BUFSIZE to size the return argument buffer. + * This call supports up to 6 arguments and 4 return arguments. * * Used when phyp interface needs to be called in real mode. Similar to * plpar_hcall, but plpar_hcall_raw works in real mode and does not * calculate hypervisor call statistics. */ -long plpar_hcall_raw(unsigned long opcode, unsigned long *retbuf, ...); +long plpar_hcall_raw(unsigned long opcode, struct plpar_hcall_retvals *retvals, ...); /** * plpar_hcall9: - Make a pseries hypervisor call with up to 9 return arguments diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h index 1b394247afc2..17885cd60fb9 100644 --- a/arch/powerpc/include/asm/plpar_wrappers.h +++ b/arch/powerpc/include/asm/plpar_wrappers.h @@ -131,12 +131,12 @@ static inline long plpar_pte_enter(unsigned long flags, unsigned long hpte_group, unsigned long hpte_v, unsigned long hpte_r, unsigned long *slot) { + struct plpar_hcall_retvals retvals; long rc; - unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; - rc = plpar_hcall(H_ENTER, retbuf, flags, hpte_group, hpte_v, hpte_r); + rc = plpar_hcall(H_ENTER, &retvals, flags, hpte_group, hpte_v, hpte_r); - *slot = retbuf[0]; + *slot = retvals.v[0]; return rc; } @@ -145,13 +145,13 @@ static inline long plpar_pte_remove(unsigned long flags, unsigned long ptex, unsigned long avpn, unsigned long *old_pteh_ret, unsigned long *old_ptel_ret) { + struct plpar_hcall_retvals retvals; long rc; - unsigned long 
retbuf[PLPAR_HCALL_BUFSIZE]; - rc = plpar_hcall(H_REMOVE, retbuf, flags, ptex, avpn); + rc = plpar_hcall(H_REMOVE, &retvals, flags, ptex, avpn); - *old_pteh_ret = retbuf[0]; - *old_ptel_ret = retbuf[1]; + *old_pteh_ret = retvals.v[0]; + *old_ptel_ret = retvals.v[1]; return rc; } @@ -161,13 +161,13 @@ static inline long plpar_pte_remove_raw(unsigned long flags, unsigned long ptex, unsigned long avpn, unsigned long *old_pteh_ret, unsigned long *old_ptel_ret) { + struct plpar_hcall_retvals retvals; long rc; - unsigned long retbuf[PLPAR_
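The payoff of the struct-typed return buffer proposed in this series can be seen in a small standalone C sketch (hypothetical names, not the kernel code): an array function parameter decays to a plain pointer, so a callee can't recover the buffer size, whereas a struct pointer carries it, and passing a pointer to the wrong struct type becomes a compile-time error.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 4-entry return buffer type, mirroring the idea of the
 * patch's struct plpar_hcall_retvals. */
struct hcall_retvals {
	unsigned long v[4];
};

/* An array parameter silently decays to a pointer: inside the callee,
 * sizeof(retbuf) is the size of an unsigned long *, so this memset
 * clears only the first few bytes of the caller's buffer. */
static size_t clear_via_array(unsigned long retbuf[4])
{
	memset(retbuf, 0, sizeof(retbuf));
	return sizeof(retbuf);
}

/* With a struct pointer, sizeof(*r) is always the full buffer size. */
static size_t clear_via_struct(struct hcall_retvals *r)
{
	memset(r, 0, sizeof(*r));
	return sizeof(*r);
}
```

The same decay is why a caller-side buffer sized with the wrong constant went unnoticed: the prototype only saw `unsigned long *`, never the length.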
[RFC PATCH 3/3] powerpc/pseries: Define & use a type for the plpar_hcall9() retvals
We have now had two nasty stack corruption bugs caused by incorrect sizing of the return buffer for plpar_hcall()/plpar_hcall9(). To avoid any more such bugs, define a type which encodes the size of the return buffer, and change the argument of plpar_hcall9() to be of that type, meaning the compiler will check for us that we passed the right size buffer. There isn't an easy way to do this incrementally, without introducing a new function name, eg. plpar_hcall9_with_struct(), which is ugly as hell. So just do it in one tree-wide change. Signed-off-by: Michael Ellerman --- arch/powerpc/include/asm/hvcall.h | 15 ++-- arch/powerpc/include/asm/plpar_wrappers.h | 12 +-- arch/powerpc/mm/numa.c | 6 +- arch/powerpc/platforms/pseries/lpar.c | 46 +- arch/powerpc/platforms/pseries/lparcfg.c| 28 +++ arch/powerpc/platforms/pseries/pseries_energy.c | 12 +-- drivers/misc/cxl/hcalls.c | 32 +++ drivers/net/ethernet/ibm/ehea/ehea_phyp.c | 107 drivers/net/ethernet/ibm/ibmveth.h | 8 +- 9 files changed, 136 insertions(+), 130 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index b3a6c6ec6b6f..c2728ab84b4f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -347,17 +347,20 @@ long plpar_hcall(unsigned long opcode, struct plpar_hcall_retvals *retvals, ...) */ long plpar_hcall_raw(unsigned long opcode, struct plpar_hcall_retvals *retvals, ...); +struct plpar_hcall9_retvals +{ + unsigned long v[9]; +}; + /** * plpar_hcall9: - Make a pseries hypervisor call with up to 9 return arguments * @opcode: The hypervisor call to make. - * @retbuf: Buffer to store up to 9 return arguments in. + * @retvals: Buffer to store up to 9 return arguments in. * - * This call supports up to 9 arguments and 9 return arguments. Use - * PLPAR_HCALL9_BUFSIZE to size the return argument buffer. + * This call supports up to 9 arguments and 9 return arguments. 
*/ -#define PLPAR_HCALL9_BUFSIZE 9 -long plpar_hcall9(unsigned long opcode, unsigned long *retbuf, ...); -long plpar_hcall9_raw(unsigned long opcode, unsigned long *retbuf, ...); +long plpar_hcall9(unsigned long opcode, struct plpar_hcall9_retvals *retvals, ...); +long plpar_hcall9_raw(unsigned long opcode, struct plpar_hcall9_retvals *retvals, ...); /* For hcall instrumentation. One structure per-hcall, per-CPU */ struct hcall_stats { diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h index 17885cd60fb9..865bc9a726e4 100644 --- a/arch/powerpc/include/asm/plpar_wrappers.h +++ b/arch/powerpc/include/asm/plpar_wrappers.h @@ -209,11 +209,11 @@ static inline long plpar_pte_read_4(unsigned long flags, unsigned long ptex, { long rc; - unsigned long retbuf[PLPAR_HCALL9_BUFSIZE]; + struct plpar_hcall9_retvals retvals; - rc = plpar_hcall9(H_READ, retbuf, flags | H_READ_4, ptex); + rc = plpar_hcall9(H_READ, &retvals, flags | H_READ_4, ptex); - memcpy(ptes, retbuf, 8*sizeof(unsigned long)); + memcpy(ptes, &retvals.v, 8*sizeof(unsigned long)); return rc; } @@ -227,11 +227,11 @@ static inline long plpar_pte_read_4_raw(unsigned long flags, unsigned long ptex, { long rc; - unsigned long retbuf[PLPAR_HCALL9_BUFSIZE]; + struct plpar_hcall9_retvals retvals; - rc = plpar_hcall9_raw(H_READ, retbuf, flags | H_READ_4, ptex); + rc = plpar_hcall9_raw(H_READ, &retvals, flags | H_READ_4, ptex); - memcpy(ptes, retbuf, 8*sizeof(unsigned long)); + memcpy(ptes, &retvals.v, 8*sizeof(unsigned long)); return rc; } diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 75b9cd6150cc..aeaf1b1cb893 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1277,12 +1277,12 @@ static int update_cpu_associativity_changes_mask(void) static long hcall_vphn(unsigned long cpu, __be32 *associativity) { long rc; - long retbuf[PLPAR_HCALL9_BUFSIZE] = {0}; + struct plpar_hcall9_retvals retvals = { 0 }; u64 flags = 1; int hwcpu = 
get_hard_smp_processor_id(cpu); - rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, hwcpu); - vphn_unpack_associativity(retbuf, associativity); + rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, &retvals, flags, hwcpu); + vphn_unpack_associativity(&retvals.v[0], associativity); return rc; } diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index 86707e67843f..d191df0c1535 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -394,6 +394,7 @@ static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot, int psize, int ssize) { u
[PATCH] powerpc: Fix numa topology console print
With a recent update to printk, we get console output like below:

[0.550639] Brought up 160 CPUs
[0.550718] Node 0 CPUs:
[0.550721]  0
[0.550754] -39
[0.550794] Node 1 CPUs:
[0.550798]  40
[0.550817] -79
[0.550856] Node 16 CPUs:
[0.550860]  80
[0.550880] -119
[0.550917] Node 17 CPUs:
[0.550923]  120
[0.550942] -159

Fix this by using the proper printk arguments (pr_info()/pr_cont()).

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/numa.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 75b9cd6150cc..c8fd54ec40e4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -845,7 +845,7 @@ void __init dump_numa_cpu_topology(void)
 		return;
 
 	for_each_online_node(node) {
-		printk(KERN_DEBUG "Node %d CPUs:", node);
+		pr_info("Node %d CPUs:", node);
 
 		count = 0;
 		/*
@@ -856,18 +856,18 @@ void __init dump_numa_cpu_topology(void)
 			if (cpumask_test_cpu(cpu,
 					node_to_cpumask_map[node])) {
 				if (count == 0)
-					printk(" %u", cpu);
+					pr_cont(" %u", cpu);
 				++count;
 			} else {
 				if (count > 1)
-					printk("-%u", cpu - 1);
+					pr_cont("-%u", cpu - 1);
 				count = 0;
 			}
 		}
 
 		if (count > 1)
-			printk("-%u", nr_cpu_ids - 1);
-		printk("\n");
+			pr_cont("-%u", nr_cpu_ids - 1);
+		pr_cont("\n");
 	}
 }
 
@@ -882,7 +882,7 @@ static void __init dump_numa_memory_topology(void)
 	for_each_online_node(node) {
 		unsigned long i;
 
-		printk(KERN_DEBUG "Node %d Memory:", node);
+		pr_info("Node %d Memory:", node);
 
 		count = 0;
 
@@ -890,18 +890,18 @@ static void __init dump_numa_memory_topology(void)
 				i += (1 << SECTION_SIZE_BITS)) {
 			if (early_pfn_to_nid(i >> PAGE_SHIFT) == node) {
 				if (count == 0)
-					printk(" 0x%lx", i);
+					pr_cont(" 0x%lx", i);
 				++count;
 			} else {
 				if (count > 0)
-					printk("-0x%lx", i);
+					pr_cont("-0x%lx", i);
 				count = 0;
 			}
 		}
 
 		if (count > 0)
-			printk("-0x%lx", i);
-		printk("\n");
+			pr_cont("-0x%lx", i);
+		pr_cont("\n");
 	}
 }
-- 
2.10.1
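A simplified userspace model of why the output fragmented: each unmarked log call starts a new record (new line, new timestamp), while a continuation call appends to the record being built. This is a sketch of the semantics only, not the kernel's ring-buffer implementation.

```c
#include <assert.h>
#include <string.h>

static char log_buf[512];

/* Models pr_info()/printk without KERN_CONT: every call starts a new
 * record, i.e. a new line with its own timestamp. */
static void log_record(const char *s)
{
	strcat(log_buf, "\n[ts] ");
	strcat(log_buf, s);
}

/* Models pr_cont(): append to the record currently being built. */
static void log_cont(const char *s)
{
	strcat(log_buf, s);
}
```

Printing "Node 0 CPUs:", " 0", "-39" as three unmarked calls yields three timestamped lines; marking the continuations keeps them on one line, which is what the patch restores.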
Re: [PATCH v2] console: Don't prefer first registered if DT specifies stdout-path
On Monday, 17 October 2016 19:39:57 BST Andreas Schwab wrote:
> On Okt 17 2016, Paul Burton wrote:
> > Could you share the device tree from your system?
>
> This is the contents of chosen/linux,stdout-path on the systems I have:
>
> chosen/linux,stdout-path
> 	"/pci@f000/ATY,SnowyParent@10/ATY,Snowy_A@0"
>
> chosen/linux,stdout-path
> 	"/pci@0,f000/NVDA,Parent@10/NVDA,Display-B@1"
>
> Is that what you need? There is also chosen/stdout, but no
> aliases/stdout.
>
> Andreas.

Hi Andreas,

I think I see the problem & I'm hoping this patch will fix it:
https://lkml.org/lkml/2016/10/18/142

Could you give it a try & let me know?

Thanks,
Paul
[PATCH] console: use first console if stdout-path device doesn't appear
If a device tree specified a preferred device for kernel console output via the stdout-path or linux,stdout-path chosen node properties there's no guarantee that it will have specified a device for which we have a driver. It may also be the case that we do have a driver but it doesn't call of_console_check() to register as a preferred console (eg. offb driver as used on powermac systems). In these cases try to ensure that we provide some console output by enabling the first console in the console_drivers list. As I don't have access to an affected system this has only been build tested - testing would be most appreciated. Signed-off-by: Paul Burton Fixes: 05fd007e4629 ("console: don't prefer first registered if DT specifies stdout-path") Reported-by: Andreas Schwab Cc: Andreas Schwab Cc: Andrew Morton Cc: Petr Mladek Cc: Sergey Senozhatsky Cc: Borislav Petkov Cc: Tejun Heo Cc: linux-ker...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org --- A potential alternative to this might be to have the affected offb driver call of_check_console(), and perhaps that should happen anyway, but doing so seems non-trivial since the offb driver doesn't know the index of the framebuffer console device it may be about to register & the fbdev core doesn't know the associated device tree node. This also wouldn't catch the case of us not having a driver for the device specified by stdout-path, so this fallback seems worthwhile anyway. --- kernel/printk/printk.c | 37 - 1 file changed, 36 insertions(+), 1 deletion(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index d5e3973..7091e2f 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -2835,10 +2835,45 @@ EXPORT_SYMBOL(unregister_console); * intersects with the init section. Note that code exists elsewhere to get * rid of the boot console as soon as the proper console shows up, so there * won't be side-effects from postponing the removal. 
+ * + * Additionally we may be using a device tree which specifies valid + * stdout-path referencing a device for which we don't have a driver, or for + * which we have a driver that doesn't register itself as preferred console + * using of_console_check(). In these cases we attempt here to enable the + * first registered console. */ static int __init printk_late_init(void) { - struct console *con; + struct console *con, *enabled; + + if (of_specified_console) { + console_lock(); + + /* Find the enabled console, if there is one */ + enabled = NULL; + for_each_console(con) { + if (!(con->flags & CON_ENABLED)) + continue; + + enabled = con; + break; + } + + /* Enable the first console if none were already enabled */ + con = console_drivers; + if (!enabled && con) { + if (con->index < 0) + con->index = 0; + if (con->setup == NULL || + con->setup(con, NULL) == 0) { + con->flags |= CON_ENABLED; + if (con->device) + con->flags |= CON_CONSDEV; + } + } + + console_unlock(); + } for_each_console(con) { if (!keep_bootcon && con->flags & CON_BOOT) { -- 2.10.0
Re: [v12, 0/8] Fix eSDHC host version register bug
On 21 September 2016 at 08:57, Yangbo Lu wrote: > This patchset is used to fix a host version register bug in the > T4240-R1.0-R2.0 > eSDHC controller. To match the SoC version and revision, 10 previous version > patchsets had tried many methods but all of them were rejected by reviewers. > Such as > - dts compatible method > - syscon method > - ifdef PPC method > - GUTS driver getting SVR method > Anrd suggested a soc_device_match method in v10, and this is the only > available > method left now. This v11 patchset introduces the soc_device_match interface > in > soc driver. > > The first six patches of Yangbo are to add the GUTS driver. This is used to > register a soc device which contain soc version and revision information. > The other two patches introduce the soc_device_match method in soc driver > and apply it on esdhc driver to fix this bug. > > Arnd Bergmann (1): > base: soc: introduce soc_device_match() interface > > Yangbo Lu (7): > dt: bindings: update Freescale DCFG compatible > ARM64: dts: ls2080a: add device configuration node > dt: bindings: move guts devicetree doc out of powerpc directory > powerpc/fsl: move mpc85xx.h to include/linux/fsl > soc: fsl: add GUTS driver for QorIQ platforms > MAINTAINERS: add entry for Freescale SoC drivers > mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0 > > Documentation/devicetree/bindings/arm/fsl.txt | 6 +- > .../bindings/{powerpc => soc}/fsl/guts.txt | 3 + > MAINTAINERS| 11 +- > arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi | 6 + > arch/powerpc/kernel/cpu_setup_fsl_booke.S | 2 +- > arch/powerpc/sysdev/fsl_pci.c | 2 +- > drivers/base/Kconfig | 1 + > drivers/base/soc.c | 66 ++ > drivers/clk/clk-qoriq.c| 3 +- > drivers/i2c/busses/i2c-mpc.c | 2 +- > drivers/iommu/fsl_pamu.c | 3 +- > drivers/mmc/host/Kconfig | 1 + > drivers/mmc/host/sdhci-of-esdhc.c | 20 ++ > drivers/net/ethernet/freescale/gianfar.c | 2 +- > drivers/soc/Kconfig| 2 +- > drivers/soc/fsl/Kconfig| 19 ++ > drivers/soc/fsl/Makefile | 1 + > 
drivers/soc/fsl/guts.c | 257 > + > include/linux/fsl/guts.h | 125 ++ > .../asm/mpc85xx.h => include/linux/fsl/svr.h | 4 +- > include/linux/sys_soc.h| 3 + > 21 files changed, 478 insertions(+), 61 deletions(-) > rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%) > create mode 100644 drivers/soc/fsl/Kconfig > create mode 100644 drivers/soc/fsl/guts.c > rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%) > > -- > 2.1.0.27.g96db324 > This looks good to me! I am not sure which tree you want this to be picked up through, but unless no other volunteers I can take it through my mmc tree. Although, before considering to apply, I need an ack from Scott/Arnd for the guts driver in patch 5/8 and I need an ack from Greg for patch 7/8, where the soc_device_match() interface is added (seems like you didn't add him on cc/to). Kind regards Uffe
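For context on the interface this series introduces, here is a loose userspace model of the soc_device_match() idea: a driver supplies a terminated table of SoC attributes and gets back the first entry the running SoC satisfies. The struct and function names echo the kernel's, but the fields shown and the plain strcmp matching are illustrative simplifications, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct soc_device_attribute {
	const char *family;
	const char *revision;
};

/* What the SoC bus layer registered for the running system. */
static const struct soc_device_attribute *running_soc;

/* Walk the table until the all-NULL terminator; NULL fields in an
 * entry act as wildcards. Returns the first matching entry, or NULL. */
static const struct soc_device_attribute *
soc_device_match(const struct soc_device_attribute *matches)
{
	for (; matches->family || matches->revision; matches++) {
		if (matches->family &&
		    strcmp(matches->family, running_soc->family))
			continue;
		if (matches->revision &&
		    strcmp(matches->revision, running_soc->revision))
			continue;
		return matches;
	}
	return NULL;
}
```

The esdhc fix then reduces to: declare a table listing the affected T4240 revisions, and apply the quirk only when soc_device_match() returns non-NULL.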
Re: [PATCH 01/10] mm: remove write/force parameters from __get_user_pages_locked()
On Thu 13-10-16 01:20:11, Lorenzo Stoakes wrote: > This patch removes the write and force parameters from > __get_user_pages_locked() > to make the use of FOLL_FORCE explicit in callers as use of this flag can > result > in surprising behaviour (and hence bugs) within the mm subsystem. > > Signed-off-by: Lorenzo Stoakes Looks good. You can add: Reviewed-by: Jan Kara Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 02/10] mm: remove write/force parameters from __get_user_pages_unlocked()
On Thu 13-10-16 01:20:12, Lorenzo Stoakes wrote: > This patch removes the write and force parameters from > __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in callers > as > use of this flag can result in surprising behaviour (and hence bugs) within > the > mm subsystem. > > Signed-off-by: Lorenzo Stoakes The patch looks good. You can add: Reviewed-by: Jan Kara Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 03/10] mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
On Thu 13-10-16 01:20:13, Lorenzo Stoakes wrote: > This patch removes the write and force parameters from > get_user_pages_unlocked() > and replaces them with a gup_flags parameter to make the use of FOLL_FORCE > explicit in callers as use of this flag can result in surprising behaviour > (and > hence bugs) within the mm subsystem. > > Signed-off-by: Lorenzo Stoakes Looks good. You can add: Reviewed-by: Jan Kara Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags
On Thu 13-10-16 01:20:14, Lorenzo Stoakes wrote: > This patch removes the write and force parameters from get_user_pages_locked() > and replaces them with a gup_flags parameter to make the use of FOLL_FORCE > explicit in callers as use of this flag can result in surprising behaviour > (and > hence bugs) within the mm subsystem. > > Signed-off-by: Lorenzo Stoakes > --- > include/linux/mm.h | 2 +- > mm/frame_vector.c | 8 +++- > mm/gup.c | 12 +++- > mm/nommu.c | 5 - > 4 files changed, 15 insertions(+), 12 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 6adc4bc..27ab538 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1282,7 +1282,7 @@ long get_user_pages(unsigned long start, unsigned long > nr_pages, > int write, int force, struct page **pages, > struct vm_area_struct **vmas); > long get_user_pages_locked(unsigned long start, unsigned long nr_pages, > - int write, int force, struct page **pages, int *locked); > + unsigned int gup_flags, struct page **pages, int *locked); Hum, the prototype is inconsistent with e.g. __get_user_pages_unlocked() where gup_flags come after **pages argument. Actually it makes more sense to have it before **pages so that input arguments come first and output arguments second but I don't care that much. But it definitely should be consistent... Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags
On Tue, Oct 18, 2016 at 02:54:25PM +0200, Jan Kara wrote:
> > @@ -1282,7 +1282,7 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
> > 			    int write, int force, struct page **pages,
> > 			    struct vm_area_struct **vmas);
> > long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> > -		    int write, int force, struct page **pages, int *locked);
> > +		    unsigned int gup_flags, struct page **pages, int *locked);
>
> Hum, the prototype is inconsistent with e.g. __get_user_pages_unlocked()
> where gup_flags come after **pages argument. Actually it makes more sense
> to have it before **pages so that input arguments come first and output
> arguments second but I don't care that much. But it definitely should be
> consistent...

It was difficult to decide quite how to arrange parameters as there was
inconsistency with regard to parameter ordering already - for example
__get_user_pages() places its flags argument before pages whereas, as you
note, __get_user_pages_unlocked() puts them afterwards.

I ended up compromising by trying to match the existing ordering of each
function as much as I could, replacing write, force pairs with gup_flags in
the same location (with the exception of get_user_pages_unlocked(), which I
felt should match __get_user_pages_unlocked() in signature); where there
was already a gup_flags parameter, as in the case of
__get_user_pages_unlocked(), I simply removed the write, force pair and
left the flags as the last parameter.

I am happy to rearrange parameters as needed, however I am not sure it'd
be worthwhile to do so (I am keen to avoid adding too much noise here :)

If we were to rearrange parameters for consistency, I'd suggest adjusting
__get_user_pages_unlocked() to put gup_flags before pages and doing the
same with get_user_pages_unlocked(). Let me know what you think.
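As a standalone illustration of the conversion this series applies at each call site: the old positional (write, force) int pair becomes one self-describing flags word, so a reader of `lookup(start, n, FOLL_WRITE | FOLL_FORCE, pages)` no longer has to decode what `1, 1` meant. The flag values below mirror that era's FOLL_* constants but should be treated as illustrative here.

```c
#include <assert.h>

#define FOLL_WRITE 0x01
#define FOLL_FORCE 0x10

/* The mechanical translation performed in each patched caller:
 * two anonymous booleans in, one explicit bitmask out. */
static unsigned int gup_flags_from_bools(int write, int force)
{
	unsigned int gup_flags = 0;

	if (write)
		gup_flags |= FOLL_WRITE;
	if (force)
		gup_flags |= FOLL_FORCE;
	return gup_flags;
}
```

Besides readability, this makes FOLL_FORCE trivially greppable, which is the stated goal of the series.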
Re: [RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals
On 18/10/16 19:40, Michael Ellerman wrote:
> We have now had two nasty stack corruption bugs caused by incorrect
> sizing of the return buffer for plpar_hcall()/plpar_hcall9().
>
> To avoid any more such bugs, define a type which encodes the size of the
> return buffer, and change the argument of plpar_hcall() to be of that
> type, meaning the compiler will check for us that we passed the right
> size buffer.
>
> There isn't an easy way to do this incrementally, without introducing a
> new function name, eg. plpar_hcall_with_struct(), which is ugly as hell.
> So just do it in one tree-wide change.

Conceptually looks good, but I think we need to abstract the return values
as well. I'll test and see if I can send you something on top of this.

Balbir
Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags
On Thu 13-10-16 01:20:10, Lorenzo Stoakes wrote: > This patch series adjusts functions in the get_user_pages* family such that > desired FOLL_* flags are passed as an argument rather than implied by flags. > > The purpose of this change is to make the use of FOLL_FORCE explicit so it is > easier to grep for and clearer to callers that this flag is being used. The > use > of FOLL_FORCE is an issue as it overrides missing VM_READ/VM_WRITE flags for > the > VMA whose pages we are reading from/writing to, which can result in surprising > behaviour. > > The patch series came out of the discussion around commit 38e0885, which > addressed a BUG_ON() being triggered when a page was faulted in with PROT_NONE > set but having been overridden by FOLL_FORCE. do_numa_page() was run on the > assumption the page _must_ be one marked for NUMA node migration as an actual > PROT_NONE page would have been dealt with prior to this code path, however > FOLL_FORCE introduced a situation where this assumption did not hold. > > See https://marc.info/?l=linux-mm&m=147585445805166 for the patch proposal. I like this cleanup. Tracking FOLL_FORCE users was always a nightmare and the flag behavior is really subtle so we should better be explicit about it. I haven't gone through each patch separately but rather applied the whole series and checked the resulting diff. This all seems OK to me and feel free to add Acked-by: Michal Hocko I am wondering whether we can go further. E.g. it is not really clear to me whether we need an explicit FOLL_REMOTE when we can in fact check mm != current->mm and imply that. Maybe there are some contexts which wouldn't work, I haven't checked. Then I am also wondering about FOLL_TOUCH behavior. __get_user_pages_unlocked has only few callers which used to be get_user_pages_unlocked before 1e9877902dc7e ("mm/gup: Introduce get_user_pages_remote()"). To me a dropped FOLL_TOUCH seems unintentional. 
Now that get_user_pages_unlocked() has a gup_flags argument, I guess we
might want to get rid of the __g-u-p-u version altogether, no?

__get_user_pages() is quite low level and imho shouldn't be exported. Its
only user - kvm - should rather pull those two functions into gup instead
and export them. There is nothing really KVM specific in them.

I also cannot say I would be entirely thrilled about
get_user_pages_locked(); we only have one user, which can simply do lock,
g-u-p, unlock AFAICS.

I guess there is more work in that area and I do not want to impose all
that work on you, but I couldn't resist once I saw you playing in that
area ;) Definitely a good start!
-- 
Michal Hocko
SUSE Labs
[PATCH] mm/hugetlb: Use the right pte val for compare in hugetlb_cow
We cannot use the pte value passed to set_pte_at() for the pte_same()
comparison, because archs like ppc64 filter/add new pte flags in
set_pte_at(). Instead, fetch the pte value inside hugetlb_cow(). We are
comparing the pte value to make sure the pte didn't change since we
dropped the page table lock. hugetlb_cow() gets called with the page table
lock held, so we can take a copy of the pte value before we drop the lock.

With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no
previous mapping (huge_pte_none entries) by forcing a cow in the fault
path. This avoids taking an additional fault to convert a read-only
mapping to read/write. Here we were comparing a recently instantiated pte
(via set_pte_at) to the pte values from the linux page table. As explained
above, on ppc64 such a pte_same check returned the wrong result, resulting
in us taking an additional fault on ppc64.

Fixes: 6a119eae942c ("powerpc/mm: Add a _PAGE_PTE bit")
Reported-by: Jan Stancek
Signed-off-by: Aneesh Kumar K.V
---
 mm/hugetlb.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec49d9ef1eef..da8fbd02b92e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3386,15 +3386,17 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, pte_t pte,
-			struct page *pagecache_page, spinlock_t *ptl)
+			unsigned long address, pte_t *ptep,
+			struct page *pagecache_page, spinlock_t *ptl)
 {
+	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
+	pte = huge_ptep_get(ptep);
 	old_page = pte_page(pte);
 
 retry_avoidcopy:
@@ -3668,7 +3670,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
+		ret = hugetlb_cow(mm, vma, address, ptep, page, ptl);
 	}
 
 	spin_unlock(ptl);
@@ -3822,8 +3824,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!huge_pte_write(entry)) {
-			ret = hugetlb_cow(mm, vma, address, ptep, entry,
-						pagecache_page, ptl);
+			ret = hugetlb_cow(mm, vma, address, ptep,
+						pagecache_page, ptl);
 			goto out_put_page;
 		}
 		entry = huge_pte_mkdirty(entry);
-- 
2.10.1
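The heart of the fix, as a standalone userspace sketch: a value remembered from before set_pte_at() can never compare equal to the page table contents if the arch adds flags on the way in, so the comparison snapshot must come from re-reading the slot under the lock. MODEL_PAGE_PTE and the *_model names are illustrative stand-ins, not the real kernel interfaces.

```c
#include <assert.h>

typedef unsigned long pte_t;

/* Illustrative stand-in for a ppc64-style flag that set_pte_at()
 * adds behind the caller's back (the real bit is _PAGE_PTE). */
#define MODEL_PAGE_PTE 0x1UL

static pte_t page_table_slot;

/* Models set_pte_at(): the arch may filter/add flags on the way in. */
static void set_pte_at_model(pte_t val)
{
	page_table_slot = val | MODEL_PAGE_PTE;
}

/* Models huge_ptep_get(): read back what is really in the page table. */
static pte_t huge_ptep_get_model(void)
{
	return page_table_slot;
}

static int pte_same(pte_t a, pte_t b)
{
	return a == b;
}
```

The old code compared the pre-set_pte_at() value (spurious mismatch, hence the extra fault on ppc64); the patched code snapshots via huge_ptep_get() while holding the lock, so the later pte_same() check only fails if the pte genuinely changed.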
[PATCH v6 0/3] powerpc/pseries: Implement indexed-count memory hotplug
Indexed-count memory management allows addition and removal of contiguous lmb blocks with a single command. When compared to the series of calls previously required to manage contiguous blocks, indexed-count decreases command frequency and reduces risk of buffer overflow. -Nathan --- Changes in v2: -- -[PATCH 1/2]: -remove potential memory leak when parsing command -use u32s drc_index and count instead of u32 ic[] in dlpar_memory -[PATCH 2/2]: -use u32s drc_index and count instead of u32 ic[] in dlpar_memory Changes in v3: -- -[PATCH 1/2]: -add logic to handle invalid drc_index input -update indexed-count trigger to follow naming convention -update dlpar_memory to follow kernel if-else style -[PATCH 2/2]: -add logic to handle invalid drc_index input Changes in v4: -- -[PATCH 1/2]: -add logic to handle kstrdup failure -add logic to handle invalid command format Changes in v5: -- -[PATCH 2/2]: -update for() loop to use start_index Changes in v6: - -[PATCH 1/3]: -Added new patch 1 that corrects reading past the end of the sysfs buffer when parsing sysfs input. -[PATCH 2/3]: -Updated sysfs buffer parsing Nathan Fontenot (1): powerpc/pseries: Correct possible read beyond dlpar sysfs buffer Sahil Mehta (2): powerpc/pseries: Implement indexed-count hotplug memory add powerpc/pseries: Implement indexed-count hotplug memory remove arch/powerpc/include/asm/rtas.h |2 arch/powerpc/platforms/pseries/dlpar.c | 177 +++- arch/powerpc/platforms/pseries/hotplug-memory.c | 200 ++- 3 files changed, 323 insertions(+), 56 deletions(-)
[PATCH v6 1/3] powerpc/pseries: Correct possible read beyond dlpar sysfs buffer
The parsing of data written to the dlpar file in sysfs does not correctly account for the possibility of reading past the end of the buffer. Correct this by updating the buffer parsing code to make a local copy and use the strsep() and sysfs_streq() routines to parse the buffer. This also separates the parsing code into subroutines to make it cleaner. Signed-off-by: Nathan Fontenot --- arch/powerpc/platforms/pseries/dlpar.c | 141 ++-- 1 file changed, 96 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 423e450..ec5d677 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -418,84 +418,135 @@ void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog, } } -static ssize_t dlpar_store(struct class *class, struct class_attribute *attr, - const char *buf, size_t count) +static int dlpar_parse_resource(char **cmd, struct pseries_hp_errorlog *hp_elog) { - struct pseries_hp_errorlog *hp_elog; - struct completion hotplug_done; - const char *arg; - int rc; + char *arg; - hp_elog = kzalloc(sizeof(*hp_elog), GFP_KERNEL); - if (!hp_elog) { - rc = -ENOMEM; - goto dlpar_store_out; - } + arg = strsep(cmd, " "); + if (!arg) + return -EINVAL; - /* Parse out the request from the user, this will be in the form -* -*/ - arg = buf; - if (!strncmp(arg, "memory", 6)) { + if (sysfs_streq(arg, "memory")) { hp_elog->resource = PSERIES_HP_ELOG_RESOURCE_MEM; - arg += strlen("memory "); - } else if (!strncmp(arg, "cpu", 3)) { + } else if (sysfs_streq(arg, "cpu")) { hp_elog->resource = PSERIES_HP_ELOG_RESOURCE_CPU; - arg += strlen("cpu "); } else { - pr_err("Invalid resource specified: \"%s\"\n", buf); - rc = -EINVAL; - goto dlpar_store_out; + pr_err("Invalid resource specified.\n"); + return -EINVAL; } - if (!strncmp(arg, "add", 3)) { + return 0; +} + +static int dlpar_parse_action(char **cmd, struct pseries_hp_errorlog *hp_elog) +{ + char *arg; + + arg = strsep(cmd, " 
"); + if (!arg) + return -EINVAL; + + if (sysfs_streq(arg, "add")) { hp_elog->action = PSERIES_HP_ELOG_ACTION_ADD; - arg += strlen("add "); - } else if (!strncmp(arg, "remove", 6)) { + } else if (sysfs_streq(arg, "remove")) { hp_elog->action = PSERIES_HP_ELOG_ACTION_REMOVE; - arg += strlen("remove "); } else { - pr_err("Invalid action specified: \"%s\"\n", buf); - rc = -EINVAL; - goto dlpar_store_out; + pr_err("Invalid action specified.\n"); + return -EINVAL; } - if (!strncmp(arg, "index", 5)) { - u32 index; + return 0; +} +static int dlpar_parse_id_type(char **cmd, struct pseries_hp_errorlog *hp_elog) +{ + char *arg; + u32 count, index; + + arg = strsep(cmd, " "); + if (!arg) + return -EINVAL; + + if (sysfs_streq(arg, "index")) { hp_elog->id_type = PSERIES_HP_ELOG_ID_DRC_INDEX; - arg += strlen("index "); + arg = strsep(cmd, " "); + if (!arg) { + pr_err("No DRC Index specified.\n"); + return -EINVAL; + } + if (kstrtou32(arg, 0, &index)) { - rc = -EINVAL; - pr_err("Invalid drc_index specified: \"%s\"\n", buf); - goto dlpar_store_out; + pr_err("Invalid DRC Index specified.\n"); + return -EINVAL; } hp_elog->_drc_u.drc_index = cpu_to_be32(index); - } else if (!strncmp(arg, "count", 5)) { - u32 count; - + } else if (sysfs_streq(arg, "count")) { hp_elog->id_type = PSERIES_HP_ELOG_ID_DRC_COUNT; - arg += strlen("count "); + arg = strsep(cmd, " "); + if (!arg) { + pr_err("No DRC count specified.\n"); + return -EINVAL; + } + if (kstrtou32(arg, 0, &count)) { - rc = -EINVAL; - pr_err("Invalid count specified: \"%s\"\n", buf); - goto dlpar_store_out; + pr_err("Invalid DRC count specified.\n"); + return -EINVAL; } hp_elog->_drc_u.drc_count = cpu_to_be32(count); } else { - pr_err("Invalid id_type specified: \"%s\"\n",
[PATCH v6 2/3] powerpc/pseries: Implement indexed-count hotplug memory add
From: Sahil Mehta Indexed-count add for memory hotplug guarantees that a contiguous block of <count> lmbs beginning at a specified <drc index> will be assigned (NOT that <count> lmbs will be added). Because of Qemu's per-DIMM memory management, the addition of a contiguous block of memory currently requires a series of individual calls. Indexed-count add reduces this series into a single call. Signed-off-by: Sahil Mehta Signed-off-by: Nathan Fontenot --- v2: -remove potential memory leak when parsing command -use u32s drc_index and count instead of u32 ic[] in dlpar_memory v3: -add logic to handle invalid drc_index input -update indexed-count trigger to follow naming convention -update dlpar_memory to follow kernel if-else style v4: -add logic to handle kstrdup failure -add logic to handle invalid command format v5: -none v6: -Corrected the sysfs buffer parsing to align with the updated parsing code in the new patch 1 of the series. arch/powerpc/include/asm/rtas.h | 2 arch/powerpc/platforms/pseries/dlpar.c | 38 +++- arch/powerpc/platforms/pseries/hotplug-memory.c | 110 +-- 3 files changed, 138 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 9c23baa..5d65de7 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -307,6 +307,7 @@ struct pseries_hp_errorlog { union { __be32 drc_index; __be32 drc_count; + __be32 indexed_count[2]; char drc_name[1]; } _drc_u; }; @@ -322,6 +323,7 @@ struct pseries_hp_errorlog { #define PSERIES_HP_ELOG_ID_DRC_NAME 1 #define PSERIES_HP_ELOG_ID_DRC_INDEX 2 #define PSERIES_HP_ELOG_ID_DRC_COUNT 3 +#define PSERIES_HP_ELOG_ID_DRC_IC 4 struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log, uint16_t section_id); diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index ec5d677..2e01b1d 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -354,11 +354,17 @@ static int 
handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog) switch (hp_elog->id_type) { case PSERIES_HP_ELOG_ID_DRC_COUNT: hp_elog->_drc_u.drc_count = - be32_to_cpu(hp_elog->_drc_u.drc_count); + be32_to_cpu(hp_elog->_drc_u.drc_count); break; case PSERIES_HP_ELOG_ID_DRC_INDEX: hp_elog->_drc_u.drc_index = - be32_to_cpu(hp_elog->_drc_u.drc_index); + be32_to_cpu(hp_elog->_drc_u.drc_index); + break; + case PSERIES_HP_ELOG_ID_DRC_IC: + hp_elog->_drc_u.indexed_count[0] = + be32_to_cpu(hp_elog->_drc_u.indexed_count[0]); + hp_elog->_drc_u.indexed_count[1] = + be32_to_cpu(hp_elog->_drc_u.indexed_count[1]); } switch (hp_elog->resource) { @@ -467,7 +473,33 @@ static int dlpar_parse_id_type(char **cmd, struct pseries_hp_errorlog *hp_elog) if (!arg) return -EINVAL; - if (sysfs_streq(arg, "index")) { + if (sysfs_streq(arg, "indexed-count")) { + hp_elog->id_type = PSERIES_HP_ELOG_ID_DRC_IC; + arg = strsep(cmd, " "); + if (!arg) { + pr_err("No DRC count specified.\n"); + return -EINVAL; + } + + if (kstrtou32(arg, 0, &count)) { + pr_err("Invalid DRC count specified.\n"); + return -EINVAL; + } + + arg = strsep(cmd, " "); + if (!arg) { + pr_err("No DRC Index specified.\n"); + return -EINVAL; + } + + if (kstrtou32(arg, 0, &index)) { + pr_err("Invalid DRC Index specified.\n"); + return -EINVAL; + } + + hp_elog->_drc_u.indexed_count[0] = cpu_to_be32(count); + hp_elog->_drc_u.indexed_count[1] = cpu_to_be32(index); + } else if (sysfs_streq(arg, "index")) { hp_elog->id_type = PSERIES_HP_ELOG_ID_DRC_INDEX; arg = strsep(cmd, " "); if (!arg) { diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 76ec104..badc66d 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -725,6 +725,89 @@ static int dlpar_memory_add_by_index(u32 drc_index, struct property *prop) return rc; } +static int dlpar_
[PATCH v6 3/3] powerpc/pseries: Implement indexed-count hotplug memory remove
From: Sahil Mehta Indexed-count remove for memory hotplug guarantees that a contiguous block of <count> lmbs beginning at a specified <drc index> will be unassigned (NOT that <count> lmbs will be removed). Because of Qemu's per-DIMM memory management, the removal of a contiguous block of memory currently requires a series of individual calls. Indexed-count remove reduces this series into a single call. Signed-off-by: Sahil Mehta Signed-off-by: Nathan Fontenot --- v2: -use u32s drc_index and count instead of u32 ic[] in dlpar_memory v3: -add logic to handle invalid drc_index input v4: -none v5: -Update for() loop to start at start_index v6: -none arch/powerpc/platforms/pseries/hotplug-memory.c | 90 +++ 1 file changed, 90 insertions(+) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index badc66d..19ad081 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -558,6 +558,92 @@ static int dlpar_memory_remove_by_index(u32 drc_index, struct property *prop) return rc; } +static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index, +struct property *prop) +{ + struct of_drconf_cell *lmbs; + u32 num_lmbs, *p; + int i, rc, start_lmb_found; + int lmbs_available = 0, start_index = 0, end_index; + + pr_info("Attempting to hot-remove %u LMB(s) at %x\n", + lmbs_to_remove, drc_index); + + if (lmbs_to_remove == 0) + return -EINVAL; + + p = prop->value; + num_lmbs = *p++; + lmbs = (struct of_drconf_cell *)p; + start_lmb_found = 0; + + /* Navigate to drc_index */ + while (start_index < num_lmbs) { + if (lmbs[start_index].drc_index == drc_index) { + start_lmb_found = 1; + break; + } + + start_index++; + } + + if (!start_lmb_found) + return -EINVAL; + + end_index = start_index + lmbs_to_remove; + + /* Validate that there are enough LMBs to satisfy the request */ + for (i = start_index; i < end_index; i++) { + if (lmbs[i].flags & DRCONF_MEM_RESERVED) + break; + 
lmbs_available++; + } + + if (lmbs_available < lmbs_to_remove) + return -EINVAL; + + for (i = start_index; i < end_index; i++) { + if (!(lmbs[i].flags & DRCONF_MEM_ASSIGNED)) + continue; + + rc = dlpar_remove_lmb(&lmbs[i]); + if (rc) + break; + + lmbs[i].reserved = 1; + } + + if (rc) { + pr_err("Memory indexed-count-remove failed, adding any removed LMBs\n"); + + for (i = start_index; i < end_index; i++) { + if (!lmbs[i].reserved) + continue; + + rc = dlpar_add_lmb(&lmbs[i]); + if (rc) + pr_err("Failed to add LMB, drc index %x\n", + be32_to_cpu(lmbs[i].drc_index)); + + lmbs[i].reserved = 0; + } + rc = -EINVAL; + } else { + for (i = start_index; i < end_index; i++) { + if (!lmbs[i].reserved) + continue; + + pr_info("Memory at %llx (drc index %x) was hot-removed\n", + lmbs[i].base_addr, lmbs[i].drc_index); + + lmbs[i].reserved = 0; + } + } + + return rc; +} + #else static inline int pseries_remove_memblock(unsigned long base, unsigned int memblock_size) @@ -853,6 +939,10 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) } else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) { drc_index = hp_elog->_drc_u.drc_index; rc = dlpar_memory_remove_by_index(drc_index, prop); + } else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_IC) { + count = hp_elog->_drc_u.indexed_count[0]; + drc_index = hp_elog->_drc_u.indexed_count[1]; + rc = dlpar_memory_remove_by_ic(count, drc_index, prop); } else { rc = -EINVAL; }
Re: [PATCH] mm/hugetlb: Use the right pte val for compare in hugetlb_cow
On Tue, 18 Oct 2016 21:12:45 +0530 "Aneesh Kumar K.V" wrote: > We cannot use the pte value used in set_pte_at for pte_same comparison, > because archs like ppc64 filter/add new pte flags in set_pte_at. Instead, > fetch the pte value inside hugetlb_cow. We are comparing the pte value to > make sure the pte didn't change since we dropped the page table lock. > hugetlb_cow gets called with the page table lock held, and we can take a copy > of the pte value before we drop the page table lock. > > With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no > previous mapping (huge_pte_none entries), by forcing a cow in the fault > path. This avoids taking an additional fault to convert a read-only mapping > to read/write. Here we were comparing a recently instantiated pte (via > set_pte_at) to the pte values from the linux page table. As explained above, > on ppc64 such a pte_same check returned the wrong result, resulting in us > taking an additional fault on ppc64. From my reading this is a minor performance improvement and a -stable backport isn't needed. But it is unclear whether the impact warrants a 4.9 merge. Please be careful about describing end-user visible impacts when fixing bugs, thanks.
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, Michael. On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote: > That doesn't compile, wq doesn't exist. > > I guessed that you meant: > > + wq_numa_init(); > + list_for_each_entry(wq, &workqueues, list) > + wq_update_unbound_numa(wq, smp_processor_id(), true); Yeap, sorry about that. > And that does boot. > > The sysrq-t output is below, it's rather large. Didn't expect that many cpus but it looks good. I'll post proper patches soon. Thanks! -- tejun
Re: [PATCH] console: use first console if stdout-path device doesn't appear
On Oct 18 2016, Paul Burton wrote: > If a device tree specified a preferred device for kernel console output > via the stdout-path or linux,stdout-path chosen node properties there's > no guarantee that it will have specified a device for which we have a > driver. It may also be the case that we do have a driver but it doesn't > call of_console_check() to register as a preferred console (eg. offb > driver as used on powermac systems). In these cases try to ensure that > we provide some console output by enabling the first console in the > console_drivers list. > > As I don't have access to an affected system this has only been build > tested - testing would be most appreciated. Unfortunately that doesn't work. The initial console still cannot be opened. Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
Re: powerpc/64: Fix incorrect return value from __copy_tofrom_user
On Tue, 2016-11-10 at 11:25:47 UTC, Paul Mackerras wrote: > Debugging a data corruption issue with virtio-net/vhost-net led to > the observation that __copy_tofrom_user was occasionally returning > a value 16 larger than it should. Since the return value from > __copy_tofrom_user is the number of bytes not copied, this means > that __copy_tofrom_user can occasionally return a value larger > than the number of bytes it was asked to copy. In turn this can > cause higher-level copy functions such as copy_page_to_iter_iovec > to corrupt memory by copying data into the wrong memory locations. > > It turns out that the failing case involves a fault on the store > at label 79, and at that point the first unmodified byte of the > destination is at R3 + 16. Consequently the exception handler > for that store needs to add 16 to R3 before using it to work out > how many bytes were not copied, but in this one case it was not > adding the offset to R3. To fix it, this moves the label 179 to > the point where we add 16 to R3. I have checked manually all the > exception handlers for the loads and stores in this code and the > rest of them are correct (it would be excellent to have an > automated test of all the exception cases). > > This bug has been present since this code was initially > committed in May 2002 to Linux version 2.5.20. > > Cc: sta...@vger.kernel.org > Signed-off-by: Paul Mackerras Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/1a34439e5a0b2235e43f96816dbb15 cheers
Re: powerpc/mm/hash64: Fix might_have_hea() check
On Tue, 2016-11-10 at 10:15:04 UTC, Michael Ellerman wrote: > In commit 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type > to detect HEA special case") we changed the logic in might_have_hea() > to check FW_FEATURE_SPLPAR rather than machine_is(pseries). > > However the check was incorrectly negated, leading to crashes on > machines with HEA adapters, such as: > > mm: Hashing failure ! EA=0xd80080004040 access=0x800e > current=NetworkManager > trap=0x300 vsid=0x13d349c ssize=1 base psize=2 psize 2 > pte=0xc0003cc033e701ae > Unable to handle kernel paging request for data at address > 0xd80080004040 > Call Trace: > .ehea_create_cq+0x148/0x340 [ehea] (unreliable) > .ehea_up+0x258/0x1200 [ehea] > .ehea_open+0x44/0x1a0 [ehea] > ... > > Fix it by removing the negation. > > Fixes: 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type to > detect HEA special case") > Cc: sta...@vger.kernel.org # v4.8+ > Reported-by: Denis Kirjanov > Reported-by: Jan Stancek > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/08bf75ba852ef8304a84b6a030466b cheers
Re: powerpc/pseries: Fix stack corruption in hpte code
On Thu, 2016-06-10 at 13:33:21 UTC, Laurent Dufour wrote: > This commit fixes a stack corruption in the pseries specific code dealing > with the huge pages. > > In __pSeries_lpar_hugepage_invalidate() the buffer used to pass arguments > to the hypervisor is not large enough. This leads to a stack corruption > where a previously saved register could be corrupted leading to unexpected > result in the caller, like the following panic: > > Oops: Kernel access of bad area, sig: 11 [#1] > SMP NR_CPUS=2048 NUMA pSeries > Modules linked in: virtio_balloon ip_tables x_tables autofs4 > virtio_blk 8139too virtio_pci virtio_ring 8139cp virtio > CPU: 11 PID: 1916 Comm: mmstress Not tainted 4.8.0 #76 > task: c5394880 task.stack: c557 > NIP: c027bf6c LR: c027bf64 CTR: > REGS: c5573820 TRAP: 0300 Not tainted (4.8.0) > MSR: 80009033 CR: 84822884 XER: > 2000 > CFAR: c010a924 DAR: 4214e5e0 DSISR: 4000 SOFTE: 1 > GPR00: c027bf64 c5573aa0 c0e02800 c4447964 > GPR04: c404de18 c4d38810 042100f5 f5002104 > GPR08: e000f5002104 0001 042100f500e0 042100f5 > GPR12: 2200 cfe02c00 c404de18 > GPR16: c1ffe7ff 3fff6200 4214e5e0 3fff6300 > GPR20: 0008 c000f7014800 0405e6e0 0001 > GPR24: c4d38810 c4447c10 c404de18 c4447964 > GPR28: c5573b10 c4d38810 3fff6200 4214e5e0 > NIP [c027bf6c] zap_huge_pmd+0x4c/0x470 > LR [c027bf64] zap_huge_pmd+0x44/0x470 > Call Trace: > [c5573aa0] [c027bf64] zap_huge_pmd+0x44/0x470 (unreliable) > [c5573af0] [c022bbd8] unmap_page_range+0xcf8/0xed0 > [c5573c30] [c022c2d4] unmap_vmas+0x84/0x120 > [c5573c80] [c0235448] unmap_region+0xd8/0x1b0 > [c5573d80] [c02378f0] do_munmap+0x2d0/0x4c0 > [c5573df0] [c0237be4] SyS_munmap+0x64/0xb0 > [c5573e30] [c0009560] system_call+0x38/0x108 > Instruction dump: > fbe1fff8 fb81ffe0 7c7f1b78 7ca32b78 7cbd2b78 f8010010 7c9a2378 f821ffb1 > 7cde3378 4bfffea9 7c7b1b79 41820298 48000130 7fa5eb78 7fc4f378 > > Most of the time, the bug is surfacing in a caller up in the stack from > __pSeries_lpar_hugepage_invalidate() which is quite confusing. 
> > This bug has been present since v3.11 but was hidden if a caller of the > caller of __pSeries_lpar_hugepage_invalidate() has pushed the corrupted > register (r18 in this case) onto the stack and is not using it until > restoring it. GCC 6.2.0 seems to raise it more frequently. > > This commit also changes the definition of the parameter buffer in > pSeries_lpar_flush_hash_range() to rely on the global define > PLPAR_HCALL9_BUFSIZE (no functional change here). > > Fixes: 1a5272866f87 ("powerpc: Optimize hugepage invalidate") > Cc: > Cc: Aneesh Kumar K.V > Signed-off-by: Laurent Dufour > Reviewed-by: Aneesh Kumar K.V > Acked-by: Balbir Singh Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/05af40e885955065aee8bb7425058e cheers
Re: MAINTAINERS: Drop separate pseries entry
On Wed, 2016-05-10 at 05:04:00 UTC, Michael Ellerman wrote: > Paul is no longer acting as a separate maintainer for pseries, it is > handled along with the rest of powerpc. The URL no longer links anywhere > meaningful, so drop it also. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/55099115ca523b5edf5c542bd5c328 cheers
Re: ppc64 qemu test failure since commit f9aa67142 ("powerpc/64s: Consolidate Alignment 0x600 interrupt")
On Tue, 2016-11-10 at 07:47:56 UTC, Nicholas Piggin wrote: > On Mon, 10 Oct 2016 07:15:11 -0700 > Guenter Roeck wrote: > > > On 10/09/2016 10:49 PM, Nicholas Piggin wrote: > > > On Sun, 9 Oct 2016 08:21:21 -0700 > > > Guenter Roeck wrote: > > > > > >> Nicholas, > > >> > > >> some of my qemu tests for ppc64 started failing on mainline (and -next). > > >> You can find a test log at > > >> http://kerneltests.org/builders/qemu-ppc64-master/builds/580/steps/qemubuildcommand/logs/stdio > > >> > > >> The scripts to run the test are available at > > >> https://github.com/groeck/linux-build-test/tree/master/rootfs/ppc64 > > >> > > >> Bisect points to commit f9aa67142ef26 ("powerpc/64s: Consolidate > > >> Alignment 0x600 > > >> interrupt"). Bisect log is attached. > > >> > > >> Since I don't have the means to run the code on a real system, I have no > > >> idea > > >> if the problem is caused by qemu or by the code. It is interesting, > > >> though, that > > >> only the 'mac99' tests are affected. > > >> > > >> Please let me know if there is anything I can do to help tracking down > > >> the > > >> problem. > > > > > > Thanks for this. That patch just moves a small amount of code, so it's > > > likely > > > that it's caused something to get placed out of range of its caller, or > > > the > > > linker started generating a stub for some reason. I can't immediately see > > > the > > > problem, but it could be specific to your exact toolchain. > > > > > > Something that might help, would you be able to put the compiled vmlinux > > > binaries > > > from before/after the bad patch somewhere I can grab them? > > > > > > > http://server.roeck-us.net/qemu/ppc64/mac99/ > > > > 'bad' is at f9aa67142ef26, 'good' is one commit earlier, 'tot' is from top > > of tree > > (b66484cd7470, more specifically). 
> > > > Key difference in System.map, from the bad case: > > > > c0005c00 T __end_interrupts > > c0007000 t end_virt_trampolines > > c0008000 t 0010.long_branch.power4_fixup_nap+0 > > c0008100 t fs_label > > c0008100 t start_text > > > > 0010.long_branch.power4_fixup_nap+0 does not exist in the good case, > > and fs_label/start_text are at c0008000. > > > > Toolchain is from poky 1.5.1, which uses gcc 4.8.1 and binutils 2.23.2. > > I also tried with the toolchain from poky 1.6, using gcc 4.8.2 and binutils > > 2.24, > > with the same result. > > Thank you for the quick response, this points to the exact problem. > I've attached a patch which should fix the bug. There are some checks > I've got planned that will catch this type of thing at build time and > be much easier to track down. > > Thanks, > Nick > > From: Nicholas Piggin > Date: Tue, 11 Oct 2016 18:33:26 +1100 > Subject: [PATCH] powerpc/64s: fix power4_fixup_nap placement > > power4_fixup_nap is called from the "common" handlers, not the virt/real > handlers, therefore it should itself be a common handler. Placing it > down in the trampoline space caused it to go out of reach of its > callers, requiring a trampoline inserted at the start of the text > section, which breaks the fixed section address calculations. > > Reported-by: Guenter Roeck > Signed-off-by: Nicholas Piggin Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/7c8cb4b50f3cc6f4a8f7bfddad6fb5 cheers
Re: MAINTAINERS: Update powerpc website & add selftests
On Wed, 2016-05-10 at 05:08:16 UTC, Michael Ellerman wrote: > The selftests under tools/testing/selftests/powerpc are maintained by > us, so add a file pattern for them. > > Also drop the www.penguinppc.org link, it's not dead, but the site is > dead (database error). Instead link to the wiki attached to our github, > there is some info there which may be useful, which is better than none. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/ad654f25a12d1cbac890d33b7bcc87 cheers
Re: selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes
On Wed, 2016-05-10 at 05:22:57 UTC, Michael Ellerman wrote: > The changes to make EXPORT_SYMBOL work in asm, specifically commit > 9445aa1a3062 ("ppc: move exports to definitions"), in the kbuild tree, > breaks some of our selftests. > > That is because we symlink the kernel code into the selftest, and shim > the required headers, and we are now missing asm/export.h > > So create a minimal export.h to keep the tests building once powerpc and > the kbuild trees are merged. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/8321564a11bbeadffcc7d6335bcf3c cheers
RE: [v12, 0/8] Fix eSDHC host version register bug
> -Original Message- > From: Ulf Hansson [mailto:ulf.hans...@linaro.org] > Sent: Tuesday, October 18, 2016 6:48 PM > To: Y.B. Lu > Cc: linux-mmc; Scott Wood; Arnd Bergmann; linuxppc-dev@lists.ozlabs.org; > devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux-clk; linux-...@vger.kernel.org; > io...@lists.linux-foundation.org; net...@vger.kernel.org; Mark Rutland; > Rob Herring; Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil; > Bhupesh Sharma; Qiang Zhao; Kumar Gala; Santosh Shilimkar; Leo Li; X.B. > Xie; M.H. Lian > Subject: Re: [v12, 0/8] Fix eSDHC host version register bug > > On 21 September 2016 at 08:57, Yangbo Lu wrote: > > This patchset is used to fix a host version register bug in the > > T4240-R1.0-R2.0 eSDHC controller. To match the SoC version and > > revision, 10 previous version patchsets had tried many methods but all > of them were rejected by reviewers. > > Such as > > - dts compatible method > > - syscon method > > - ifdef PPC method > > - GUTS driver getting SVR method Anrd suggested a > > soc_device_match method in v10, and this is the only available method > > left now. This v11 patchset introduces the soc_device_match interface > > in soc driver. > > > > The first six patches of Yangbo are to add the GUTS driver. This is > > used to register a soc device which contain soc version and revision > information. > > The other two patches introduce the soc_device_match method in soc > > driver and apply it on esdhc driver to fix this bug. 
> > > > Arnd Bergmann (1): > > base: soc: introduce soc_device_match() interface > > > > Yangbo Lu (7): > > dt: bindings: update Freescale DCFG compatible > > ARM64: dts: ls2080a: add device configuration node > > dt: bindings: move guts devicetree doc out of powerpc directory > > powerpc/fsl: move mpc85xx.h to include/linux/fsl > > soc: fsl: add GUTS driver for QorIQ platforms > > MAINTAINERS: add entry for Freescale SoC drivers > > mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0 > > > > Documentation/devicetree/bindings/arm/fsl.txt | 6 +- > > .../bindings/{powerpc => soc}/fsl/guts.txt | 3 + > > MAINTAINERS| 11 +- > > arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi | 6 + > > arch/powerpc/kernel/cpu_setup_fsl_booke.S | 2 +- > > arch/powerpc/sysdev/fsl_pci.c | 2 +- > > drivers/base/Kconfig | 1 + > > drivers/base/soc.c | 66 ++ > > drivers/clk/clk-qoriq.c| 3 +- > > drivers/i2c/busses/i2c-mpc.c | 2 +- > > drivers/iommu/fsl_pamu.c | 3 +- > > drivers/mmc/host/Kconfig | 1 + > > drivers/mmc/host/sdhci-of-esdhc.c | 20 ++ > > drivers/net/ethernet/freescale/gianfar.c | 2 +- > > drivers/soc/Kconfig| 2 +- > > drivers/soc/fsl/Kconfig| 19 ++ > > drivers/soc/fsl/Makefile | 1 + > > drivers/soc/fsl/guts.c | 257 > + > > include/linux/fsl/guts.h | 125 ++ > > .../asm/mpc85xx.h => include/linux/fsl/svr.h | 4 +- > > include/linux/sys_soc.h| 3 + > > 21 files changed, 478 insertions(+), 61 deletions(-) rename > > Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%) > > create mode 100644 drivers/soc/fsl/Kconfig create mode 100644 > > drivers/soc/fsl/guts.c rename arch/powerpc/include/asm/mpc85xx.h => > > include/linux/fsl/svr.h (97%) > > > > -- > > 2.1.0.27.g96db324 > > > > This looks good to me! I am not sure which tree you want this to be > picked up through, but unless no other volunteers I can take it through > my mmc tree. 
> > Although, before considering to apply, I need an ack from Scott/Arnd for > the guts driver in patch 5/8 and I need an ack from Greg for patch 7/8, > where the soc_device_match() interface is added (seems like you didn't > add him on cc/to). > [Lu Yangbo-B47093] Thanks a lot for clarifying, Uffe. This patchset was based on the mmc tree and needs you to pick it up. But I think it needs to be rebased now, since I saw the qbman driver is in drivers/soc/fsl/ now. I will do that after collecting others' ACKs or comments. Hi Scott and Arnd, could I get your ACKs if there are no other changes needed? Thanks a lot. > Kind regards > Uffe
RE: [v12, 0/8] Fix eSDHC host version register bug
+ Greg Hi Greg, I submitted this patchset for a MMC bug fix, and introduce the below patch which needs your ACK. > > Arnd Bergmann (1): > > base: soc: introduce soc_device_match() interface https://patchwork.kernel.org/patch/9342913/ Could you help to review it and give some comments or ACK. Thank you very much. Best regards, Yangbo Lu > -Original Message- > From: Ulf Hansson [mailto:ulf.hans...@linaro.org] > Sent: Tuesday, October 18, 2016 6:48 PM > To: Y.B. Lu > Cc: linux-mmc; Scott Wood; Arnd Bergmann; linuxppc-dev@lists.ozlabs.org; > devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux-clk; linux-...@vger.kernel.org; > io...@lists.linux-foundation.org; net...@vger.kernel.org; Mark Rutland; > Rob Herring; Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil; > Bhupesh Sharma; Qiang Zhao; Kumar Gala; Santosh Shilimkar; Leo Li; X.B. > Xie; M.H. Lian > Subject: Re: [v12, 0/8] Fix eSDHC host version register bug > > On 21 September 2016 at 08:57, Yangbo Lu wrote: > > This patchset is used to fix a host version register bug in the > > T4240-R1.0-R2.0 eSDHC controller. To match the SoC version and > > revision, 10 previous version patchsets had tried many methods but all > of them were rejected by reviewers. > > Such as > > - dts compatible method > > - syscon method > > - ifdef PPC method > > - GUTS driver getting SVR method Anrd suggested a > > soc_device_match method in v10, and this is the only available method > > left now. This v11 patchset introduces the soc_device_match interface > > in soc driver. > > > > The first six patches of Yangbo are to add the GUTS driver. This is > > used to register a soc device which contain soc version and revision > information. > > The other two patches introduce the soc_device_match method in soc > > driver and apply it on esdhc driver to fix this bug. 
> > > > Arnd Bergmann (1): > > base: soc: introduce soc_device_match() interface > > > > Yangbo Lu (7): > > dt: bindings: update Freescale DCFG compatible > > ARM64: dts: ls2080a: add device configuration node > > dt: bindings: move guts devicetree doc out of powerpc directory > > powerpc/fsl: move mpc85xx.h to include/linux/fsl > > soc: fsl: add GUTS driver for QorIQ platforms > > MAINTAINERS: add entry for Freescale SoC drivers > > mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0 > > > > Documentation/devicetree/bindings/arm/fsl.txt | 6 +- > > .../bindings/{powerpc => soc}/fsl/guts.txt | 3 + > > MAINTAINERS| 11 +- > > arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi | 6 + > > arch/powerpc/kernel/cpu_setup_fsl_booke.S | 2 +- > > arch/powerpc/sysdev/fsl_pci.c | 2 +- > > drivers/base/Kconfig | 1 + > > drivers/base/soc.c | 66 ++ > > drivers/clk/clk-qoriq.c| 3 +- > > drivers/i2c/busses/i2c-mpc.c | 2 +- > > drivers/iommu/fsl_pamu.c | 3 +- > > drivers/mmc/host/Kconfig | 1 + > > drivers/mmc/host/sdhci-of-esdhc.c | 20 ++ > > drivers/net/ethernet/freescale/gianfar.c | 2 +- > > drivers/soc/Kconfig| 2 +- > > drivers/soc/fsl/Kconfig| 19 ++ > > drivers/soc/fsl/Makefile | 1 + > > drivers/soc/fsl/guts.c | 257 > + > > include/linux/fsl/guts.h | 125 ++ > > .../asm/mpc85xx.h => include/linux/fsl/svr.h | 4 +- > > include/linux/sys_soc.h| 3 + > > 21 files changed, 478 insertions(+), 61 deletions(-) rename > > Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%) > > create mode 100644 drivers/soc/fsl/Kconfig create mode 100644 > > drivers/soc/fsl/guts.c rename arch/powerpc/include/asm/mpc85xx.h => > > include/linux/fsl/svr.h (97%) > > > > -- > > 2.1.0.27.g96db324 > > > > This looks good to me! I am not sure which tree you want this to be > picked up through, but unless no other volunteers I can take it through > my mmc tree. 
> > Although, before considering to apply, I need an ack from Scott/Arnd for > the guts driver in patch 5/8 and I need an ack from Greg for patch 7/8, > where the soc_device_match() interface is added (seems like you didn't > add him on cc/to). > > Kind regards > Uffe
[PATCH 0/7] build updates
Hi, I was hoping to get these posted earlier, but they had a dependency on the kbuild tree which took a while to merge. Hopefully we can get at least the first 4 patches in, which are just sanity checks on the final binary. With this series applied, I'm able to build[*] and boot a 64-bit book3s allyesconfig kernel, a 200MB vmlinux (although today's tree breaks in some random driver initcall in late boot). That's largely a culmination of Stephen Rothwell's work (including the kbuild changes that were merged earlier in 4.9), I just picked up some of his patches so attributions may not be 100% reflective of reality. [*] Building allyesconfig still requires KALLSYMS_EXTRA_PASS=1, which I'm yet to look into. Thanks, Nick Nicholas Piggin (7): powerpc: use the new post-link pass to check relocations powerpc: add arch/powerpc/tools directory powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors powerpc/64: tool to check head sections location sanity powerpc/64: handle linker stubs in low .text code powerpc: switch to using thin archives if COMPILE_TEST is set powerpc/64: allow CONFIG_RELOCATABLE if COMPILE_TEST arch/powerpc/Kconfig | 5 +- arch/powerpc/Makefile | 17 + arch/powerpc/Makefile.postlink | 48 ++ arch/powerpc/include/asm/head-64.h | 20 -- arch/powerpc/kernel/vmlinux.lds.S | 28 ++-- .../gcc-check-mprofile-kernel.sh | 0 arch/powerpc/tools/head_check.sh | 74 ++ arch/powerpc/{ => tools}/relocs_check.sh | 0 arch/powerpc/tools/unrel_branch_check.sh | 56 9 files changed, 203 insertions(+), 45 deletions(-) create mode 100644 arch/powerpc/Makefile.postlink rename arch/powerpc/{scripts => tools}/gcc-check-mprofile-kernel.sh (100%) create mode 100755 arch/powerpc/tools/head_check.sh rename arch/powerpc/{ => tools}/relocs_check.sh (100%) create mode 100755 arch/powerpc/tools/unrel_branch_check.sh -- 2.9.3
[PATCH 1/7] powerpc: use the new post-link pass to check relocations
Currently powerpc has to introduce a dependency on its default build target zImage in order to run a relocation check pass over the linked vmlinux. This is deficient because the check is not run if the plain vmlinux target is built, or if one of the other boot targets is built. Switch to using the kbuild post-link pass in order to run this check. Signed-off-by: Nicholas Piggin --- arch/powerpc/Makefile | 11 --- arch/powerpc/Makefile.postlink | 34 ++ 2 files changed, 34 insertions(+), 11 deletions(-) create mode 100644 arch/powerpc/Makefile.postlink diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 617dece..314dd77 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -263,17 +263,6 @@ PHONY += $(BOOT_TARGETS1) $(BOOT_TARGETS2) boot := arch/$(ARCH)/boot -ifeq ($(CONFIG_RELOCATABLE),y) -quiet_cmd_relocs_check = CALL$< - cmd_relocs_check = $(CONFIG_SHELL) $< "$(OBJDUMP)" "$(obj)/vmlinux" - -PHONY += relocs_check -relocs_check: arch/powerpc/relocs_check.sh vmlinux - $(call cmd,relocs_check) - -zImage: relocs_check -endif - $(BOOT_TARGETS1): vmlinux $(Q)$(MAKE) ARCH=ppc64 $(build)=$(boot) $(patsubst %,$(boot)/%,$@) $(BOOT_TARGETS2): vmlinux diff --git a/arch/powerpc/Makefile.postlink b/arch/powerpc/Makefile.postlink new file mode 100644 index 000..f90bdc0 --- /dev/null +++ b/arch/powerpc/Makefile.postlink @@ -0,0 +1,34 @@ +# === +# Post-link powerpc pass +# === +# +# 1. Check that vmlinux relocations look sane + +PHONY := __archpost +__archpost: + +include include/config/auto.conf +include scripts/Kbuild.include + +quiet_cmd_relocs_check = CHKREL $@ + cmd_relocs_check = $(CONFIG_SHELL) $(srctree)/arch/powerpc/relocs_check.sh "$(OBJDUMP)" "$@" + +# `@true` prevents complaint when there is nothing to be done + +vmlinux: FORCE + @true +ifeq ($(CONFIG_RELOCATABLE),y) + $(call if_changed,relocs_check) +endif + +%.ko: FORCE + @true + +clean: + @true + +PHONY += FORCE clean + +FORCE: + +.PHONY: $(PHONY) -- 2.9.3
[PATCH 2/7] powerpc: add arch/powerpc/tools directory
Move a couple of existing scripts there, and remove scripts directory: a script is a tool, a tool is not a script. I plan to add some tools that require compilation (though not in this series), so this matches other architectures more closely. Signed-off-by: Nick Piggin --- arch/powerpc/Makefile| 2 +- arch/powerpc/Makefile.postlink | 2 +- arch/powerpc/{scripts => tools}/gcc-check-mprofile-kernel.sh | 0 arch/powerpc/{ => tools}/relocs_check.sh | 0 4 files changed, 2 insertions(+), 2 deletions(-) rename arch/powerpc/{scripts => tools}/gcc-check-mprofile-kernel.sh (100%) rename arch/powerpc/{ => tools}/relocs_check.sh (100%) diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 314dd77..0fcb47c 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -126,7 +126,7 @@ CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=powerpc64 endif ifdef CONFIG_MPROFILE_KERNEL -ifeq ($(shell $(srctree)/arch/powerpc/scripts/gcc-check-mprofile-kernel.sh $(CC) -I$(srctree)/include -D__KERNEL__),OK) +ifeq ($(shell $(srctree)/arch/powerpc/tools/gcc-check-mprofile-kernel.sh $(CC) -I$(srctree)/include -D__KERNEL__),OK) CC_FLAGS_FTRACE := -pg -mprofile-kernel KBUILD_CPPFLAGS += -DCC_USING_MPROFILE_KERNEL else diff --git a/arch/powerpc/Makefile.postlink b/arch/powerpc/Makefile.postlink index f90bdc0..29c0a3a 100644 --- a/arch/powerpc/Makefile.postlink +++ b/arch/powerpc/Makefile.postlink @@ -11,7 +11,7 @@ include include/config/auto.conf include scripts/Kbuild.include quiet_cmd_relocs_check = CHKREL $@ - cmd_relocs_check = $(CONFIG_SHELL) $(srctree)/arch/powerpc/relocs_check.sh "$(OBJDUMP)" "$@" + cmd_relocs_check = $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/relocs_check.sh "$(OBJDUMP)" "$@" # `@true` prevents complaint when there is nothing to be done diff --git a/arch/powerpc/scripts/gcc-check-mprofile-kernel.sh b/arch/powerpc/tools/gcc-check-mprofile-kernel.sh similarity index 100% rename from arch/powerpc/scripts/gcc-check-mprofile-kernel.sh rename to 
arch/powerpc/tools/gcc-check-mprofile-kernel.sh diff --git a/arch/powerpc/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh similarity index 100% rename from arch/powerpc/relocs_check.sh rename to arch/powerpc/tools/relocs_check.sh -- 2.9.3
[PATCH 3/7] powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors
Direct branches from code below __end_interrupts to code above __end_interrupts when built with CONFIG_RELOCATABLE are disallowed because they will break when the kernel is not located at 0. Sample output: WARNING: Unrelocated relative branches c118 bl-> 0xc0038fb8 c13c b-> 0xc01068a4 c148 b-> 0xc003919c c14c b-> 0xc003923c c5a4 b-> 0xc0106ffc c0001af0 b-> 0xc0106ffc c0001b24 b-> 0xc0106ffc c0001b58 b-> 0xc0106ffc Signed-off-by: Nicholas Piggin --- arch/powerpc/Makefile.postlink | 9 - arch/powerpc/tools/unrel_branch_check.sh | 56 2 files changed, 64 insertions(+), 1 deletion(-) create mode 100755 arch/powerpc/tools/unrel_branch_check.sh diff --git a/arch/powerpc/Makefile.postlink b/arch/powerpc/Makefile.postlink index 29c0a3a..1725e64 100644 --- a/arch/powerpc/Makefile.postlink +++ b/arch/powerpc/Makefile.postlink @@ -11,7 +11,14 @@ include include/config/auto.conf include scripts/Kbuild.include quiet_cmd_relocs_check = CHKREL $@ - cmd_relocs_check = $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/relocs_check.sh "$(OBJDUMP)" "$@" +ifeq ($(CONFIG_PPC_BOOK3S_64),y) + cmd_relocs_check = \ + $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/relocs_check.sh "$(OBJDUMP)" "$@" ; \ + $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/unrel_branch_check.sh "$(OBJDUMP)" "$@" +else + cmd_relocs_check = \ + $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/relocs_check.sh "$(OBJDUMP)" "$@" +endif # `@true` prevents complaint when there is nothing to be done diff --git a/arch/powerpc/tools/unrel_branch_check.sh b/arch/powerpc/tools/unrel_branch_check.sh new file mode 100755 index 000..524a9e2 --- /dev/null +++ b/arch/powerpc/tools/unrel_branch_check.sh @@ -0,0 +1,56 @@ +#!/bin/bash +# +# Copyright © 2016 IBM Corporation +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License +# as published by the Free Software Foundation; either version +# 2 of the License, or (at your option) any later version. 
+# +# This script checks the relocations of a vmlinux for "suspicious" +# branches from unrelocated code (head_64.S code). + +# Have Kbuild supply the path to objdump so we handle cross compilation. +objdump="$1" +vmlinux="$2" + +#__end_interrupts should be located within the first 64K + +end_intr=0x$( +"$objdump" -R "$vmlinux" -d --start-address=0xc000 \ +--stop-address=0xc001 | +grep '\<__end_interrupts>:' | +awk '{print $1}' +) + +BRANCHES=$( +"$objdump" -R "$vmlinux" -D --start-address=0xc000 \ + --stop-address=${end_intr} | +grep -e "^c[0-9a-f]*:[[:space:]]*\([0-9a-f][0-9a-f][[:space:]]\)\{4\}[[:space:]]*b" | +grep -v '\<__start_initialization_multiplatform>' | +grep -v -e 'b.\?.\?ctr' | +grep -v -e 'b.\?.\?lr' | +sed 's/://' | +awk '{ print $1 ":" $6 ":0x" $7 ":" $8 " "}' +) + +for tuple in $BRANCHES +do + from=`echo $tuple | cut -d':' -f1` + branch=`echo $tuple | cut -d':' -f2` + to=`echo $tuple | cut -d':' -f3 | sed 's/cr[0-7],//'` + sym=`echo $tuple | cut -d':' -f4` + + if (( $to > $end_intr )) + then + if [ -z "$bad_branches" ]; then + echo "WARNING: Unrelocated relative branches" + bad_branches="yes" + fi + echo "$from $branch-> $to $sym" + fi +done + +if [ -z "$bad_branches" ]; then + exit 0 +fi -- 2.9.3
[PATCH 4/7] powerpc/64: tool to check head sections location sanity
Use a tool to check that the "fixed sections" are located where we expect them to be, which catches cases the linker script can't (stubs being added to the start of the .text section), and which ends up being neater. Sample output: ERROR: start_text address is c0008100, should be c0008000 ERROR: see comments in arch/powerpc/tools/head_check.sh Signed-off-by: Nicholas Piggin --- arch/powerpc/Makefile.postlink | 7 arch/powerpc/include/asm/head-64.h | 4 +-- arch/powerpc/kernel/vmlinux.lds.S | 26 ++ arch/powerpc/tools/head_check.sh | 69 ++ 4 files changed, 80 insertions(+), 26 deletions(-) create mode 100755 arch/powerpc/tools/head_check.sh diff --git a/arch/powerpc/Makefile.postlink b/arch/powerpc/Makefile.postlink index 1725e64..b8fe12b 100644 --- a/arch/powerpc/Makefile.postlink +++ b/arch/powerpc/Makefile.postlink @@ -10,6 +10,9 @@ __archpost: include include/config/auto.conf include scripts/Kbuild.include +quiet_cmd_head_check = CHKHEAD $@ + cmd_head_check = $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/head_check.sh "$(NM)" "$@" + quiet_cmd_relocs_check = CHKREL $@ ifeq ($(CONFIG_PPC_BOOK3S_64),y) cmd_relocs_check = \ @@ -24,6 +27,9 @@ endif vmlinux: FORCE @true +ifeq ($(CONFIG_PPC64),y) + $(call cmd,head_check) +endif ifeq ($(CONFIG_RELOCATABLE),y) $(call if_changed,relocs_check) endif @@ -32,6 +38,7 @@ endif @true clean: + rm -f .tmp_symbols.txt @true PHONY += FORCE clean diff --git a/arch/powerpc/include/asm/head-64.h b/arch/powerpc/include/asm/head-64.h index ab90c2f..492ebe7 100644 --- a/arch/powerpc/include/asm/head-64.h +++ b/arch/powerpc/include/asm/head-64.h @@ -49,8 +49,8 @@ * CLOSE_FIXED_SECTION() or elsewhere, there may be something * unexpected being added there. Remove the '. = x_len' line, rebuild, and * check what is pushing the section down. - * - If the build dies in linking, check arch/powerpc/kernel/vmlinux.lds.S - * for instructions. + * - If the build dies in linking, check arch/powerpc/tools/head_check.sh + * comments. 
* - If the kernel crashes or hangs in very early boot, it could be linker * stubs at the start of the main text. */ diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S index 8295f51..7de0b05 100644 --- a/arch/powerpc/kernel/vmlinux.lds.S +++ b/arch/powerpc/kernel/vmlinux.lds.S @@ -58,7 +58,6 @@ SECTIONS #ifdef CONFIG_PPC64 KEEP(*(.head.text.first_256B)); #ifdef CONFIG_PPC_BOOK3E -# define END_FIXED 0x100 #else KEEP(*(.head.text.real_vectors)); *(.head.text.real_trampolines); @@ -66,38 +65,17 @@ SECTIONS *(.head.text.virt_trampolines); # if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) KEEP(*(.head.data.fwnmi_page)); -# define END_FIXED0x8000 -# else -# define END_FIXED0x7000 # endif #endif - ASSERT((. == END_FIXED), "vmlinux.lds.S: fixed section overflow error"); #else /* !CONFIG_PPC64 */ HEAD_TEXT #endif } :kernel - /* -* If the build dies here, it's likely code in head_64.S is referencing -* labels it can't reach, and the linker inserting stubs without the -* assembler's knowledge. To debug, remove the above assert and -* rebuild. Look for branch stubs in the fixed section region. -* -* Linker stub generation could be allowed in "trampoline" -* sections if absolutely necessary, but this would require -* some rework of the fixed sections. Before resorting to this, -* consider references that have sufficient addressing range, -* (e.g., hand coded trampolines) so the linker does not have -* to add stubs. -* -* Linker stubs at the top of the main text section are currently not -* detected, and will result in a crash at boot due to offsets being -* wrong. -*/ .text : AT(ADDR(.text) - LOAD_OFFSET) { - ALIGN_FUNCTION(); /* careful! 
__ftr_alt_* sections need to be close to .text */ - *(.text .fixup __ftr_alt_* .ref.text) + ALIGN_FUNCTION(); + *(.text .fixup __ftr_alt_* .ref.text); SCHED_TEXT CPUIDLE_TEXT LOCK_TEXT diff --git a/arch/powerpc/tools/head_check.sh b/arch/powerpc/tools/head_check.sh new file mode 100755 index 000..9635fe7 --- /dev/null +++ b/arch/powerpc/tools/head_check.sh @@ -0,0 +1,69 @@ +#!/bin/sh + +# Copyright © 2016 IBM Corporation + +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License +# as published by the Free Software Foundation; either version +# 2 of the License, or (at your option) any later version. + +# Thi
[PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set
Enable thin archives build for powerpc when COMPILE_TEST is set. Thin archives are explained in this commit: a5967db9af51a84f5e181600954714a9e4c69f1f This is a gradual way to introduce the option to testers. Some change to the way we invoke ar is required so it can be used by scripts/link-vmlinux.sh. Signed-off-by: Stephen Rothwell Signed-off-by: Nicholas Piggin --- arch/powerpc/Kconfig | 1 + arch/powerpc/Makefile | 4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 65fba4c..00d9e31 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -80,6 +80,7 @@ config ARCH_HAS_DMA_SET_COHERENT_MASK config PPC bool default y + select THIN_ARCHIVES if COMPILE_TEST select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select BINFMT_ELF diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 0fcb47c..fe76cfe 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -23,7 +23,7 @@ CROSS32AR := $(CROSS32_COMPILE)ar ifeq ($(HAS_BIARCH),y) ifeq ($(CROSS32_COMPILE),) CROSS32CC := $(CC) -m32 -CROSS32AR := GNUTARGET=elf32-powerpc $(AR) +KBUILD_ARFLAGS += --target=elf32-powerpc endif endif @@ -85,7 +85,7 @@ ifeq ($(HAS_BIARCH),y) override AS+= -a$(BITS) override LD+= -m elf$(BITS)$(LDEMULATION) override CC+= -m$(BITS) -override AR:= GNUTARGET=elf$(BITS)-$(GNUTARGET) $(AR) +KBUILD_ARFLAGS += --target=elf$(BITS)-$(GNUTARGET) endif LDFLAGS_vmlinux-y := -Bstatic -- 2.9.3
[PATCH 7/7] powerpc/64: allow CONFIG_RELOCATABLE if COMPILE_TEST
This is cruft to work around allmodconfig build breakage. Signed-off-by: Nicholas Piggin --- arch/powerpc/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 00d9e31..f48d2eb 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -458,7 +458,7 @@ config KEXEC config RELOCATABLE bool "Build a relocatable kernel" - depends on (PPC64 && !COMPILE_TEST) || (FLATMEM && (44x || FSL_BOOKE)) + depends on PPC64 || (FLATMEM && (44x || FSL_BOOKE)) select NONSTATIC_KERNEL help This builds a kernel image that is capable of running at the @@ -482,7 +482,7 @@ config RELOCATABLE config CRASH_DUMP bool "Build a kdump crash kernel" depends on PPC64 || 6xx || FSL_BOOKE || (44x && !SMP) - select RELOCATABLE if (PPC64 && !COMPILE_TEST) || 44x || FSL_BOOKE + select RELOCATABLE if PPC64 || 44x || FSL_BOOKE help Build a kernel suitable for use as a kdump capture kernel. The same kernel binary can be used as production kernel and dump -- 2.9.3
[PATCH 5/7] powerpc/64: handle linker stubs in low .text code
Very large kernels require linker stubs, but the linker tends to place them in ways that make it very difficult to detect programmatically with the assembler when taking absolute real (physical) addresses. This breaks the early boot code. Create a small section just before the .text section with an empty 256 - 4 bytes, and adjust the start of the .text section to match. The linker will tend to put stubs in that section and not break our start-of-.text section label. This is a tiny waste of space on common kernels, but allows large kernels to build and boot, which is convenient enough to outweigh the cost. This is a sad hack, which I will improve on if I can find out how to achieve it in a better way. Until then, it seems to work. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/head-64.h | 16 +--- arch/powerpc/kernel/vmlinux.lds.S | 2 ++ arch/powerpc/tools/head_check.sh | 5 + 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/head-64.h b/arch/powerpc/include/asm/head-64.h index 492ebe7..f7131cf 100644 --- a/arch/powerpc/include/asm/head-64.h +++ b/arch/powerpc/include/asm/head-64.h @@ -63,11 +63,21 @@ . = 0x0;\ start_##sname: +/* + * .linker_stub_catch section is used to catch linker stubs from being + * inserted in our .text section, above the start_text label (which breaks + * the ABS_ADDR calculation). See kernel/vmlinux.lds.S and tools/head_check.sh + * for more details. We would prefer to just keep a cacheline (0x80), but + * 0x100 seems to be how the linker aligns branch stub groups. + */ #define OPEN_TEXT_SECTION(start) \ - text_start = (start); \ + .section ".linker_stub_catch","ax",@progbits; \ +linker_stub_catch: \ + . = 0x4;\ .section ".text","ax",@progbits;\ - . 
= 0x0;\ -start_text: + text_start = (start) + 0x100; \ + .balign 0x100; \ +start_text:\ #define ZERO_FIXED_SECTION(sname, start, end) \ sname##_start = (start);\ diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S index 7de0b05..a09c666 100644 --- a/arch/powerpc/kernel/vmlinux.lds.S +++ b/arch/powerpc/kernel/vmlinux.lds.S @@ -73,6 +73,8 @@ SECTIONS } :kernel .text : AT(ADDR(.text) - LOAD_OFFSET) { + *(.linker_stub_catch); + . = . ; /* careful! __ftr_alt_* sections need to be close to .text */ ALIGN_FUNCTION(); *(.text .fixup __ftr_alt_* .ref.text); diff --git a/arch/powerpc/tools/head_check.sh b/arch/powerpc/tools/head_check.sh index 9635fe7..ae9faeb 100755 --- a/arch/powerpc/tools/head_check.sh +++ b/arch/powerpc/tools/head_check.sh @@ -23,6 +23,11 @@ # etc) in the fixed section region (0 - 0x8000ish). Check what places are # calling those stubs. # +# A ".linker_stub_catch" section is used to catch some stubs generated by +# early .text code, which tend to get placed at the start of the section. +# If there are too many such stubs, they can overflow this section. Expanding +# it may help (or reducing the number of stub branches). +# # Linker stubs use the TOC pointer, so even if fixed section code could # tolerate them being inserted into head code, they can't be allowed in low # level entry code (boot, interrupt vectors, etc) until r2 is set up. This -- 2.9.3
Re: [PATCH 3/7] powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors
On 19/10/16 14:15, Nicholas Piggin wrote: > Direct branches from code below __end_interrupts to code above > __end_interrupts when built with CONFIG_RELOCATABLE are disallowed > because they will break when the kernel is not located at 0. > > Sample output: > > WARNING: Unrelocated relative branches > c118 bl-> 0xc0038fb8 > c13c b-> 0xc01068a4 > c148 b-> 0xc003919c > c14c b-> 0xc003923c > c5a4 b-> 0xc0106ffc > c0001af0 b-> 0xc0106ffc > c0001b24 b-> 0xc0106ffc > c0001b58 b-> 0xc0106ffc > > Signed-off-by: Nicholas Piggin > --- If this is the same script I gave you, you can add my SOB line as well Balbir
Re: [PATCH] mm/hugetlb: Use the right pte val for compare in hugetlb_cow
Andrew Morton writes: > On Tue, 18 Oct 2016 21:12:45 +0530 "Aneesh Kumar K.V" > wrote: > >> We cannot use the pte value used in set_pte_at for pte_same comparison, >> because archs like ppc64 filter/add new pte flags in set_pte_at. Instead >> fetch the pte value inside hugetlb_cow. We are comparing the pte value to >> make sure the pte didn't change since we dropped the page table lock. >> hugetlb_cow gets called with the page table lock held, and we can take a copy >> of the pte value before we drop the page table lock. >> >> With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no >> previous mapping (huge_pte_none entries), by forcing a cow in the fault >> path. This avoids taking an additional fault to convert a read-only mapping >> to read/write. Here we were comparing a recently instantiated pte (via >> set_pte_at) to the pte values from the linux page table. As explained above, >> on ppc64 such a pte_same check returned a wrong result, resulting in us >> taking an additional fault on ppc64. > > From my reading this is a minor performance improvement and a -stable > backport isn't needed. But it is unclear whether the impact warrants a > 4.9 merge. This patch works around the issue reported at https://lkml.kernel.org/r/57ff7bb4.1070...@redhat.com The reason for that OOM was a reserve count accounting issue which happens in the error path of hugetlb_cow. Now this patch avoids us taking the error path, and hence we don't hit the reported OOM. An actual fix for that issue is being worked on by Mike Kravetz. > > Please be careful about describing end-user visible impacts when fixing > bugs, thanks. -aneesh
Re: [PATCH] mm/hugetlb: Use the right pte val for compare in hugetlb_cow
On Tuesday, October 18, 2016 11:43 PM Aneesh Kumar K.V wrote: > > We cannot use the pte value used in set_pte_at for pte_same comparison, > because archs like ppc64 filter/add new pte flags in set_pte_at. Instead > fetch the pte value inside hugetlb_cow. We are comparing the pte value to > make sure the pte didn't change since we dropped the page table lock. > hugetlb_cow gets called with the page table lock held, and we can take a copy > of the pte value before we drop the page table lock. > > With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no > previous mapping (huge_pte_none entries), by forcing a cow in the fault > path. This avoids taking an additional fault to convert a read-only mapping > to read/write. Here we were comparing a recently instantiated pte (via > set_pte_at) to the pte values from the linux page table. As explained above, > on ppc64 such a pte_same check returned a wrong result, resulting in us > taking an additional fault on ppc64. > > Fixes: 6a119eae942c ("powerpc/mm: Add a _PAGE_PTE bit") > > Reported-by: Jan Stancek > Signed-off-by: Aneesh Kumar K.V > --- Acked-by: Hillf Danton > mm/hugetlb.c | 12 +++- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index ec49d9ef1eef..da8fbd02b92e 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3386,15 +3386,17 @@ static void unmap_ref_private(struct mm_struct *mm, > struct vm_area_struct *vma, > * Keep the pte_same checks anyway to make transition from the mutex easier. 
> */ > static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, > - unsigned long address, pte_t *ptep, pte_t pte, > - struct page *pagecache_page, spinlock_t *ptl) > +unsigned long address, pte_t *ptep, > +struct page *pagecache_page, spinlock_t *ptl) > { > + pte_t pte; > struct hstate *h = hstate_vma(vma); > struct page *old_page, *new_page; > int ret = 0, outside_reserve = 0; > unsigned long mmun_start; /* For mmu_notifiers */ > unsigned long mmun_end; /* For mmu_notifiers */ > > + pte = huge_ptep_get(ptep); > old_page = pte_page(pte); > > retry_avoidcopy: > @@ -3668,7 +3670,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct > vm_area_struct *vma, > hugetlb_count_add(pages_per_huge_page(h), mm); > if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { > /* Optimization, do the COW without a second fault */ > - ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl); > + ret = hugetlb_cow(mm, vma, address, ptep, page, ptl); > } > > spin_unlock(ptl); > @@ -3822,8 +3824,8 @@ int hugetlb_fault(struct mm_struct *mm, struct > vm_area_struct *vma, > > if (flags & FAULT_FLAG_WRITE) { > if (!huge_pte_write(entry)) { > - ret = hugetlb_cow(mm, vma, address, ptep, entry, > - pagecache_page, ptl); > + ret = hugetlb_cow(mm, vma, address, ptep, > + pagecache_page, ptl); > goto out_put_page; > } > entry = huge_pte_mkdirty(entry); > -- > 2.10.1
[PATCH v2] powerpc/hash64: Be more careful when generating tlbiel
From: Balbir Singh

In ISA v2.05, the tlbiel instruction takes two arguments, RB and L:

 tlbiel RB,L

 +---------+---------+----+---------+---------+--------+---+
 |   31    |    /    | L  |    /    |   RB    |  274   | / |
 | 31 - 26 | 25 - 22 | 21 | 20 - 16 | 15 - 11 | 10 - 1 | 0 |
 +---------+---------+----+---------+---------+--------+---+

In ISA v2.06 tlbiel takes only one argument, RB:

 tlbiel RB

 +---------+---------+---------+---------+--------+---+
 |   31    |    /    |    /    |   RB    |  274   | / |
 | 31 - 26 | 25 - 21 | 20 - 16 | 15 - 11 | 10 - 1 | 0 |
 +---------+---------+---------+---------+--------+---+

And in ISA v3.00 tlbiel takes five arguments:

 tlbiel RB,RS,RIC,PRS,R

 +---------+---------+----+---------+-----+----+---------+--------+---+
 |   31    |   RS    | /  |   RIC   | PRS | R  |   RB    |  274   | / |
 | 31 - 26 | 25 - 21 | 20 | 19 - 18 | 17  | 16 | 15 - 11 | 10 - 1 | 0 |
 +---------+---------+----+---------+-----+----+---------+--------+---+

However the assembler also accepts "tlbiel RB", and generates "tlbiel RB,r0,0,0,0".

As you can see above, the L field of the v2.05 encoding overlaps with the reserved field of the v2.06 encoding, and with the low bit of the RS field of the v3.00 encoding.

Currently in __tlbiel() we generate two tlbiel instructions manually using hex constants. In the first case, for MMU_PAGE_4K, we generate "tlbiel RB,0", which is safe in all cases, because the L bit is zero. However in the default case we generate "tlbiel RB,1", thereby setting bit 21 to 1.

This is not an actual bug on v2.06 processors, because the CPU ignores the value of the reserved field. However, software is supposed to encode reserved fields as zero to enable forward compatibility.

On v3.00 processors, setting bit 21 to 1 and no other bits of RS means we are using r1 for the value of RS. Although it's not obvious, the code sets the IS field (bits 10-11) to 0 (by omission), and L=1, in the va value, which is passed as RB. We also pass R=0 in the instruction. The combination of IS=0, L=1 and R=0 means the value of RS is not used, so even on ISA v3.00 there is no actual bug.

We should still fix it, as setting a reserved bit on v2.06 is naughty, and we are only avoiding a bug on v3.00 by accident rather than design.
Use ASM_FTR_IFSET() to generate the single argument form on ISA v2.06 and later, and the two argument form on pre v2.06. Although there may be very old toolchains which don't understand tlbiel, we have other code in the tree which has been using tlbiel for over five years, and no one has reported any build failures, so just let the assembler generate the instructions. Signed-off-by: Balbir Singh [mpe: Rewrite change log, use IFSET instead of IFCLR] Signed-off-by: Michael Ellerman --- arch/powerpc/mm/hash_native_64.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) v2: Rewrite change log, use IFSET instead of IFCLR. diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c index 83ddc0e171b0..9d9b3eff123e 100644 --- a/arch/powerpc/mm/hash_native_64.c +++ b/arch/powerpc/mm/hash_native_64.c @@ -123,8 +123,9 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize) va |= ssize << 8; sllp = get_sllp_encoding(apsize); va |= sllp << 5; - asm volatile(".long 0x7c000224 | (%0 << 11) | (0 << 21)" -: : "r"(va) : "memory"); + asm volatile(ASM_FTR_IFSET("tlbiel %0", "tlbiel %0,0", %1) +: : "r" (va), "i" (CPU_FTR_ARCH_206) +: "memory"); break; default: /* We need 14 to 14 + i bits of va */ @@ -141,8 +142,9 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize) */ va |= (vpn & 0xfe); va |= 1; /* L */ - asm volatile(".long 0x7c000224 | (%0 << 11) | (1 << 21)" -: : "r"(va) : "memory"); + asm volatile(ASM_FTR_IFSET("tlbiel %0", "tlbiel %0,1", %1) +: : "r" (va), "i" (CPU_FTR_ARCH_206) +: "memory"); break; } -- 2.7.4
Re: [PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set
Hi Nick, On Wed, 19 Oct 2016 14:15:59 +1100 Nicholas Piggin wrote: > > Enable thin archives build for powerpc when COMPILE_TEST is set. > Thin archives are explained in this commit: > > a5967db9af51a84f5e181600954714a9e4c69f1f This reference should be like this: a5967db9af51 ("kbuild: allow architectures to use thin archives instead of ld -r") -- Cheers, Stephen Rothwell
Re: [PATCH v2] powernv: Simplify searching for compatible device nodes
On Wed, 2016-08-10 at 19:32 -0500, Jack Miller wrote:
> This condenses the opal node searching into a single function that finds
> all compatible nodes, instead of just searching the ibm,opal children,
> for ipmi, flash, and prd similar to how opal-i2c nodes are found.

Hi Michael,

It seems this patch hasn't made it in. Without it Linux cannot find the
flash on p9 as skiboot has moved the node to a more appropriate place.

Thanks,
Cyril

Cc: sta...@vger.kernel.org # v4.6+

> Signed-off-by: Jack Miller
> ---
>  arch/powerpc/platforms/powernv/opal.c | 24 +++-----------
>  1 file changed, 7 insertions(+), 17 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 8b4fc68..9db12ce 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -631,21 +631,11 @@ static void __init opal_dump_region_init(void)
>  			"rc = %d\n", rc);
>  }
>
> -static void opal_pdev_init(struct device_node *opal_node,
> -			   const char *compatible)
> +static void opal_pdev_init(const char *compatible)
>  {
>  	struct device_node *np;
>
> -	for_each_child_of_node(opal_node, np)
> -		if (of_device_is_compatible(np, compatible))
> -			of_platform_device_create(np, NULL, NULL);
> -}
> -
> -static void opal_i2c_create_devs(void)
> -{
> -	struct device_node *np;
> -
> -	for_each_compatible_node(np, NULL, "ibm,opal-i2c")
> +	for_each_compatible_node(np, NULL, compatible)
>  		of_platform_device_create(np, NULL, NULL);
>  }
>
> @@ -717,7 +707,7 @@ static int __init opal_init(void)
>  	opal_hmi_handler_init();
>
>  	/* Create i2c platform devices */
> -	opal_i2c_create_devs();
> +	opal_pdev_init("ibm,opal-i2c");
>
>  	/* Setup a heatbeat thread if requested by OPAL */
>  	opal_init_heartbeat();
>
> @@ -752,12 +742,12 @@ static int __init opal_init(void)
>  	}
>
>  	/* Initialize platform devices: IPMI backend, PRD & flash interface */
> -	opal_pdev_init(opal_node, "ibm,opal-ipmi");
> -	opal_pdev_init(opal_node, "ibm,opal-flash");
> -	opal_pdev_init(opal_node, "ibm,opal-prd");
> +	opal_pdev_init("ibm,opal-ipmi");
> +	opal_pdev_init("ibm,opal-flash");
> +	opal_pdev_init("ibm,opal-prd");
>
>  	/* Initialise platform device: oppanel interface */
> -	opal_pdev_init(opal_node, "ibm,opal-oppanel");
> +	opal_pdev_init("ibm,opal-oppanel");
>
>  	/* Initialise OPAL kmsg dumper for flushing console on panic */
>  	opal_kmsg_init();
Re: [PATCH 3/7] powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors
On Wed, 19 Oct 2016 15:28:40 +1100 Balbir Singh wrote:
> On 19/10/16 14:15, Nicholas Piggin wrote:
> > Direct branches from code below __end_interrupts to code above
> > __end_interrupts when built with CONFIG_RELOCATABLE are disallowed
> > because they will break when the kernel is not located at 0.
> >
> > Sample output:
> >
> > WARNING: Unrelocated relative branches
> > c118 bl-> 0xc0038fb8
> > c13c b-> 0xc01068a4
> > c148 b-> 0xc003919c
> > c14c b-> 0xc003923c
> > c5a4 b-> 0xc0106ffc
> > c0001af0 b-> 0xc0106ffc
> > c0001b24 b-> 0xc0106ffc
> > c0001b58 b-> 0xc0106ffc
> >
> > Signed-off-by: Nicholas Piggin
> > ---
>
> If this is the same script I gave you, you can add my SOB line as well

I can't remember now, but I know you had something... I'm usually careful
to get attributions right. I'll add your SOB.
[PATCH v4 1/5] kernel/sched: introduce vcpu preempted check interface
This patch supports fixing the lock holder preemption issue. Kernel users
can call bool vcpu_is_preempted(int cpu) to detect whether a vCPU is
preempted or not. The default implementation is a macro defined as false,
so the compiler can optimize it out if the arch does not support such a
vCPU preempted check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with various lock holder preemption issues, provide
+ * an interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous
+ * to the native optimistic spin heuristic of testing if the lock owner
+ * task is running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
-- 
2.4.11
[PATCH v4 0/5] implement vcpu preempted check
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and
	mutex_spin_on_owner.
	add more comments
	thanks to Boqun's and Peter's suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
 2.49% sched-messaging [kernel.vmlinux] [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused rcu stalls before we applied
this patch set.

We also have observed some performance improvements.
PPC test result:

 1 copy - 0.94%
 2 copy - 7.17%
 4 copy - 11.9%
 8 copy - 3.04%
16 copy - 15.11%

details below:
Without patch:

 1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples)
 2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples)
 4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples)
 8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks  768000.0 KBps (30.1 s, 1 samples)

With patch:

 1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples)
 2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples)
 4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples)
 8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks  904762.0 KBps (30.0 s, 1 samples)

X86 test result:
test-case                               after-patch       before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Pan Xinhui (5):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, kvm: support vcpu preempted check

 arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
 arch/x86/include/asm/paravirt_types.h |  6 ++++++
 arch/x86/include/asm/spinlock.h       |  8 ++++++++
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c                 | 11 +++++++++++
 arch/x86/kernel/paravirt.c            | 11 +++++++++++
 arch/x86/kvm/x86.c                    | 12 ++++++++++++
 include/linux/sched.h                 | 12 ++++++++++++
 kernel/locking/mutex.c                | 15 +++++++++++++--
 kernel/locking/osq_lock.c             | 10 +++++++++-
 kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
 11 files changed, 105 insertions(+), 7 deletions(-)

-- 
2.4.11
[PATCH v4 2/5] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs has heavy overload in
osq_lock(). This is because vCPU A holds the osq lock and yields out while
vCPU B waits for the per_cpu node->locked to be set. IOW, vCPU B waits
for vCPU A to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a
vCPU is currently running or not. So break the spin loops when it returns
true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
 2.49% sched-messaging [kernel.vmlinux] [k] system_call

after patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 4.12% sched-messaging [kernel.vmlinux] [k] system_call
 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

Suggested-by: Boqun Feng
Signed-off-by: Pan Xinhui
---
 kernel/locking/osq_lock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detect lock holder preemption
+		 * and break. vcpu_is_preempted is a macro defined as false
+		 * if the arch does not support the vcpu preempted check.
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;
 
 		cpu_relax_lowlatency();
-- 
2.4.11
[PATCH v4 3/5] kernel/locking: Drop the overload of {mutex, rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs has heavy overload in
the two spin_on_owner loops. This is caused by the lock holder preemption
issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a
vCPU is currently running or not. So break the spin loops when it returns
true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 4.12% sched-messaging [kernel.vmlinux] [k] system_call
 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

after patch:
 9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 5.28% sched-messaging [unknown]        [H] 0xc00768e0
 4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7
 3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7
 3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
 3.02% sched-messaging [kernel.vmlinux] [k] system_call
 2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task

Signed-off-by: Pan Xinhui
---
 kernel/locking/mutex.c      | 15 +++++++++++++--
 kernel/locking/rwsem-xadd.c | 16 +++++++++++++---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..8927e96 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();
 
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detect lock holder preemption
+		 * and break. vcpu_is_preempted is a macro defined as false
+		 * if the arch does not support the vcpu preempted check.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 
 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * Due to the lock holder preemption issue, we skip spinning both if
+	 * the task is not on a cpu and if its cpu is preempted.
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..ad0b5bb 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}
 
-	ret = owner->on_cpu;
+	/*
+	 * Due to the lock holder preemption issue, we skip spinning both if
+	 * the task is not on a cpu and if its cpu is preempted.
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();
 
-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Abort spinning when need_resched, the owner is not running,
+		 * or the owner's cpu is preempted. vcpu_is_preempted is a
+		 * macro defined as false if the arch does not support the
+		 * vcpu preempted check.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
-- 
2.4.11
[PATCH v4 4/5] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself. The
kernel now has an interface, bool vcpu_is_preempted(int cpu). It takes
a cpu as its parameter and returns true if that cpu is preempted, so the
kernel can break spin loops based on its return value.

Since the kernel uses this interface, let's support it here. Only pSeries
needs it, but powerNV is built into the same kernel image as pSeries, so
we would have to return false when running on powerNV. However,
lppaca->yield_count stays zero on powerNV, so we can simply skip the
machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index abb6b0f..af4285b 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-- 
2.4.11
[PATCH v4 5/5] x86, kvm: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself. The
kernel now has an interface, bool vcpu_is_preempted(int cpu). It takes
a cpu as its parameter and returns true if that cpu is preempted, so the
kernel can break spin loops based on its return value.

Since the kernel uses this interface, let's support it here. We use one
field of struct kvm_steal_time to indicate whether a vcpu is running or
not.

unix benchmark result:
host: kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case                               after-patch       before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui
---
 arch/x86/include/asm/paravirt_types.h |  6 ++++++
 arch/x86/include/asm/spinlock.h       |  8 ++++++++
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c                 | 11 +++++++++++
 arch/x86/kernel/paravirt.c            | 11 +++++++++++
 arch/x86/kvm/x86.c                    | 12 ++++++++++++
 6 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..b1c7937 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -98,6 +98,10 @@ struct pv_time_ops {
 	unsigned long long (*steal_clock)(int cpu);
 };
 
+struct pv_vcpu_ops {
+	bool (*vcpu_is_preempted)(int cpu);
+};
+
 struct pv_cpu_ops {
 	/* hooks for various privileged instructions */
 	unsigned long (*get_debugreg)(int regno);
@@ -318,6 +322,7 @@ struct pv_lock_ops {
 struct paravirt_patch_template {
 	struct pv_init_ops pv_init_ops;
 	struct pv_time_ops pv_time_ops;
+	struct pv_vcpu_ops pv_vcpu_ops;
 	struct pv_cpu_ops pv_cpu_ops;
 	struct pv_irq_ops pv_irq_ops;
 	struct pv_mmu_ops pv_mmu_ops;
@@ -327,6 +332,7 @@ struct paravirt_patch_template {
 extern struct pv_info pv_info;
 extern struct pv_init_ops pv_init_ops;
 extern struct pv_time_ops pv_time_ops;
+extern struct pv_vcpu_ops pv_vcpu_ops;
 extern struct pv_cpu_ops pv_cpu_ops;
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..52fd942 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@ extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_PARAVIRT
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return pv_vcpu_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include 
 
 /*
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..e9c12a1 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,8 @@ struct kvm_steal_time {
 	__u64 steal;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u32 preempted;
+	__u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0011bef 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+	struct kvm_steal_time *src;
+
+	src = &per_cpu(steal_time, cpu);
+
+	return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
 	kvm_guest_cpu_init();
 #endif
 
+	pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+
 	/*
 	 * Hard lockup detection is enabled by default. Disable it, as guests
 	 * can get false positives too easily, for example if the host is
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bbf3d59..7adb7e9 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -122,6 +122,7 @@ static void *get_call_destination(u8 type)
 	struct paravirt_patch_template tmpl = {
Re: [PATCH v4 0/5] implement vcpu preempted check
On 10/19/2016 12:20 PM, Pan Xinhui wrote:
> change from v3:
> 	add x86 vcpu preempted check patch

If you want you could add the s390 patch that I provided for your last
version. I also gave my Acked-by for all previous patches.

> change from v2:
> 	no code change, fix typos, update some comments
> change from v1:
> 	a simpler definition of default vcpu_is_preempted
> 	skip machine type check on ppc, and add config. remove dedicated macro.
> 	add one patch to drop overload of rwsem_spin_on_owner and
> 	mutex_spin_on_owner.
> 	add more comments
> 	thanks to Boqun's and Peter's suggestions.
>
> This patch set aims to fix lock holder preemption issues.
>
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
>
> 18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
> 12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
>  5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
>  3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
>  3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
>  3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
>  2.49% sched-messaging [kernel.vmlinux] [k] system_call
>
> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
> some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused rcu stalls before we applied
> this patch set.
>
> We also have observed some performance improvements.
>
> PPC test result:
>
>  1 copy - 0.94%
>  2 copy - 7.17%
>  4 copy - 11.9%
>  8 copy - 3.04%
> 16 copy - 15.11%
>
> details below:
> Without patch:
>
>  1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples)
>  2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples)
>  4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples)
>  8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks  768000.0 KBps (30.1 s, 1 samples)
>
> With patch:
>
>  1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples)
>  2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples)
>  4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples)
>  8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks  904762.0 KBps (30.0 s, 1 samples)
>
> X86 test result:
> test-case                               after-patch       before-patch
> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
> Process Creation                       |    29881.2 lps  |    28572.8 lps
> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>
> Pan Xinhui (5):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, kvm: support vcpu preempted check
>
>  arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
>  arch/x86/include/asm/paravirt_types.h |  6 ++++++
>  arch/x86/include/asm/spinlock.h       |  8 ++++++++
>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>  arch/x86/kernel/kvm.c                 | 11 +++++++++++
>  arch/x86/kernel/paravirt.c            | 11 +++++++++++
>  arch/x86/kvm/x86.c                    | 12 ++++++++++++
>  include/linux/sched.h                 | 12 ++++++++++++
>  kernel/locking/mutex.c                | 15 +++++++++++++--
>  kernel/locking/osq_lock.c             | 10 +++++++++-
>  kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
>  11 files changed, 105 insertions(+), 7 deletions(-)
>